Let's talk about getting data into Snowflake
Snowflake supports a range of ingestion options for getting data into the platform — from simple batch uploads to real-time streaming. In this guide, we break down the most common approaches and when to use each one based on your data latency, volume, and operational needs.
Batch Ingestion with COPY INTO
Batch ingestion via COPY INTO is the most direct way to load data from cloud storage into Snowflake, especially for infrequent, high-volume jobs.
Use for:
- Daily or nightly data loads from source systems
- One-time historical backfills
How it works:
- Data is exported to a cloud stage (e.g., S3) on a recurring schedule
- Snowflake loads the files when triggered by a task or manually
Example:
COPY INTO my_table
FROM @my_stage/path/
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"');
File size considerations:
- Ideal file size is 100–250 MB compressed (e.g., gzip)
- Too many small files (<10 MB) can lead to performance degradation
- Too few large files can underutilize parallelism
COPY INTO is the most common method for loading bulk data from files in cloud storage (S3, GCS, or Azure Blob).
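Since recurring loads are usually driven by a scheduled task (as noted above), here is a minimal sketch of wiring COPY INTO to a nightly task — the warehouse, task, and schedule names are placeholder assumptions:

```sql
-- Run the COPY INTO load every night at 02:00 UTC
CREATE TASK nightly_load
  WAREHOUSE = load_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO my_table
  FROM @my_stage/path/
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- Tasks are created suspended; resume to activate the schedule
ALTER TASK nightly_load RESUME;
```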
Continuous Ingestion with Snowpipe
Snowpipe is designed for micro-batch ingestion — typically loading files within minutes of landing in your cloud stage.
Use for:
- Hourly or frequent batch jobs where latency matters
- Near real-time pipelines where low complexity and autoscaling are a priority
How it works:
- Uses cloud event notifications (e.g., S3 event notifications delivered via SQS/SNS) to trigger ingestion
- Snowflake automatically parses and loads files
Benefits:
- Serverless and auto-scaling
- Supports auto-ingest and manual trigger modes
Example:
CREATE PIPE my_pipe AUTO_INGEST = TRUE AS
COPY INTO my_table
FROM @my_stage/path
FILE_FORMAT = (TYPE = JSON);
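Once a pipe is running, two commands cover most day-to-day operations — checking its health and manually queuing files that landed before notifications were set up:

```sql
-- Inspect the pipe's execution state and any pending file count
SELECT SYSTEM$PIPE_STATUS('my_pipe');

-- Queue files already present in the stage for ingestion
ALTER PIPE my_pipe REFRESH;
```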
Real-Time Streaming with Snowpipe Streaming
Snowpipe Streaming is designed for low-latency, event-level ingestion without relying on cloud file staging. It uses the Snowflake Streaming API to insert records directly into Snowflake tables from your application or data pipeline.
Use for:
- Sub-second ingestion requirements
- High-volume, append-only workloads
- Event stream processing (e.g., Kafka consumers, Flink, Spark)
How it works:
- A client app or connector uses the Snowpipe Streaming API
- Records are written directly to Snowflake with commit ordering
- Data becomes queryable with low latency, typically within seconds
Benefits:
- True streaming without staging files
- Lower latency than Snowpipe
- Strong ordering guarantees within partitions
Considerations:
- Requires a supported SDK (the Snowpipe Streaming Java SDK) or a connector built on it (e.g., the Snowflake Connector for Kafka)
- Costs are based on the volume of data ingested rather than file count, which often makes it cost-effective at high throughput
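As a hedged illustration of what landed streaming data looks like: when events arrive through the Snowflake Connector for Kafka, target tables typically expose two VARIANT columns (RECORD_METADATA and RECORD_CONTENT) that you query directly — the table name and event fields below are placeholder assumptions:

```sql
-- Unpack Kafka metadata and event payload from the connector's VARIANT columns
SELECT
  record_metadata:topic::STRING   AS topic,
  record_metadata:offset::NUMBER  AS kafka_offset,
  record_content:event_id::STRING AS event_id
FROM my_streamed_table
WHERE record_metadata:topic = 'orders';
```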
Ingestion with Snowflake Native Connectors (OpenFlow)
OpenFlow is Snowflake’s native connector framework designed for high-scale, low-latency ingestion across various sources — including structured, semi-structured, and unstructured data.
Use for:
- Ingesting unstructured and semi-structured files from business systems like Microsoft SharePoint
- Integrating directly with REST APIs, enterprise applications, and third-party services
- Avoiding manual staging or orchestration when pulling from common SaaS sources
Benefits:
- Fully managed by Snowflake
- Native support for document-based formats and unstructured data
- Ideal for operational pipelines with enterprise file systems like SharePoint or OneDrive
Example: Ingesting SharePoint data using an OpenFlow connector configured through Snowflake’s native connector interface (GUI or SQL)
There’s no need to define external tables manually — OpenFlow handles the connection, extraction, and ingestion pipeline internally, including schema inference and update propagation.
OpenFlow continues to evolve but already offers first-class ingestion paths from commonly used tools in the Microsoft ecosystem and beyond.
External Tables for Data Lakes
While external tables don’t technically load data into Snowflake, they are often used as a precursor to ingestion or for staging data before transformation.
Use external tables to read data directly from S3, ADLS, or GCS without loading it into Snowflake storage.
Use for:
- Querying data lakes
- Minimizing storage duplication
- Staging before ingest
Example:
CREATE EXTERNAL TABLE raw_events (
  event_id STRING AS (value:event_id::STRING),
  event_ts TIMESTAMP AS (value:event_ts::TIMESTAMP),
  payload VARIANT AS (value:payload::VARIANT)
)
LOCATION = @my_ext_stage/events/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = PARQUET);
Caveats:
- Queries are generally slower than against native tables, since data stays in external storage
- Performance depends on good data lake layout (file sizing, partitioning)
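Because external tables often serve as a precursor to ingestion, a common pattern is to materialize the external data into a native table once you are ready to take on storage. A minimal sketch, reusing the raw_events table above:

```sql
-- Copy the externally staged data into native Snowflake storage
-- for full query performance
CREATE OR REPLACE TABLE events AS
SELECT event_id, event_ts, payload
FROM raw_events;
```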
Third-Party Connectors and ELT Tools
Third-party tools can simplify ingestion workflows through either batch or streaming pipelines.
Batch Connectors:
These tools schedule regular syncs and handle incremental replication, schema evolution, and source-specific logic.
Streaming Connectors:
- Confluent (Kafka-based)
- Estuary Flow
- Redpanda
Streaming tools support high-frequency, low-latency updates and typically land data into Snowflake via Snowpipe Streaming.
Open Source Alternatives:
- Batch: Meltano, Airbyte, dlt
- Streaming: Kafka, Debezium
These options are more customizable and cost-effective, but require additional setup and monitoring. Self-hosting can be complex to manage without a dedicated team.
Benefits:
- Prebuilt connectors to hundreds of sources
- Handles authentication, pagination, and incremental syncs
Best for:
- SaaS platforms (Salesforce, Stripe, Shopify, etc.)
- Streaming tools (Kafka → Snowflake)
Most of these tools land data in staging schemas, which you can transform further using dynamic tables, streams and tasks, or dbt.
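For the transformation step on top of a staging schema, a dynamic table is often the simplest option. A sketch, assuming a hypothetical staging.orders source and a transform warehouse:

```sql
-- Keep a cleaned view of staged data refreshed within 5 minutes of source changes
CREATE DYNAMIC TABLE clean_orders
  TARGET_LAG = '5 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT order_id, customer_id, amount
  FROM staging.orders
  WHERE amount IS NOT NULL;
```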
Choosing the Right Ingestion Pattern
| Use Case | Recommended Method |
| --- | --- |
| Daily batch files | COPY INTO |
| Microbatch / near real-time | Snowpipe |
| Real-time streaming events | Snowpipe Streaming |
| Direct data lake querying | External Tables |
| SaaS connectors / APIs | Estuary, Fivetran, etc. |
| Unstructured data | OpenFlow |
File Formats for Ingestion
Snowflake supports a variety of structured and semi-structured file types:
Common Formats:
- CSV: Simple but lacks schema flexibility, best with headers and consistent types
- JSON: Great for nested or flexible schemas, works well with VARIANT
- Parquet: Columnar format optimized for performance (especially with external tables); supports schema evolution
- Avro: Schema evolution support, used in Kafka pipelines
Choose formats based on performance, storage size, and compatibility with source systems.
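In practice, you define these as named file formats once and reference them from COPY INTO, pipes, or external tables. A sketch with placeholder names:

```sql
-- Reusable format definitions referenced by name in load statements
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

CREATE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

-- Usage: COPY INTO my_table FROM @my_stage
--        FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```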
Summary
- Use Snowpipe Streaming for event-driven, low-latency pipelines
- Use COPY INTO or Snowpipe for file-based ingestion
- Use OpenFlow for unstructured data
- Use external tables to query S3/ADLS without loading
- Use third-party tools for complex sources or turnkey ELT
What’s Next
In our next blog we'll walk you through how to build continuous data pipelines orchestrated through serverless Snowflake tools and deployed via Terraform.
Need help selecting or implementing the right pattern? Reach out — we help teams design reliable Snowflake pipelines across all ingestion types.