Let's talk about getting data into Snowflake
Snowflake supports a range of ingestion options for getting data into the platform — from simple batch uploads to real-time streaming. In this guide, we break down the most common approaches and when to use each one based on your data latency, volume, and operational needs.
Batch Ingestion with COPY INTO
Batch ingestion via COPY INTO is the most direct way to load data from cloud storage into Snowflake, especially for infrequent, high-volume jobs.
Use for:
- Daily or nightly data loads from source systems
- One-time historical backfills
How it works:
- Data is exported to a cloud stage (e.g., S3) on a recurring schedule
- Snowflake loads the files when triggered by a task or manually
Example:
COPY INTO my_table
FROM @my_stage/path/
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"');
File size considerations:
- Ideal file size is 100–250 MB compressed (e.g., gzip)
- Too many small files (<10 MB) can lead to performance degradation
- Too few large files can underutilize parallelism
COPY INTO is the most common method for loading bulk data from files in cloud storage (S3, GCS, or Azure Blob).
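Since recurring loads are usually driven by a scheduled task (as noted above), here is a minimal sketch of wiring COPY INTO to a nightly task — the warehouse, task, and schedule names are placeholder assumptions:

```sql
-- Run the COPY INTO load every night at 02:00 UTC
CREATE TASK nightly_load
  WAREHOUSE = load_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO my_table
  FROM @my_stage/path/
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- Tasks are created suspended; resume to activate the schedule
ALTER TASK nightly_load RESUME;
```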
Continuous Ingestion with Snowpipe
Snowpipe is designed for micro-batch ingestion — typically loading files within minutes of landing in your cloud stage.
Use for:
- Hourly or frequent batch jobs where latency matters
- Near real-time pipelines where low complexity and autoscaling are a priority
How it works:
- Uses cloud event notifications (e.g., S3 event notifications delivered via SQS/SNS) to trigger ingestion
- Snowflake automatically parses and loads files
Benefits:
- Serverless and auto-scaling
- Supports auto-ingest and manual trigger modes
Example:
CREATE PIPE my_pipe AUTO_INGEST = TRUE AS
COPY INTO my_table
FROM @my_stage/path
FILE_FORMAT = (TYPE = JSON);
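Once a pipe is running, two commands cover most day-to-day operations — checking its health and manually queuing files that landed before notifications were set up:

```sql
-- Inspect the pipe's execution state and any pending file count
SELECT SYSTEM$PIPE_STATUS('my_pipe');

-- Queue files already present in the stage for ingestion
ALTER PIPE my_pipe REFRESH;
```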
Real-Time Streaming with Snowpipe Streaming
Snowpipe Streaming is designed for low-latency, event-level ingestion without relying on cloud file staging. It uses the Snowflake Streaming API to insert records directly into Snowflake tables from your application or data pipeline.
Use for:
- Sub-second ingestion requirements
- High-volume, append-only workloads
- Event stream processing (e.g., Kafka consumers, Flink, Spark)
How it works:
- A client app or connector uses the Snowpipe Streaming API
- Records are written directly to Snowflake with commit ordering
- Data becomes queryable with low latency, typically within seconds
Benefits:
- True streaming without staging files
- Lower latency than Snowpipe
- Strong ordering guarantees within partitions
Considerations:
- Requires a supported SDK (the Snowpipe Streaming Java SDK) or a connector built on it (e.g., the Snowflake Connector for Kafka)
- Costs are based on the volume of data ingested rather than file count, which often makes it cost-effective at high throughput
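As a hedged illustration of what landed streaming data looks like: when events arrive through the Snowflake Connector for Kafka, target tables typically expose two VARIANT columns (RECORD_METADATA and RECORD_CONTENT) that you query directly — the table name and event fields below are placeholder assumptions:

```sql
-- Unpack Kafka metadata and event payload from the connector's VARIANT columns
SELECT
  record_metadata:topic::STRING   AS topic,
  record_metadata:offset::NUMBER  AS kafka_offset,
  record_content:event_id::STRING AS event_id
FROM my_streamed_table
WHERE record_metadata:topic = 'orders';
```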
Ingestion with Snowflake Native Connectors (OpenFlow)
OpenFlow is Snowflake’s native connector framework designed for high-scale, low-latency ingestion across various sources — including structured, semi-structured, and unstructured data.
Use for:
- Ingesting unstructured and semi-structured files from business systems like Microsoft SharePoint
- Integrating directly with REST APIs, enterprise applications, and third-party services
- Avoiding manual staging or orchestration when pulling from common SaaS sources
Benefits:
- Fully managed by Snowflake
- Native support for document-based formats and unstructured data
- Ideal for operational pipelines with enterprise file systems like SharePoint or OneDrive
Example: Ingesting SharePoint data using an OpenFlow connector configured through Snowflake’s native connector interface (GUI or SQL)
There’s no need to define external tables manually — OpenFlow handles the connection, extraction, and ingestion pipeline internally, including schema inference and update propagation.
OpenFlow continues to evolve but already offers first-class ingestion paths from commonly used tools in the Microsoft ecosystem and beyond.
External Tables for Data Lakes
While external tables don’t technically load data into Snowflake, they are often used as a precursor to ingestion or for staging data before transformation.
Use external tables to read data directly from S3, ADLS, or GCS without loading it into Snowflake storage.
Use for:
- Querying data lakes
- Minimizing storage duplication
- Staging before ingest
Example:
CREATE EXTERNAL TABLE raw_events (
  event_id STRING AS (value:event_id::STRING),
  event_ts TIMESTAMP AS (value:event_ts::TIMESTAMP),
  payload VARIANT AS (value:payload::VARIANT)
)
LOCATION = @my_ext_stage/events/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = PARQUET);
Caveats:
- Queries are generally slower than against native tables, since data stays in external storage
- Performance depends on good data lake layout (file sizing, partitioning)
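Because external tables often serve as a precursor to ingestion, a common pattern is to materialize the external data into a native table once you are ready to take on storage. A minimal sketch, reusing the raw_events table above:

```sql
-- Copy the externally staged data into native Snowflake storage
-- for full query performance
CREATE OR REPLACE TABLE events AS
SELECT event_id, event_ts, payload
FROM raw_events;
```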
Third-Party Connectors and ELT Tools
Third-party tools can simplify ingestion workflows through either batch or streaming pipelines.
Batch Connectors:
These tools schedule regular syncs and handle incremental replication, schema evolution, and source-specific logic.
Streaming Connectors:
- Confluent (Kafka-based)
- Estuary Flow
- Redpanda
Streaming tools support high-frequency, low-latency updates and typically land data into Snowflake via Snowpipe Streaming.
Open Source Alternatives:
- Batch: Meltano, Airbyte, dlt
- Streaming: Kafka, Debezium
These options are more customizable and cost-effective, but require additional setup and monitoring. Self-hosting can be complex to manage without a dedicated team.
Benefits:
- Prebuilt connectors to hundreds of sources
- Handles authentication, pagination, and incremental syncs
Best for:
- SaaS platforms (Salesforce, Stripe, Shopify, etc.)
- Streaming tools (Kafka → Snowflake)
Most of these tools land data in staging schemas, which you can transform further using dynamic tables, streams and tasks, or dbt.
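For the transformation step on top of a staging schema, a dynamic table is often the simplest option. A sketch, assuming a hypothetical staging.orders source and a transform warehouse:

```sql
-- Keep a cleaned view of staged data refreshed within 5 minutes of source changes
CREATE DYNAMIC TABLE clean_orders
  TARGET_LAG = '5 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT order_id, customer_id, amount
  FROM staging.orders
  WHERE amount IS NOT NULL;
```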
Choosing the Right Ingestion Pattern
| Use Case | Recommended Method |
| --- | --- |
| Daily batch files | COPY INTO |
| Microbatch / near real-time | Snowpipe |
| Real-time streaming events | Snowpipe Streaming |
| Direct data lake querying | External Tables |
| SaaS connectors / APIs | Estuary, Fivetran, etc. |
| Unstructured data | OpenFlow |
File Formats for Ingestion
Snowflake supports a variety of structured and semi-structured file types:
Common Formats:
- CSV: Simple but lacks schema flexibility, best with headers and consistent types
- JSON: Great for nested or flexible schemas, works well with VARIANT
- Parquet: Columnar format optimized for performance (especially with external tables); supports schema evolution
- Avro: Schema evolution support, used in Kafka pipelines
Choose formats based on performance, storage size, and compatibility with source systems.
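In practice, you define these as named file formats once and reference them from COPY INTO, pipes, or external tables. A sketch with placeholder names:

```sql
-- Reusable format definitions referenced by name in load statements
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

CREATE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

-- Usage: COPY INTO my_table FROM @my_stage
--        FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```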
Summary
- Use Snowpipe Streaming for event-driven, low-latency pipelines
- Use COPY INTO or Snowpipe for file-based ingestion
- Use OpenFlow for unstructured data
- Use external tables to query S3/ADLS without loading
- Use third-party tools for complex sources or turnkey ELT
What’s Next
In our next blog we'll walk you through how to build continuous data pipelines orchestrated through serverless Snowflake tools and deployed via Terraform.
Need help selecting or implementing the right pattern? Reach out — we help teams design reliable Snowflake pipelines across all ingestion types.