
Building a secure cloud-to-on-premises data pipeline and diagnosing the routing failure that only showed up in production
Modern data engineering defaults to cloud-native infrastructure. Managed compute, object storage, serverless pipelines. It’s the natural starting point. But operational data doesn’t always cooperate with that assumption.
Some of the most valuable data for organizations still lives on-premises, maintained by third-party vendors with strict access controls and reachable only through tightly governed network paths. Moving that database isn’t on the table. You work with it where it is.
For one of our sports clients, we built a production ETL pipeline that runs on AWS ECS (Fargate), connects to an on-premises SQL Anywhere database hosted by a third-party data provider, extracts data, and loads it into cloud storage for downstream transformation. Fully private, no public internet exposure, no database migration.
This is how we built it, and what we learned along the way.
Three distinct network environments are involved:
AWS (Production VPC) ECS tasks run inside a private VPC with no direct internet access. Route tables direct traffic destined for on-premises network ranges toward a Virtual Private Gateway (VGW), while everything else flows through a NAT Gateway.
Client’s On-Premises Router A Cisco router on the client’s network terminates the VPN tunnels from AWS and acts as the bridge into the internal network. It holds the static routes that determine how return traffic gets sent back to AWS.
Data Provider Network The SQL Anywhere database lives inside the data provider’s network. It has no custom routing configuration and relies entirely on its default gateway, which points to the client’s router. This becomes important when things go wrong.
The network flow looks like this:
VPC and Subnets
ECS tasks run in private subnets with no public IP assignment. All traffic is controlled through route tables and security groups.
Route Table
Traffic bound for on-premises CIDR ranges routes explicitly to the VGW rather than the NAT Gateway. This is what initiates the VPN path instead of attempting a public internet connection.
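The routing decision above can be modeled in a few lines. This is only an illustration of the same longest-match logic the VPC route table applies; the CIDR ranges here are hypothetical placeholders, not the client's real networks:

```python
import ipaddress

# Hypothetical on-premises ranges; substitute your own.
ON_PREM_CIDRS = [
    ipaddress.ip_network("10.50.0.0/16"),
    ipaddress.ip_network("172.20.0.0/16"),
]

def route_target(dest_ip: str) -> str:
    """Return 'vgw' for on-prem destinations, 'nat' for everything else."""
    addr = ipaddress.ip_address(dest_ip)
    if any(addr in net for net in ON_PREM_CIDRS):
        return "vgw"   # explicit on-prem route -> VPN path
    return "nat"       # default route -> NAT Gateway
```

Anything matching an on-prem CIDR rides the VPN; everything else takes the default route.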
Security Group
ECS tasks are permitted outbound TCP on port 50561 scoped to the data provider's network ranges. No broad egress rules.
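The egress rule is deliberately narrow. A sketch of the same check in Python, assuming a hypothetical provider CIDR (the port comes from the setup above; the network range is illustrative):

```python
import ipaddress

# Hypothetical egress rule: outbound TCP 50561 to the provider's range only.
EGRESS_RULES = [
    {"protocol": "tcp", "port": 50561,
     "cidr": ipaddress.ip_network("10.60.0.0/24")},
]

def egress_allowed(dest_ip: str, port: int, protocol: str = "tcp") -> bool:
    """True only if an explicit rule covers this destination and port."""
    addr = ipaddress.ip_address(dest_ip)
    return any(
        r["protocol"] == protocol and r["port"] == port and addr in r["cidr"]
        for r in EGRESS_RULES
    )
```

No broad egress means anything outside that tuple is dropped by default.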
VPC Flow Logs
Enabled from the start. This turned out to be the most valuable tool we had when troubleshooting.
Site-to-Site VPN
We provision two tunnels for redundancy. The Virtual Private Gateway attaches to the VPC and ties into the route table entries above. The customer gateway is configured against the client router’s public IP.
The Fargate task runs a Python ETL process that connects to SQL Anywhere using the sqlanydb driver, executes extraction queries, and writes results to S3 for downstream ingestion into Snowflake.
The connection targets the database's internal IP on port 50561. Because the ECS task sits in a private subnet with correct route table entries, the VPN path is fully transparent to the application.
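A rough sketch of how the task assembles its connection settings. Every name here (IP, database name, environment variables) is a hypothetical stand-in except the port, which matches the security group rule above; the keyword names follow SQL Anywhere connection-string conventions, so verify them against the sqlanydb documentation for your driver version:

```python
import os

def build_conn_params() -> dict:
    """Assemble SQL Anywhere connection parameters from the task environment.

    Defaults are illustrative placeholders only.
    """
    return {
        "host": f"{os.environ.get('DB_HOST', '10.60.0.15')}:50561",
        "userid": os.environ.get("DB_USER", "etl_reader"),
        "password": os.environ.get("DB_PASSWORD", ""),
        "dbn": os.environ.get("DB_NAME", "sportsdb"),
    }

# In the task itself this feeds the driver, roughly:
#   conn = sqlanydb.connect(**build_conn_params())
```

Nothing in the application code knows a VPN exists; it simply dials an internal IP.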
Data moves from ingestion through to transformation in Snowflake in under 7 minutes end-to-end, traversing three separate network environments in the process. Our initial run took over 40 minutes (a slow database was the bottleneck); we brought that down to under 7. This post focuses on the infrastructure rather than the pipeline, so we'll save those details for another time.
Debugging connection issues across environments
Attempted connections timed out, every time. No error, just silence. And the AWS console showed everything as healthy.
This is where multi-environment, multi-network architectures get interesting. The failure was not in AWS. The AWS side was configured correctly. VPN tunnels were up, route tables were correct, security groups were clean. The console had nothing useful to offer because from AWS’s perspective, traffic was leaving as expected.
The problem lived outside AWS entirely, in the routing configuration between two network environments that AWS had no visibility into. Finding it required looking at the data that AWS does surface but that most teams never think to check until something breaks.
What the traffic data told us
When you are debugging connectivity across a VPN and the standard checks come back clean, the question to ask is not whether traffic is leaving your environment. It is whether traffic is making it back. Those are two different problems with two different diagnostic approaches, and conflating them is where most teams lose time.
The symptom was asymmetric routing. Traffic was reaching its destination. The response was not completing the return journey. The two sides of the connection were operating on different routing assumptions, and nothing in the console was going to surface that.
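This is the pattern VPC Flow Logs can surface directly: outbound records with no matching return record. A minimal sketch that pairs records by their 5-tuple and flags one-way flows; the field positions follow the default flow-log format, and the sample records are illustrative:

```python
def one_way_flows(records: list[str]) -> set[tuple]:
    """Return 5-tuples (minus protocol) that never appear in reverse."""
    seen = set()
    for rec in records:
        f = rec.split()
        # default format: version account eni srcaddr dstaddr srcport dstport ...
        src, dst, sport, dport = f[3], f[4], f[5], f[6]
        seen.add((src, dst, sport, dport))
    return {(s, d, sp, dp) for (s, d, sp, dp) in seen
            if (d, s, dp, sp) not in seen}  # no matching return flow

logs = [
    # ECS task -> database: accepted, left the VPC...
    "2 123456789012 eni-0a1 10.0.1.5 10.60.0.15 49152 50561 6 10 840 0 60 ACCEPT OK",
    # ...and no record of 10.60.0.15 -> 10.0.1.5: the reply never came back.
]
```

Seeing outbound ACCEPT records with nothing coming back is the signature of a broken return path, not a blocked outbound one.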
Where the gap lived
In a multi-environment setup, network configuration on the on-premises side has to be updated to reflect every environment AWS provisions. This is not something AWS manages or prompts for. It is a coordination requirement that lives outside the platform entirely, and it is easy to miss when dev and prod are provisioned at different points in time.
The on-premises router knew how to return traffic to the development environment. It had no corresponding instruction for production. One environment worked. The other dropped return packets silently.
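The gap reduces to a simple question: does any static route on the router cover the new VPC's CIDR? A sketch of that check, with hypothetical CIDRs standing in for the real dev and prod ranges:

```python
import ipaddress

# Hypothetical router config: a static route back to the dev VPC only.
ROUTER_STATIC_ROUTES = [ipaddress.ip_network("10.1.0.0/16")]  # dev, no prod

def has_return_route(aws_cidr: str) -> bool:
    """True if some static route fully covers the given AWS CIDR."""
    vpc = ipaddress.ip_network(aws_cidr)
    return any(vpc.subnet_of(route) for route in ROUTER_STATIC_ROUTES)
```

Dev traffic finds its way home; a prod CIDR outside that route gets dropped silently, which is exactly the failure we hit.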
What this class of failure looks like
Traffic leaves AWS. Nothing comes back. No error, no timeout message from the router, no log entry on the AWS side indicating a problem. The connection just hangs until it times out at the application layer.
If you are seeing this pattern, the problem is almost certainly not in AWS. It is in the return path on the other side of the tunnel, and the place to look is routing configuration on the on-premises network.
The AWS console is not the full picture
In hybrid network architectures, AWS surfaces what happens inside AWS. What happens on the other side of the VPN tunnel is outside its visibility. Debugging requires looking at both ends independently.
Dev and prod are separate network configurations, not just separate environments
Any time a new VPC is provisioned, every network environment it needs to communicate with has to be updated to reflect it. That coordination does not happen automatically and it does not live in your Terraform.
Connectivity testing in hybrid environments requires the right tools
Common approaches to connectivity testing do not always work reliably across VPN boundaries. Using the right diagnostic layer from the start saves significant time when something goes wrong.
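One tool that does behave predictably across a VPN boundary is a plain TCP probe that distinguishes an active refusal (something answered) from a silent timeout (nothing came back). A sketch; point it at whatever endpoint you are testing:

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'open', 'refused', or 'timeout'.

    Across a VPN, 'refused' means packets round-tripped but nothing is
    listening; 'timeout' means the path itself is suspect.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # reached a host; the return path works
    except (socket.timeout, OSError):
        return "timeout"   # silence: look at routing, not the database
```

A "refused" result rules out the network path entirely, which is exactly the distinction a hung application-layer timeout cannot give you.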
Asymmetric routing is the most common failure mode
One-way traffic is a category of problem, not a single root cause. Recognizing the pattern early points you toward the right part of the architecture to investigate rather than spending time eliminating the wrong suspects.
Production failures in hybrid architectures are coordination failures as much as technical ones
The fix in our case was not a code change or an infrastructure change on the AWS side. It was a conversation between two teams about a configuration that needed to be updated. Building processes that account for that coordination is as important as the technical architecture itself.
Site-to-Site VPN gives us a repeatable, secure pattern for extending cloud pipelines into on-premises environments without requiring the database to move or exposing it to the public internet. The tradeoff is operational coordination across network boundaries. Changes on one side have to be reflected on the other.
The broader principle holds beyond this specific stack: whenever two network environments are bridged, verifying the full path end-to-end is not optional. Confirming that traffic leaves your environment is not the same as confirming the connection works.
Need help setting up secure hybrid cloud in your environment? Drop us a note; we'd be happy to help.