From Python to Go: Why We Rewrote Our Ingest Pipeline at Telemetry Harbor

We rewrote Telemetry Harbor’s ingest pipeline from Python FastAPI to Go after hitting severe performance limits. The switch delivered 10x efficiency, improved data integrity with strict typing, and gave us a stable, scalable foundation for high-volume time-series data ingestion.


At Telemetry Harbor, we learned the hard way that choosing the right technology for prototyping isn't always the right choice for production. After months of struggling with Python's performance limitations in our data ingestion pipeline, we made the decision to rewrite our core ingest system in Go. Here's the story of why we made this change, the challenges we faced, and the dramatic performance improvements we achieved.

The Context: Building a Time-Series Data Platform

Telemetry Harbor was born from our team's extensive experience in the automotive industry, where we repeatedly built the same infrastructure for time-series data projects: database setup, backend architecture, data ingestion, and visualization. Every project started with this same foundation before branching into ML, analytics, or alerting. This repetitive cycle was frustrating: we were essentially rebuilding the same wheel for every client, spending weeks on infrastructure that should have been commoditized.

The breaking point came in October 2024 when a friend reached out with an observation that would change everything. He saw an opportunity to create a platform that was truly ready-to-go: users could simply sign up, push their data, and we'd handle all the complexity behind the scenes. At first, we thought, "Okay, InfluxDB already has something like this with their cloud offering," but as we dug deeper, we identified a significant gap we could fill.

InfluxDB's reputation had taken a serious hit after critical features moved behind paywalls and a string of new versions introduced breaking changes that forced costly migrations. And from our direct experience, InfluxDB simply didn't handle big data well at all. We'd seen it crash, fail to start, and generally buckle under the kind of data loads our automotive clients regularly threw at time-series systems. TimescaleDB and ClickHouse were technically solid databases, but they still left you with the fundamental problem: you had to create your own backend and build your entire ingestion pipeline from scratch. There was no plug-and-play solution.

We started working on a proof of concept, and after exploring different approaches and getting some feedback, we made a pivotal decision. We would create our own version where we could build it the way we envisioned, while still maintaining the collaborative relationship that sparked the idea in the first place.

Our differentiator would be radical simplicity: a complete platform with seamless Grafana integration, eliminating the infrastructure overhead that kept teams from focusing on their actual data insights. We wanted to solve the deployment complexity, the database maintenance headaches, and the connection challenges that forced users to become infrastructure experts when they just wanted to analyze their telemetry data.

Python FastAPI: The Right Choice for Prototyping

When we started building our MVP, we faced a classic startup dilemma: optimize for development speed or runtime performance? This wasn't a decision we took lightly; we spent considerable time researching our options, weighing the trade-offs between different approaches.

The choice came down to a fundamental strategic trade-off: either we could go for something lightning-fast like Go or Rust, accepting that we'd have to deal with the language paradigms and learning curves that would inevitably slow us down, or we could "go crazy" with Python FastAPI to achieve rapid prototyping, deep market understanding, continuous feedback loops, and the agility to quickly create, adjust, and fulfill evolving customer needs.

We chose development speed. Our engineering team selected Python FastAPI because we needed to:

  • Validate our market assumptions quickly before competitors moved in
  • Gather customer feedback and iterate rapidly on features
  • Test different approaches without committing to massive rewrites
  • Get to market fast enough to establish ourselves in the space

From our deep industry experience, we knew TimescaleDB would be an excellent fit for our time-series requirements. It's built on PostgreSQL, which our team was intimately familiar with, it scales beautifully, and since we were planning to deploy on Kubernetes, there were plenty of mature operators available to simplify our infrastructure management.

The initial architecture was deliberately straightforward: HTTP-first communication (since many enterprises have very strict firewall rules that block MQTT and other IoT protocols), Redis with RQ workers for payload queuing and consumption, and TimescaleDB for time-series storage. We deployed to a small Kubernetes cluster and started testing with friends and early users. The results looked promising, and our beta users were genuinely happy with the functionality.

However, even in these early stages, we noticed something concerning: RQ workers are synchronous by design, meaning they process payloads sequentially, one by one. That wasn't going to be good for scalability at IoT data volumes. For context, in the automotive industry where we'd cut our teeth, we regularly sampled data every 60 milliseconds. Sequential database writes simply weren't going to cut it for that kind of throughput, though granted, our target for Telemetry Harbor was more forgiving: we were aiming to handle 1-second sampling rates effectively, not millisecond-scale precision.

The Performance Wall: When Python Couldn't Keep Up

As our user base grew, we started hitting serious performance bottlenecks. The problems weren't immediately obvious; they emerged gradually as data volumes increased, like a slow-building pressure that eventually becomes impossible to ignore.

The RQ Worker Bottleneck

Our first major scaling attempt involved implementing multi-threading with supervisor and scaling up our worker processes horizontally. While this approach did push more data through the system and improved our overall throughput, RQ proved to be fundamentally inefficient at handling the kind of loads we were seeing.

The performance numbers were frankly alarming:

Python Performance Profile:
• Idle CPU: 10%
• Medium load: ~40% CPU
• Heavy load: 120-300% CPU (peaks at 800%)
• Result: Service crashes, 500 errors cascade

What made this particularly frustrating was that we weren't even handling massive amounts of data yet. This level of resource consumption for relatively modest workloads was completely unsustainable.

The Reality Check

Our comprehensive stress testing revealed the harsh truth about our Python implementation. The system would crash under very light connection loads, loads that should have been trivial for a production-ready service. Redis memory usage would spike unpredictably, Python processes would terminate unexpectedly under moderate stress, and 500 errors would start cascading through the system like dominoes falling.

This was our wake-up call. If we wanted to build a platform that customers would actually pay for, a service they could depend on for their critical telemetry data, we needed something fundamentally more robust and efficient. The Python implementation might have been perfect for validation and early prototyping, but it was becoming clear that it would never scale to meet real-world production demands.

We realized that the sequential processing limitation of RQ workers was just the tip of the iceberg. The entire Python-based architecture was struggling under loads that our future customers would consider routine. We needed to make a difficult but necessary decision: it was time to consider a complete rewrite.

The Migration Decision: Why Go?

Faced with the clear limitations of our Python implementation, we carefully evaluated our migration options. This wasn't a decision we could afford to get wrong; choosing the right technology would determine whether we could deliver on our promises to customers.

We had two serious candidates: Rust and Golang.

Rust offered maximum performance and memory safety, exactly what we needed for high-throughput data processing. However, Rust came with a steep learning curve and complex language paradigms that would significantly slow down our development velocity. At our current scale, while we absolutely wanted efficiency, development speed and our ability to adapt quickly to market feedback and evolving customer needs remained our top priority. We couldn't afford to offer a feature, realize our users didn't like it, then spend months trying to modify it because of language complexity.

Go provided what seemed like the sweet spot: near-Rust performance levels with much simpler development workflows and faster iteration cycles. Since we were still actively iterating based on customer feedback and market signals, maintaining development velocity was crucial for our survival as a platform.

The decision became clear when we framed it strategically: Go would give us production-ready performance without sacrificing our ability to adapt quickly to changing requirements. It was the best of both worlds for our current situation, super efficient in terms of resource usage yet simple enough to implement and modify rapidly as we continued to refine our product-market fit.

This wasn't just about technical performance metrics; it was about positioning ourselves to succeed in a competitive market where the ability to respond quickly to customer needs could make the difference between success and failure.

Rewriting the Ingest Pipeline in Go

Development Phase

We got to work converting our Python ingest endpoint to Go. While it was challenging to figure things out at the beginning (choosing frameworks, understanding Go idioms, architecting for performance), we leaned heavily on research and, honestly, AI assistance to get something functional up and running quickly.

The initial Go code wasn't pretty or production-ready, but it was something we could test and validate our approach with. Through our research, we selected Fiber as our Go framework. Fiber marketed itself as one of the fastest Go web frameworks available, with wide support specifically for high-throughput data applications: exactly what our ingest pipeline needed to handle the volume of telemetry data we were targeting.
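For readers unfamiliar with Fiber, here is a minimal sketch of what an ingest handler along these lines might look like. The payload fields and the queueing step are illustrative assumptions for this post, not Telemetry Harbor's actual schema or code.

```go
package main

import (
	"log"
	"time"

	"github.com/gofiber/fiber/v2"
)

// Reading is an illustrative payload shape, not Telemetry Harbor's actual schema.
type Reading struct {
	ShipID  string    `json:"ship_id"`
	CargoID string    `json:"cargo_id"`
	Time    time.Time `json:"time"`
	Value   float64   `json:"value"` // a JSON string or boolean here fails to decode
}

func main() {
	app := fiber.New()

	// v2-style ingest route: parse the body strictly and reject malformed payloads with a 400.
	app.Post("/api/v2/ingest/:harbor_id", func(c *fiber.Ctx) error {
		var r Reading
		if err := c.BodyParser(&r); err != nil {
			return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{"error": err.Error()})
		}
		// In a real pipeline the record would be queued (e.g. in Redis) for batched
		// insertion into TimescaleDB; here we simply acknowledge receipt.
		_ = c.Params("harbor_id")
		return c.SendStatus(fiber.StatusCreated)
	})

	log.Fatal(app.Listen(":8080"))
}
```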

Our migration started with a focused prototype that proved our architectural assumptions were sound. Local testing immediately showed promise: we could see Go's performance advantages even in our rough, hastily written prototype code. This gave us the confidence to invest in a complete rewrite rather than trying to patch our Python implementation.

Production Implementation

We built the production Go service with Docker and made absolutely sure to implement backward compatibility. This was crucial: we couldn't afford to break existing customer integrations during the transition.

Our API versioning strategy was designed to minimize customer disruption while encouraging migration to the superior implementation:

  • Existing users: https://telemetryharbor.com/api/v1/ingest/Harbor_id (Python)
  • New implementations: https://telemetryharbor.com/api/v2/ingest/Harbor_id (Go)
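For a new integration, pointing a client at the v2 endpoint is all that changes. The snippet below is a minimal sketch of such a call; the JSON payload shape is illustrative, Harbor_id is a placeholder, and authentication details are omitted.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Illustrative payload only; consult the Telemetry Harbor docs for the real schema.
	body := []byte(`{"ship_id":"vessel-7","cargo_id":"engine_temp","time":"2025-01-01T00:00:00Z","value":88.5}`)

	// "Harbor_id" is a placeholder for the harbor identifier; auth headers are omitted.
	resp, err := http.Post(
		"https://telemetryharbor.com/api/v2/ingest/Harbor_id",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```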

Go was remarkably easy to work with during this process, making the transition much smoother than we had anticipated. The language's simplicity and excellent tooling meant we could move fast without sacrificing code quality.

The Performance Results Were Dramatic

When we deployed and started testing, the results were genuinely shocking:

Go Performance Profile:
• Idle CPU: 1% (down from 10%)
• Heavy load: ~60% CPU (stable and predictable)
• No more crashes or cascading failures
• 10x efficiency improvement over Python baseline

The difference was night and day. Where our Python implementation would spike to 800% CPU usage and crash the entire service, our Go implementation remained stable and predictable under identical loads. But the real revelation was that CPU utilization wasn't even the bottleneck anymore: Go was so efficient that we'd essentially eliminated the application layer as a constraint on our system's performance.

Engineering Challenge: Handling Database Constraints

During our Go rewrite, we encountered a critical challenge that perfectly illustrated the kind of robust solutions our new architecture enabled us to implement efficiently.

The challenge centered around our data integrity model. Telemetry Harbor has a fundamental constraint: we prevent customers from ingesting conflicting values for the same ship, cargo, and timestamp combination. This constraint is essential for preventing duplicates and maintaining data quality; it's part of our user-friendly philosophy where we handle data validation complexity behind the scenes so our customers don't have to worry about it.

The problem emerged during our batch processing implementation: if any single record in a batch violated our uniqueness constraint, PostgreSQL would reject the entire batch. This created an impossible trade-off. We could process records individually to maintain data integrity, but that would be far too slow for high-volume data ingestion. Alternatively, we could batch process for speed, but risk losing entire batches due to single constraint violations.

Our Solution: Two-Stage Insertion

We implemented an elegant two-stage insertion strategy that solved both problems:

  1. Stage 1: Batch copy all incoming records to a temporary table
  2. Stage 2: Let PostgreSQL intelligently select and insert only the valid records from the temporary table into the production table

This approach optimized both batch and individual insert performance while maintaining our strict data integrity requirements. The temporary table acts as a staging area where we can safely attempt bulk operations, and then PostgreSQL's native conflict resolution handles the constraint checking efficiently.
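A sketch of how this two-stage pattern can be expressed with Go and pgx is below. The table and column names and the exact uniqueness constraint are assumptions made for illustration; the real schema differs.

```go
package ingest

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
)

// Reading is an illustrative record shape; the real schema differs.
type Reading struct {
	ShipID  string
	CargoID string
	Time    time.Time
	Value   float64
}

// insertBatch stages a whole batch in a temporary table, then lets PostgreSQL
// move only the rows that do not violate the (ship, cargo, time) uniqueness constraint.
func insertBatch(ctx context.Context, conn *pgx.Conn, batch []Reading) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx)

	// Stage 1: bulk-copy everything into an unconstrained temp table.
	if _, err := tx.Exec(ctx, `
		CREATE TEMP TABLE staging (LIKE readings INCLUDING DEFAULTS) ON COMMIT DROP`); err != nil {
		return err
	}
	rows := make([][]any, 0, len(batch))
	for _, r := range batch {
		rows = append(rows, []any{r.ShipID, r.CargoID, r.Time, r.Value})
	}
	if _, err := tx.CopyFrom(ctx, pgx.Identifier{"staging"},
		[]string{"ship_id", "cargo_id", "time", "value"}, pgx.CopyFromRows(rows)); err != nil {
		return err
	}

	// Stage 2: insert only the rows that pass the constraint; conflicting rows are skipped.
	if _, err := tx.Exec(ctx, `
		INSERT INTO readings (ship_id, cargo_id, time, value)
		SELECT ship_id, cargo_id, time, value FROM staging
		ON CONFLICT (ship_id, cargo_id, time) DO NOTHING`); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```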

The beauty of this solution was that our batch processing became not just super fast, but also more solid and reliable. Even individual inserts benefited from this architecture because the constraint checking was moved to the database level where it could be handled more efficiently than in application logic.

This was exactly the kind of robust, performance-oriented solution that our Go rewrite enabled us to implement efficiently. The predictable performance characteristics and lower resource overhead of Go gave us the headroom to implement more sophisticated data handling strategies.

The New Bottleneck: Database Performance

With our ingest pipeline now running with remarkable efficiency, an interesting shift occurred in our system's performance profile. PostgreSQL started climbing in CPU and RAM utilization as data volumes increased, while Redis continued to show solid stability throughout our testing.

Honestly, we were surprised that Redis kept it together so well under the increased load—it proved to be remarkably resilient and performant even as we pushed more data through the system. But PostgreSQL was now showing significant resource spikes as it became the receiving end of our much more efficient data ingestion pipeline.

This was actually fantastic news from an architectural perspective. We had successfully moved the bottleneck from our application layer (where we had limited scaling options and unpredictable performance) to the database layer, where we had many more scaling strategies available and could implement solutions like sharding, read replicas, and horizontal partitioning.

While we can certainly scale our current database setup more aggressively, this is also an ideal time to consider implementing a sharding strategy sooner rather than later. The database becoming our primary constraint indicated that our Go migration had been successful: we'd eliminated the application-level performance issues that were preventing us from reaching our system's true capacity limits.

A Surprising Discovery: Pydantic's Type Coercion

During our canary deployment of the Go endpoint, we noticed something concerning: the 400 error rates were significantly higher than expected. This wasn't just a small increase; it was well above our normal baseline. For reference, we always had some users pushing too much data or using wrong formats (that's just the reality of working with diverse clients), but the error rate from our Go ingest endpoint was substantially higher than what we considered normal.

This discrepancy demanded investigation. We dug deeper into the failing payloads, examining the specific data that was being rejected by our Go implementation but had previously been accepted by our Python system. What we discovered was frankly shocking, though we acknowledge we could be wrong in our testing methodology; if anyone has different insights, please let us know in the comments.

The Type Coercion Problem

Despite our Pydantic models explicitly specifying float types for numerical values, the Python endpoint was actually accepting values that should have been rejected outright. For example:

  • Boolean coercion: If someone submitted True or False (standard Python boolean values), Pydantic would silently interpret these as 1 and 0 respectively. There was no validation error, no warning, just automatic conversion.
  • String-to-number coercion: When users submitted numbers as strings (like "123.45"), Pydantic would automatically convert them to floats or integers without any validation warnings or errors.

We looked into this behavior and discovered that it was actually expected according to Pydantic's design! The framework distinguishes between float (which allows coercion from compatible types) and StrictFloat (which enforces strict type matching). By default, Pydantic prioritizes flexibility over strict validation.

The Broader Implications

This discovery highlighted a fundamental philosophical difference between languages and frameworks. Here we can see how Python, while incredibly easy to learn and perfect for throwing together a proof of concept, falls significantly behind when it comes to being a production-ready language for systems that require data integrity.

When you define an int in Go or any other more strictly-typed production-ready language, it will always be an int unless you explicitly customize it to behave otherwise. The type system enforces what you declare, preventing subtle data corruption issues that can compound over time.
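As a quick standalone illustration of that difference (a sketch, not our production code, with an illustrative field name): Go's standard encoding/json refuses to coerce a JSON string or boolean into a float64 field, which is exactly why the v2 endpoint started returning 400s for payloads the Python endpoint had quietly accepted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reading uses a plain float64 for the value; the field name is illustrative.
type Reading struct {
	Value float64 `json:"value"`
}

func main() {
	payloads := []string{
		`{"value": 123.45}`,   // accepted
		`{"value": "123.45"}`, // rejected: string where a number is expected
		`{"value": true}`,     // rejected: boolean where a number is expected
	}
	for _, p := range payloads {
		var r Reading
		if err := json.Unmarshal([]byte(p), &r); err != nil {
			fmt.Printf("%-22s -> error: %v\n", p, err)
			continue
		}
		fmt.Printf("%-22s -> ok: %v\n", p, r.Value)
	}
}
```

Pydantic can be made just as strict by opting into StrictFloat, as noted above; with Go, we simply got that strictness as the default rather than as an opt-in.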

Honestly, we're not sure whether to blame the Pydantic team for these defaults or Python as a language for its overall philosophy, but the implications are clear. While we can certainly call it a beginner-friendly approach that allows users to make mistakes without immediate consequences, it's also inherently unsafe for production systems where data integrity is paramount.

The irony wasn't lost on us: our "stricter" Go implementation was actually catching data quality issues that our "flexible" Python implementation had been silently allowing through.

Choosing Strictness Over Convenience

After team discussion, we decided to maintain Go's stricter validation. While less forgiving than Python's automatic type conversion, strict typing prevents subtle data corruption issues and establishes clearer contracts with our users.

This reinforced our decision to migrate: Go's type system catches problems that Python's flexibility might hide until much later.

Current Status and Future Scaling

Today, we're running both API versions while encouraging new customers to use our Go-based v2 endpoints. The Python v1 API remains available for existing customers, but we expect to phase it out as usage migrates.

The next engineering challenge is database sharding to support even higher throughput, but we're confident in our application layer's ability to handle whatever our customers throw at it.

Lessons Learned: When to Rewrite

Our Python-to-Go migration taught us several key lessons about technology choices in startup environments:

1. Prototype Fast, But Know When to Rewrite

Python FastAPI was perfect for validating our market and gathering early customer feedback. But when performance became a blocking issue for customer satisfaction, we didn't hesitate to rewrite in a more suitable technology.

2. Performance Isn't Just About Speed

Go didn't just make our system faster; it made it predictable. The elimination of CPU spikes and crashes was as important as the raw performance improvements.

3. Type Safety Prevents Future Problems

Go's stricter type system catches data integrity issues that Python's flexibility might allow through, preventing subtle bugs that could affect customer data quality.

4. Sometimes Less User-Friendly Is Better

Rejecting malformed data at the API level, rather than silently converting it, results in a more reliable platform that customers can trust with their critical telemetry data.

Conclusion

The decision to rewrite our ingest pipeline in Go wasn't taken lightly, but the results vindicated our approach. We achieved a 10x improvement in efficiency, eliminated the crashes and unpredictable behavior that plagued our Python implementation, and built a foundation that can scale with our customers' growing data needs.

Most importantly, we learned that choosing the right tool for the job sometimes means choosing different tools for different phases of your product's lifecycle. Python got us to market quickly and helped us understand our customers' needs. Go gives us the performance and reliability to serve those needs at scale.

For other teams facing similar decisions: don't be afraid to rewrite when the technology that got you started can't get you where you need to go. Sometimes the best path forward is a completely new foundation.