Engineering

Building a real-time commute engine: lessons from early development

CommuteTimely Team · Oct 2025 · 10 min

Every piece of software has a moment when reality arrives like a physics experiment gone wrong.

In theory, the system works beautifully. The architecture diagrams look elegant. The database queries return results instantly. Then real users arrive.

When we launched the private beta of CommuteTimely, reality arrived in about three days. Our database CPU hit 100 percent. API latency ballooned from 50 milliseconds to nearly four seconds. Requests began piling up like cars at a broken traffic light.

The architecture we built was fundamentally wrong for the problem we were trying to solve. So we tore it down.


The First Architecture

The earliest version of CommuteTimely looked a lot like the stack at many modern startups. The backend was written in Node.js. External traffic and transit APIs were queried through REST calls. Historical commute data lived inside PostgreSQL.

For a prototype, this setup was perfect. But prototypes live in calm waters. Real commuters create storms.

The Real Workload of Arrival Intelligence

Most web services respond to explicit requests. Arrival intelligence behaves differently. The system must continuously answer the question: “When should this person leave?”

During peak commuting hours, CommuteTimely must evaluate departure timing roughly 200,000 times per minute. The system isn’t simply fetching data. It is recomputing a prediction engine continuously as conditions change. Our first architecture tried to handle this with polling. That decision became the root of the problem.

The Problem With Polling

In the initial system, the backend periodically checked every active commute for changes. At beta scale, this collapsed. Connection pools were exhausted, database locks multiplied, and queries began waiting on other queries: slow queries held connections longer, which forced more requests to queue, which made everything slower still. It was a classic feedback loop.

Thinking in Events Instead of Rows

Commuting is not a table. It is a constantly evolving stream of events. Instead of periodically checking whether anything had changed, the system needed to react immediately when something changed.
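One way to picture the shift is to model each change as a typed event to react to, rather than a row to re-query. A minimal sketch in Rust, with entirely illustrative event and field names (these are not our production types):

```rust
// Illustrative event types for a commute stream (names are hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum CommuteEvent {
    /// Average speed on a road segment changed (km/h).
    TrafficSpeedChanged { segment_id: u64, speed_kmh: f32 },
    /// A transit route reported a delay in minutes.
    TransitDelayed { route_id: u64, delay_min: u32 },
    /// A user's target arrival time moved (unix timestamp).
    ArrivalTargetMoved { user_id: u64, arrive_at: i64 },
}

/// React to a single event instead of re-scanning every active commute:
/// each event type narrows the recomputation to the users it can affect.
fn affected_users(event: &CommuteEvent) -> &'static str {
    match event {
        CommuteEvent::TrafficSpeedChanged { .. } => "users routed over this segment",
        CommuteEvent::TransitDelayed { .. } => "users on this route",
        CommuteEvent::ArrivalTargetMoved { .. } => "just this user",
    }
}

fn main() {
    let e = CommuteEvent::TransitDelayed { route_id: 12, delay_min: 7 };
    println!("{:?} -> recompute for {}", e, affected_users(&e));
}
```

The point of the enum is the `match`: an event carries enough context to scope the work, which a periodic table scan never can.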

Kafka Becomes the Nervous System

To handle massive streams of events, we adopted Apache Kafka. Every change in the transportation environment becomes an event published into Kafka topics. Instead of asking the system repeatedly if something changed, the system now broadcasts changes instantly.
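The publish/subscribe shape this buys us can be sketched in-process. In the sketch below, a std channel stands in for the broker, and the topic-style payload is illustrative; in production the producer writes to a Kafka topic and consumers subscribe to it:

```rust
use std::sync::mpsc;
use std::thread;

// A std channel stands in for a Kafka topic so the publish/subscribe
// shape is visible without a broker (payload format is illustrative).
fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Producer: publishes a traffic change the moment it is observed.
    let producer = thread::spawn(move || {
        tx.send("traffic.speed_changed segment=42 speed_kmh=18".to_string())
            .unwrap();
    });

    // Consumer: reacts immediately, instead of polling a table on a timer.
    let consumer = thread::spawn(move || {
        for event in rx {
            println!("recompute departures affected by: {event}");
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```

The inversion is the whole design: producers never know who is listening, and consumers never ask "did anything change?" — they are told.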

Why We Moved to Rust

Garbage-collected runtimes such as Node.js and Go periodically pause execution to reclaim memory. These pauses can introduce unpredictable latency spikes at the tail. We rebuilt our core microservices in Rust.

Rust offers memory safety without garbage collection and extremely low runtime overhead. It allowed us to process millions of spatial calculations per second without unpredictable pauses.
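To make "spatial calculations" concrete, here is the kind of primitive such an engine evaluates constantly: great-circle distance between two coordinates. This standalone haversine function is a sketch, not our production code:

```rust
/// Great-circle distance between two (lat, lon) points in kilometres,
/// via the haversine formula. A hot-path spatial primitive: no allocation,
/// no GC, just floating-point math.
fn haversine_km(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    const R: f64 = 6371.0; // mean Earth radius, km
    let (p1, p2) = (lat1.to_radians(), lat2.to_radians());
    let dp = (lat2 - lat1).to_radians();
    let dl = (lon2 - lon1).to_radians();
    let a = (dp / 2.0).sin().powi(2) + p1.cos() * p2.cos() * (dl / 2.0).sin().powi(2);
    2.0 * R * a.sqrt().atan2((1.0 - a).sqrt())
}

fn main() {
    // Roughly downtown San Francisco to downtown Oakland.
    let d = haversine_km(37.7749, -122.4194, 37.8044, -122.2712);
    println!("{d:.1} km");
}
```

Because a function like this compiles to straight-line machine code with no runtime in the way, calling it millions of times per second has a predictable cost.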

Replacing the Database

Historical traffic models rely on terabytes of time-series data. We replaced PostgreSQL with ScyllaDB, a Cassandra-compatible wide-column store well suited to high-throughput time-series workloads. This let us maintain single-digit-millisecond read latency under heavy write pressure.
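Read latency in a store like this depends heavily on partition design. A common time-series pattern is bucketing timestamps so no partition grows without bound. The arithmetic is simple; the hourly bucket below is an illustrative choice, not our actual schema:

```rust
/// Bucket a unix timestamp into fixed windows so each (segment, bucket)
/// partition stays bounded in size. The one-hour bucket is illustrative.
fn partition_key(segment_id: u64, unix_ts: i64) -> (u64, i64) {
    const BUCKET_SECS: i64 = 3600;
    (segment_id, unix_ts.div_euclid(BUCKET_SECS))
}

fn main() {
    // Two readings in the same hour land in the same partition...
    assert_eq!(
        partition_key(42, 1_700_000_000),
        partition_key(42, 1_700_000_900)
    );
    // ...while readings an hour apart do not.
    assert_ne!(
        partition_key(42, 1_700_000_000),
        partition_key(42, 1_700_003_700)
    );
    println!("bucketing ok");
}
```

Bounded partitions keep reads touching a predictable amount of data, which is what makes single-digit-millisecond latency sustainable as history accumulates.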

The Push Notification Pivot

Originally, the mobile app periodically asked the server: “Should I leave yet?” Polling wastes battery, bandwidth, and server capacity. We redesigned the backend to maintain a state machine for every active commute and send a push notification only when the departure threshold is crossed.
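The heart of that design is small: a per-user state machine that fires exactly once when the time-to-leave crosses a threshold. A minimal sketch, with an illustrative threshold value:

```rust
/// Per-user departure state (a sketch; the threshold is illustrative).
#[derive(Debug, PartialEq)]
enum DepartureState {
    Waiting,
    Notified,
}

struct UserCommute {
    state: DepartureState,
}

impl UserCommute {
    fn new() -> Self {
        Self { state: DepartureState::Waiting }
    }

    /// Called whenever a fresh prediction arrives. Returns true exactly
    /// once: when time-to-leave first crosses the notification threshold.
    fn on_prediction(&mut self, minutes_until_departure: i64) -> bool {
        const NOTIFY_AT_MIN: i64 = 5; // illustrative threshold
        match self.state {
            DepartureState::Waiting if minutes_until_departure <= NOTIFY_AT_MIN => {
                self.state = DepartureState::Notified;
                true // fire exactly one push notification
            }
            _ => false,
        }
    }
}

fn main() {
    let mut c = UserCommute::new();
    assert!(!c.on_prediction(30)); // still waiting
    assert!(c.on_prediction(4)); // threshold crossed: push sent
    assert!(!c.on_prediction(3)); // already notified, no duplicate push
    println!("state machine ok");
}
```

The state transition is what prevents duplicate notifications: the server can receive predictions as fast as events arrive, but each commute notifies at most once.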

Engineering for the Real World

The rebuild transformed the system. 99th percentile response latency dropped to 14 milliseconds globally. Compute costs decreased by nearly 60 percent.

Behind the simplicity of a countdown lies an engine processing millions of events, comparing historical patterns, and predicting the future. The goal of good engineering isn’t to expose complexity. It’s to hide it.
