Case Study: Scaling Going.com's Platform to Millions of Users
How we helped Going (formerly Scott's Cheap Flights) optimize their backend to handle millions of users and scale their infrastructure.
Ivan Smirnov
Founder, Smirnov Labs
When Going.com (formerly Scott’s Cheap Flights) approached us in early 2024, they had a good problem: explosive growth. Their flight deal notification service had grown from a small newsletter to a platform serving millions of travelers, but their infrastructure was struggling to keep up.
This case study walks through how we helped Going scale their platform, reduce costs, and improve reliability—all while maintaining their signature lightning-fast deal alerts.
Going.com provides personalized flight deal alerts. Users set their home airports and preferences, and Going’s algorithms scan millions of flights daily to find unusually cheap deals—often saving travelers $500+ per booking.
The problem: As they scaled from 500K to 2M+ users, several issues emerged: deal alerts slipping as much as 30 minutes behind during peak hours, a primary database pinned near 85% CPU, and AWS costs climbing with every new user.
The stakes: They were onboarding major airline partnerships and preparing for a Series B fundraise. The platform needed to handle 10M+ users within 12 months.
Existing architecture:
- A Ruby on Rails monolith running on EC2, with background jobs handling flight scanning, price comparison, and alert delivery in a single pipeline
- A single PostgreSQL primary serving all reads and writes
- Redis for caching and job queues

Pain points:
- Alert delivery slipping to 30 minutes behind during peak hours
- Reads and writes contending on one database, with CPU pinned at 85%
- Every component forced to scale together, so infrastructure costs grew faster than load
Before proposing major architecture changes, we needed to understand the full system and deliver immediate value.
Deep Dive Activities:
- Instrumented the alert pipeline end to end to see where delays accumulated
- Profiled the slowest database queries and background jobs
- Reviewed the codebase and AWS footprint alongside the engineering team

Quick Wins Delivered:
1. Optimized Database Queries: eliminated N+1 queries in the alert pipeline and added an index on users.last_notified_at (sketched below)
2. Tuned Redis Configuration
3. Right-Sized EC2 Instances
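For illustration, the index fix might look like the following Rails migration. This is a minimal sketch: the migration name, the preferences association, and the query shape are stand-ins, not Going's actual code.

# Hypothetical migration; the real schema details are Going's own.
class AddIndexToUsersLastNotifiedAt < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!  # required for CONCURRENTLY on Postgres

  def change
    # Speeds up the "who is due for an alert?" scan without blocking writes.
    add_index :users, :last_notified_at, algorithm: :concurrently
  end
end

# A typical N+1 fix alongside it: eager-load preferences in one query
# instead of issuing one query per user.
users = User.where("last_notified_at < ?", 1.hour.ago)
            .includes(:preferences)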
Impact after 3 weeks: Alert delays dropped from 30 minutes to 5 minutes during peak hours, and the team had breathing room to plan the larger refactor.
With immediate fires out, we designed a long-term architecture to support 10M+ users:
High-Level Design:
┌──────────────────────────────────────────────┐
│     Load Balancer (ALB + CloudFront CDN)     │
└──────────────────────┬───────────────────────┘
                       │
          ┌────────────┴────────────┐
          ▼                         ▼
   ┌────────────┐            ┌────────────┐
   │   Rails    │            │   Rails    │
   │  API Tier  │ ◄────────► │  API Tier  │
   │ (Stateless)│            │ (Stateless)│
   └─────┬──────┘            └─────┬──────┘
         │                         │
         └────────────┬────────────┘
                      ▼
            ┌──────────────────┐
            │  Message Queue   │
            │  (AWS SQS/SNS)   │
            └────────┬─────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
     ┌─────────┐           ┌─────────┐
     │ Flight  │           │  Alert  │
     │ Scanner │           │ Sender  │
     │ Workers │           │ Workers │
     └────┬────┘           └────┬────┘
          │                     │
          └──────────┬──────────┘
                     ▼
           ┌──────────────────┐
           │  Aurora Postgres │
           │ (Read Replicas)  │
           └──────────────────┘
Key Architectural Decisions:
1. Decouple Flight Scanning from Alert Sending
Previously, one background job did everything: check flights, compare prices, send alerts. This created thundering herds and made scaling impossible.
New approach:
- Scanner workers watch flight prices and publish a price-change event to the queue, nothing more
- Alert workers consume those events, match deals against user preferences, and send notifications
Impact: Each system can scale independently. Scanner workers scale with flight volume, alert workers scale with user count.
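A minimal sketch of the decoupled flow, assuming the aws-sdk-sqs gem; the queue URL and message shape are illustrative, not Going's production code:

require "aws-sdk-sqs"
require "json"

sqs = Aws::SQS::Client.new(region: "us-east-1")
QUEUE_URL = ENV.fetch("PRICE_DROP_QUEUE_URL")  # hypothetical queue

# Scanner worker: detect a price drop, hand it off, and move on.
def publish_price_drop(sqs, flight_id:, old_price:, new_price:)
  sqs.send_message(
    queue_url: QUEUE_URL,
    message_body: { flight_id: flight_id,
                    old_price: old_price,
                    new_price: new_price }.to_json
  )
end

# Alert worker: consumes at its own pace, so it scales with user count
# while scanner fleets scale with flight volume.
def poll_for_deals(sqs)
  loop do
    resp = sqs.receive_message(queue_url: QUEUE_URL,
                               max_number_of_messages: 10,
                               wait_time_seconds: 20)  # long polling
    resp.messages.each do |msg|
      deal = JSON.parse(msg.body)
      # ... match the deal against subscribed users, enqueue emails ...
      sqs.delete_message(queue_url: QUEUE_URL,
                         receipt_handle: msg.receipt_handle)
    end
  end
end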
2. Implement Read Replicas for Query Distribution
The primary database was handling reads and writes, causing contention.
Solution: migrate to Aurora PostgreSQL and route read-only traffic (price lookups, preference reads) to read replicas, reserving the primary for writes.
Impact: Database CPU dropped from 85% → 40%. Read query latency improved 3x.
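In Rails 6+ this kind of read/write splitting is built in. A sketch, assuming a primary_replica entry in database.yml (the names here are illustrative):

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Writes go to the primary; reads can be served by the Aurora replica.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Heavy read paths opt into the replica explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Flight.where(origin: "SFO").where("price_cents < ?", 20_000).load
end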
3. Build a Caching Layer for User Preferences
User preferences (home airport, deal thresholds) were queried on every price check.
Solution: cache preferences in Redis with a TTL, invalidate on update, and warm entries when a user subscribes.
Impact: 95% of preference lookups served from cache. Reduced database reads by 70%.
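A cache-aside sketch using Rails.cache backed by Redis; the key format, TTL, and model names are illustrative:

# Read path: serve from Redis, fall back to Postgres on a miss.
def preferences_for(user_id)
  Rails.cache.fetch("user_prefs:v1:#{user_id}", expires_in: 12.hours) do
    UserPreference.find_by!(user_id: user_id)
  end
end

# Write path: invalidate on change so the next read repopulates the cache.
class UserPreference < ApplicationRecord
  after_commit :bust_cache

  private

  def bust_cache
    Rails.cache.delete("user_prefs:v1:#{user_id}")
  end
end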
4. Adopt Event-Driven Architecture
Instead of polling for state changes, emit events:
- A FlightPriceChanged event triggers the alert workflow
- A UserSubscribed event warms the cache
- A DealExpired event cleans up old notifications

Implementation: AWS SNS + SQS with topic-based routing.
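Topic-based routing hangs off a message attribute that each SQS subscription filters on. A sketch with the aws-sdk-sns gem; the topic ARN and payload are illustrative:

require "aws-sdk-sns"
require "json"

sns = Aws::SNS::Client.new(region: "us-east-1")

# Publish a FlightPriceChanged event. Subscriptions set a filter policy
# such as {"event_type": ["FlightPriceChanged"]} so each worker pool
# receives only the events it cares about.
sns.publish(
  topic_arn: ENV.fetch("EVENTS_TOPIC_ARN"),
  message: { flight_id: 123, old_price_cents: 48_000,
             new_price_cents: 19_000 }.to_json,
  message_attributes: {
    "event_type" => { data_type: "String",
                      string_value: "FlightPriceChanged" }
  }
)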
We couldn’t rewrite everything at once, so we used the strangler fig pattern, cutting over piece by piece behind feature flags (sketched after the sprint plan below):
Sprint 1-2: Flight Scanner Extraction
Sprint 3-4: Alert Sender Refactor
Sprint 5-6: Database Migration to Aurora
Sprint 7-8: Observability & Hardening
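Each extraction was wrapped in a flag so cutover was gradual and reversible. A sketch of the idea; the class names and the home-grown percentage rollout below are hypothetical stand-ins for whatever flag tooling a team already uses:

require "digest"

# Route a slice of traffic to the extracted scanner; the legacy path
# stays live until the flag reaches 100% and holds.
def scan_route(route)
  if rollout_enabled?(:new_flight_scanner, route.id)
    NewFlightScanner.scan(route)   # extracted worker service (hypothetical)
  else
    LegacyScanner.scan(route)      # original monolith path (hypothetical)
  end
end

# Deterministic percentage rollout: a given route always takes the same
# branch, which keeps before/after comparisons clean.
def rollout_enabled?(flag, id, percent: 10)
  Digest::MD5.hexdigest("#{flag}:#{id}").to_i(16) % 100 < percent
end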
Alert Delivery: peak-hour delays that once reached 30 minutes stayed under 5 minutes as the user base grew.
Database Performance: primary CPU fell from 85% to 40%, and read query latency improved 3x.
System Reliability: tracing and dashboards turned incidents from guesswork into fast diagnoses.
AWS spend reduction: driven by right-sized EC2 instances, caching that cut database reads by 70%, and worker fleets that scale with their actual load rather than all together.
Product velocity: with the fires out, the team shipped features instead of fighting infrastructure.
Fundraising success: the platform was ready for the Series B the company had been preparing.
Customer satisfaction: faster, more reliable deal alerts for a growing user base.
We spent the first 2 weeks just understanding the system. The N+1 query fix delivered more value than any architectural change would have initially.
Takeaway: Instrument and profile before refactoring.
The monolith forced everything to scale together. Splitting flight scanning from alert sending let each scale based on its actual load.
Takeaway: Identify bounded contexts and separate them.
We could have rebuilt the entire platform in a new language. Instead, we extracted pieces incrementally, reducing risk.
Takeaway: Migrate in small, reversible steps with feature flags.
Going’s workload is read-heavy (millions of price checks, fewer writes). Caching user preferences and flight data reduced database load 70%.
Takeaway: Identify high-read, low-write data and cache it.
You can’t debug what you can’t see. OpenTelemetry + Grafana gave the team visibility they never had before.
Takeaway: Build observability in from day one of any migration.
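Wiring this up in Ruby might look like the following, using the opentelemetry-sdk and opentelemetry-instrumentation-all gems; the service and span names are illustrative:

# config/initializers/opentelemetry.rb
require "opentelemetry/sdk"
require "opentelemetry/instrumentation/all"

OpenTelemetry::SDK.configure do |c|
  c.service_name = "alert-sender"  # hypothetical service name
  c.use_all                        # auto-instrument Rails, Redis, SQL, HTTP
end

# Hand-rolled spans around the hot path show per-stage latency in Grafana:
tracer = OpenTelemetry.tracer_provider.tracer("alert-sender")

tracer.in_span("match_deals_to_users") do |span|
  span.set_attribute("deal.flight_id", 123)
  # ... matching logic ...
end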
As the platform stabilized and the team grew (8 → 18 engineers), my role shifted from hands-on implementation to strategic oversight:
- Months 1-4: Heavy hands-on work (architecture design, code reviews, migrations)
- Months 5-8: Engineering mentorship, incident response, optimization
- Months 9-12: Quarterly check-ins, strategic technical advice
Today, Going.com has a strong engineering team and a scalable platform. They’re ready for the next phase of growth.
Going.com’s transformation wasn’t just about technology: it was about building a foundation for the business. The architectural improvements enabled them to:
- Support a credible path to 10M+ users without another re-architecture
- Onboard major airline partnerships with confidence
- Walk into Series B conversations with a concrete scaling story
- Ship product faster on a decoupled, observable platform
If you’re facing similar scaling challenges—whether it’s performance, cost, reliability, or team velocity—you don’t have to figure it out alone.
Let’s talk about your specific situation and how fractional CTO services can help you scale successfully.
Note: All metrics shared with Going.com’s permission. Some technical specifics omitted for confidentiality.
If the challenges discussed in this article resonate with you, let's talk. I help startups navigate complex technology decisions, scale their teams, and build products that last.