Case Study · December 10, 2024

Case Study: Scaling Going.com's Platform to Millions of Users

How we helped Going (formerly Scott's Cheap Flights) optimize their backend to handle millions of users and scale their infrastructure.

Ivan Smirnov

Founder, Smirnov Labs

When Going.com (formerly Scott’s Cheap Flights) approached us in early 2024, they had a good problem: explosive growth. Their flight deal notification service had grown from a small newsletter to a platform serving millions of travelers, but their infrastructure was struggling to keep up.

This case study walks through how we helped Going scale their platform, reduce costs, and improve reliability—all while maintaining their signature lightning-fast deal alerts.

The Challenge

Business Context

Going.com provides personalized flight deal alerts. Users set their home airports and preferences, and Going’s algorithms scan millions of flights daily to find unusually cheap deals—often saving travelers $500+ per booking.

The problem: As they scaled from 500K to 2M+ users, several issues emerged:

  1. Performance degradation during peak hours (mornings when deals were sent)
  2. Database bottlenecks causing alert delays of 15-30 minutes
  3. Rising infrastructure costs (spending $40K/month on AWS)
  4. Scaling challenges with their monolithic Ruby on Rails application
  5. Limited observability making it hard to diagnose issues

The stakes: They were onboarding major airline partnerships and preparing for a Series B fundraise. The platform needed to handle 10M+ users within 12 months.

Technical Landscape

Existing architecture:

  • Ruby on Rails monolith (5 years old)
  • PostgreSQL primary database (1TB+)
  • Redis for caching and background jobs
  • Sidekiq for async processing
  • Deployed on AWS EC2 (manually scaled)
  • Basic monitoring (New Relic, limited custom metrics)

Pain points:

  • Flight price checks took 45-90 seconds per user
  • Database write locks during alert sends
  • No horizontal scaling strategy
  • Manual on-call rotations to restart hung workers
  • Monolithic deployments risked breaking unrelated features

The Engagement

Phase 1: Assessment & Quick Wins (Weeks 1-3)

Before proposing major architecture changes, we needed to understand the full system and deliver immediate value.

Deep Dive Activities:

  • Profiled database queries (found N+1 queries causing 80% of load)
  • Analyzed Sidekiq job patterns (discovered thundering herd problem)
  • Reviewed AWS spend (identified wasteful over-provisioning)
  • Shadowed on-call engineers during an incident

Quick Wins Delivered:

  1. Optimized Database Queries

    • Added missing indexes on users.last_notified_at
    • Eliminated N+1 queries in alert generation
    • Result: 60% reduction in query time (see the sketch after this list)
  2. Tuned Redis Configuration

    • Increased memory limits and tuned eviction policies
    • Added Redis Cluster for better distribution
    • Result: Cache hit rate improved from 70% → 94%
  3. Right-Sized EC2 Instances

    • Migrated to newer instance types with better price/performance
    • Implemented auto-scaling policies
    • Result: $8K/month savings (~20% cost reduction)
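
To make the first quick win concrete, here is a minimal sketch of the two changes in Ruby/Rails terms. The last_notified_at column is from Going's actual schema; the AlertPreference association and the enqueue_alert helper are illustrative stand-ins, not their real code.

    # Migration: add the missing index on users.last_notified_at.
    # Concurrent creation avoids locking a large production table.
    class AddIndexOnUsersLastNotifiedAt < ActiveRecord::Migration[7.0]
      disable_ddl_transaction!

      def change
        add_index :users, :last_notified_at, algorithm: :concurrently
      end
    end

    # Before: an N+1 pattern in alert generation: one query for the user
    # batch, then one more query per user to load their preferences.
    User.where("last_notified_at < ?", 24.hours.ago).find_each do |user|
      user.alert_preferences.each { |pref| enqueue_alert(user, pref) }
    end

    # After: eager-load preferences so each batch is served by a constant
    # number of queries instead of one per user.
    User.where("last_notified_at < ?", 24.hours.ago)
        .includes(:alert_preferences)
        .find_each do |user|
      user.alert_preferences.each { |pref| enqueue_alert(user, pref) }
    end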

Impact after 3 weeks: Alert delays dropped from 30 minutes to 5 minutes during peak hours, and the team had breathing room to plan the larger refactor.

Phase 2: Architecture Redesign (Weeks 4-8)

With immediate fires out, we designed a long-term architecture to support 10M+ users:

High-Level Design:

┌──────────────────────────────────────────────┐
│     Load Balancer (ALB + CloudFront CDN)     │
└────────────────┬─────────────────────────────┘

    ┌────────────┴────────────┐
    ▼                         ▼
┌────────────┐          ┌────────────┐
│   Rails    │          │   Rails    │
│  API Tier  │ ◄──────► │  API Tier  │
│ (Stateless)│          │ (Stateless)│
└─────┬──────┘          └─────┬──────┘
      │                       │
      └───────────┬───────────┘

        ┌──────────────────┐
        │   Message Queue  │
        │  (AWS SQS/SNS)   │
        └────────┬─────────┘

    ┌────────────┴────────────┐
    ▼                         ▼
┌─────────┐              ┌─────────┐
│ Flight  │              │  Alert  │
│ Scanner │              │ Sender  │
│ Workers │              │ Workers │
└────┬────┘              └────┬────┘
     │                        │
     └────────────┬───────────┘

        ┌──────────────────┐
        │  Aurora Postgres │
        │  (Read Replicas) │
        └──────────────────┘

Key Architectural Decisions:

1. Decouple Flight Scanning from Alert Sending

Previously, one background job did everything: check flights, compare prices, send alerts. This created thundering herds and made it impossible to scale the two workloads independently.

New approach:

  • Scanner Workers: Continuously check flight prices, write to database
  • Alert Workers: Read price changes, generate personalized alerts, send emails
  • Message Queue: SQS decouples the two workflows

Impact: Each system can scale independently. Scanner workers scale with flight volume, alert workers scale with user count.
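
A simplified sketch of the decoupled flow is below, written in Ruby with the aws-sdk-sqs gem for consistency with the rest of this post (the production scanner was later rewritten in Go). The queue URL, payload shape, and AlertSenderJob are illustrative.

    require "json"
    require "time"
    require "aws-sdk-sqs"

    SQS = Aws::SQS::Client.new(region: "us-east-1")
    PRICE_CHANGES_QUEUE_URL = ENV.fetch("PRICE_CHANGES_QUEUE_URL") # illustrative

    # Scanner side: when a price drop is detected, publish a message and move
    # on. The scanner never calls into the alert pipeline directly.
    def publish_price_change(route:, old_price:, new_price:)
      SQS.send_message(
        queue_url: PRICE_CHANGES_QUEUE_URL,
        message_body: {
          route: route,
          old_price: old_price,
          new_price: new_price,
          detected_at: Time.now.utc.iso8601
        }.to_json
      )
    end

    # Alert side: a separate worker pool long-polls the queue and fans out
    # personalized alerts. This loop scales independently of the scanner.
    def poll_price_changes
      loop do
        resp = SQS.receive_message(
          queue_url: PRICE_CHANGES_QUEUE_URL,
          max_number_of_messages: 10,
          wait_time_seconds: 20 # long polling
        )
        resp.messages.each do |msg|
          AlertSenderJob.perform_async(JSON.parse(msg.body)) # illustrative Sidekiq job
          SQS.delete_message(queue_url: PRICE_CHANGES_QUEUE_URL,
                             receipt_handle: msg.receipt_handle)
        end
      end
    end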

2. Implement Read Replicas for Query Distribution

The primary database was handling reads and writes, causing contention.

Solution:

  • Aurora PostgreSQL with 3 read replicas
  • Read queries route to replicas (price checks, user preferences)
  • Writes go to primary (alert logs, user updates)
  • Connection pooling with PgBouncer to manage connections

Impact: Database CPU dropped from 85% → 40%. Read query latency improved 3x.
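
With Rails 6+ multi-database support, the read/write split can be expressed roughly as in the sketch below. The connection names, hosts, and the UserPreference/AlertLog models are illustrative; in production, PgBouncer sits between the app and the Aurora endpoints.

    # config/database.yml (simplified): a writer plus a reader role pointing
    # at Aurora's load-balanced replica endpoint, both behind PgBouncer.
    #
    # production:
    #   primary:
    #     host: <aurora writer endpoint>
    #   primary_replica:
    #     host: <aurora reader endpoint>
    #     replica: true

    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true
      connects_to database: { writing: :primary, reading: :primary_replica }
    end

    # Read-heavy paths (price checks, preference lookups) pin to replicas...
    ActiveRecord::Base.connected_to(role: :reading) do
      UserPreference.where(home_airport: "DEN").to_a
    end

    # ...while writes (alert logs, user updates) always hit the primary.
    ActiveRecord::Base.connected_to(role: :writing) do
      AlertLog.create!(user_id: 42, deal_id: 7)
    end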

3. Build a Caching Layer for User Preferences

User preferences (home airport, deal thresholds) were queried on every price check.

Solution:

  • Cache user preferences in Redis with 1-hour TTL
  • Invalidate cache on user updates
  • Batch cache warming for active users

Impact: 95% of preference lookups served from cache. Reduced database reads by 70%.
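
A minimal sketch of that caching layer, assuming the Rails cache is backed by Redis; the UserPreference model and key naming are illustrative.

    class UserPreference < ApplicationRecord
      belongs_to :user

      CACHE_TTL = 1.hour

      # Invalidate on any change so a user's edited airports or deal
      # thresholds take effect on the next price check.
      after_commit :invalidate_cache

      def self.cached_for(user_id)
        Rails.cache.fetch("user_prefs/#{user_id}", expires_in: CACHE_TTL) do
          where(user_id: user_id).to_a
        end
      end

      private

      def invalidate_cache
        Rails.cache.delete("user_prefs/#{user_id}")
      end
    end

    # Batch cache warming for active users, e.g. from a scheduled job run
    # shortly before the morning alert window.
    User.where("last_notified_at > ?", 7.days.ago).select(:id).find_each do |u|
      UserPreference.cached_for(u.id)
    end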

4. Adopt Event-Driven Architecture

Instead of polling for state changes, the system now emits events:

  • FlightPriceChanged event triggers alert workflow
  • UserSubscribed event warms cache
  • DealExpired event cleans up old notifications

Implementation: AWS SNS + SQS with topic-based routing.
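
As a sketch, publishing one of these events with the aws-sdk-sns gem might look like the following; the topic ARN, attribute name, and payload are illustrative. Each consumer's SQS queue subscribes to the topic with a filter policy on the event type, which is what provides the topic-based routing.

    require "json"
    require "aws-sdk-sns"

    SNS = Aws::SNS::Client.new(region: "us-east-1")
    EVENTS_TOPIC_ARN = ENV.fetch("EVENTS_TOPIC_ARN") # illustrative

    # Publish a domain event. Downstream queues (alert workflow, cache
    # warmer, notification cleanup) filter on the event_type attribute, so
    # each worker only receives the events it cares about.
    def publish_event(event_type, payload)
      SNS.publish(
        topic_arn: EVENTS_TOPIC_ARN,
        message: payload.to_json,
        message_attributes: {
          "event_type" => { data_type: "String", string_value: event_type }
        }
      )
    end

    publish_event("FlightPriceChanged",
                  { route: "DEN-CDG", old_price: 950, new_price: 420 })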

Phase 3: Migration & Rollout (Weeks 9-16)

We couldn’t rewrite everything at once, so we used the strangler fig pattern:

Sprint 1-2: Flight Scanner Extraction

  • Built new scanner service in Go (for better concurrency)
  • Ran in parallel with old Ruby scanner
  • Compared outputs for 2 weeks and fixed discrepancies (comparison sketch below)
  • Cut over 100% of traffic to new scanner
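
The comparison itself was simple in principle: both scanners wrote results for the same scan window, and a scheduled job diffed them. A rough Ruby sketch of that job is below; LegacyScanResult and GoScanResult are hypothetical readers over each scanner's output.

    class ScannerComparisonJob
      include Sidekiq::Worker

      def perform(scan_window_id)
        legacy  = LegacyScanResult.prices_for(scan_window_id) # { route => price }
        rewrite = GoScanResult.prices_for(scan_window_id)

        mismatches = (legacy.keys | rewrite.keys).filter_map do |route|
          [route, legacy[route], rewrite[route]] if legacy[route] != rewrite[route]
        end

        return if mismatches.empty?

        Rails.logger.warn(
          "scanner mismatch window=#{scan_window_id} count=#{mismatches.size} " \
          "sample=#{mismatches.first(5).inspect}"
        )
      end
    end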

Sprint 3-4: Alert Sender Refactor

  • Extracted alert logic into separate service
  • Added SQS queue between scanner and sender
  • Enabled feature flag-based rollout (10% → 50% → 100%)
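
The write-up doesn't name the flag system, but with a library like Flipper the staged rollout looks roughly like this (NewAlertSender and LegacyAlertSender are illustrative service objects):

    require "flipper"

    # Assumes Flipper is configured with a persistent adapter
    # (e.g. flipper-active_record) in an initializer.

    # Ramp the new pipeline to a percentage of users, widening it as the
    # metrics hold: 10% -> 50% -> 100%.
    Flipper.enable_percentage_of_actors(:new_alert_sender, 10)

    # Each user is bucketed consistently by their flipper_id, so a given
    # user sees either the old or the new pipeline, never a mix.
    def send_alert(user, deal)
      if Flipper.enabled?(:new_alert_sender, user)
        NewAlertSender.deliver(user, deal)
      else
        LegacyAlertSender.deliver(user, deal)
      end
    end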

Sprint 5-6: Database Migration to Aurora

  • Created Aurora cluster with replication
  • Migrated primary database with minimal downtime (using Aurora’s migration tools)
  • Tested read replica failover scenarios
  • Monitored for performance regressions

Sprint 7-8: Observability & Hardening

  • Instrumented the services with OpenTelemetry (setup sketch below)
  • Built Grafana dashboards for key metrics
  • Set up PagerDuty alerts for SLO violations
  • Conducted load tests to validate 10M user capacity
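
For reference, the Ruby side of an OpenTelemetry setup like this is small. The sketch below uses the official SDK and auto-instrumentation gems, with an illustrative service name and custom span; traces feed the backends behind the Grafana dashboards.

    # Gemfile: opentelemetry-sdk, opentelemetry-exporter-otlp,
    #          opentelemetry-instrumentation-all
    require "opentelemetry/sdk"
    require "opentelemetry/instrumentation/all"

    OpenTelemetry::SDK.configure do |c|
      c.service_name = "alert-sender"  # illustrative
      c.use_all                        # auto-instrument Rails, Sidekiq, Redis, PG, ...
    end

    # A custom span around the hot path so per-stage alert latency shows up
    # in the traces.
    TRACER = OpenTelemetry.tracer_provider.tracer("going.alerts")

    def deliver_alert(user_id, deal_id)
      TRACER.in_span("alerts.deliver") do |span|
        span.set_attribute("user.id", user_id)
        span.set_attribute("deal.id", deal_id)
        # ... build and send the email ...
      end
    end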

The Results

Performance Improvements

Alert Delivery:

  • Latency: 30 minutes → 45 seconds (p95)
  • Throughput: 100K alerts/hour → 1M+ alerts/hour
  • Peak load handling: Struggled at 500K users → Comfortable at 2M+ users

Database Performance:

  • Query time: p95 of 800ms → 120ms
  • Connection pool saturation: Frequent → Zero incidents
  • Read replica lag: 5-10 seconds → <1 second

System Reliability:

  • Uptime: 99.5% → 99.95%
  • Mean time to recovery: 2 hours → 10 minutes
  • On-call incidents: 4-6 per week → <1 per week

Cost Efficiency

AWS spend reduction:

  • Overall infrastructure cost: $40K/month → $28K/month (30% reduction)
  • Cost per active user: $0.08 → $0.014 (82% reduction)
  • Achieved this while handling 4x more traffic

How:

  • Right-sized EC2 instances (using Compute Optimizer)
  • Moved to ARM-based Graviton instances where possible
  • Consolidated redundant services
  • Improved cache hit rates to reduce database queries

Business Impact

Product velocity:

  • Deployment frequency: weekly → multiple times daily
  • Engineering capacity: 40% freed up for feature development
  • Confidence to onboard enterprise customers

Fundraising success:

  • Technical due diligence passed with flying colors
  • Demonstrated scalability to 10M users
  • Closed Series B ($20M) with solid tech foundation

Customer satisfaction:

  • NPS increased from 65 → 78
  • Customer-reported issues dropped 70%
  • Support tickets related to “late alerts” effectively eliminated

Key Lessons

1. Measure Everything Before Optimizing

We spent the first 2 weeks just understanding the system. The N+1 query fix delivered more value than any architectural change would have initially.

Takeaway: Instrument and profile before refactoring.

2. Decouple for Independent Scaling

The monolith forced everything to scale together. Splitting flight scanning from alert sending let each scale based on its actual load.

Takeaway: Identify bounded contexts and separate them.

3. Strangler Fig > Big Bang Rewrite

We could have rebuilt the entire platform in a new language. Instead, we extracted pieces incrementally, reducing risk.

Takeaway: Migrate in small, reversible steps with feature flags.

4. Cache Aggressively, Invalidate Smartly

Going’s workload is read-heavy (millions of price checks, far fewer writes). Caching user preferences and flight data reduced database load by 70%.

Takeaway: Identify high-read, low-write data and cache it.

5. Observability Is Non-Negotiable

You can’t debug what you can’t see. OpenTelemetry + Grafana gave the team visibility they never had before.

Takeaway: Build observability in from day one of any migration.

The Transition

As the platform stabilized and the team grew (8 → 18 engineers), my role shifted from hands-on implementation to strategic oversight:

  • Months 1-4: Heavy hands-on involvement: architecture design, code reviews, migrations
  • Months 5-8: Engineering mentorship, incident response, optimization
  • Months 9-12: Quarterly check-ins, strategic technical advice

Today, Going.com has a strong engineering team and a scalable platform. They’re ready for the next phase of growth.

Conclusion

Going.com’s transformation wasn’t just about technology—it was about building a foundation for the business. The architectural improvements enabled them to:

  • Serve millions of users reliably
  • Reduce costs while scaling
  • Move fast without breaking things
  • Close funding with technical confidence
  • Build a sustainable competitive advantage

If you’re facing similar scaling challenges—whether it’s performance, cost, reliability, or team velocity—you don’t have to figure it out alone.

Let’s talk about your specific situation and how fractional CTO services can help you scale successfully.


Note: All metrics shared with Going.com’s permission. Some technical specifics omitted for confidentiality.

Need Expert Technology Leadership?

If the challenges discussed in this article resonate with you, let's talk. I help startups navigate complex technology decisions, scale their teams, and build products that last.

Ivan Smirnov