Building the Inference Engine
Behind the scenes of how we architected the low-latency backend for competitive cognitive gamification.
Most game backends are built for throughput. Ours had to be built for time. When your product's core promise is sub-second cognitive challenge delivery, every architectural decision — from database query structure to network routing — becomes a performance question. This is the honest account of how we built it, what broke, and what we learned.
The Problem No Framework Tells You About
When we started building Choreos Labs, the technical brief sounded straightforward: a mobile app with a backend API. Users play cognitive games, scores are recorded, levels adapt, leaderboards update. Standard CRUD. Pick a framework, pick a database, ship it.
Except the problem was not storing data. The problem was time.
A cognitive gamification platform is not like a social feed or a shopping cart. The user is actively competing — against a clock, against their own previous performance, against other players in real time. Every interaction carries a timestamp that is scientifically meaningful. A 40ms delay in delivering the next stimulus is not a UX inconvenience — it is a measurement error that corrupts the cognitive assessment the game is built on.
The backend had to do something most backends never need to do: respect milliseconds as a unit of correctness, not just performance.
That constraint changed everything.
Choosing FastAPI: The Decision That Held Up
The first real decision was the framework. We evaluated three options seriously.
Django felt like the wrong shape — its ORM, its synchronous defaults, and its opinionated structure are excellent for content-heavy applications, but they add abstraction layers between the code and the network that compound latency in ways that are difficult to profile and harder to remove.
Flask was lean enough, but its async support required external assembly. When your entire platform depends on non-blocking I/O, bolting async onto a synchronous core is technical debt you pay on every request.
FastAPI was the right answer for specific, measurable reasons. It is built on Starlette (ASGI) and Pydantic, giving us native async/await throughout the request lifecycle, automatic JSON serialization without a separate marshalling step, and OpenAPI documentation generated from the type signatures we were writing anyway. Startup time is under 500ms in our Docker container. Cold-start latency on the first request after a deployment — the number that matters for serverless, scale-to-zero infrastructure — is negligible.
But the deeper reason was Pydantic. Our game events are strongly typed objects — a CognitiveChallengeEvent has a defined schema, a ReactionRecord has validated millisecond timestamps. FastAPI + Pydantic means the data contract between the Android client and the backend is enforced at the type system level, not discovered at runtime. Schema violations surface in development, not in production sessions.
from uuid import UUID

from pydantic import BaseModel, validator

class ReactionRecord(BaseModel):
    user_id: UUID
    game_id: str
    stimulus_presented_at: int  # Unix ms
    response_recorded_at: int   # Unix ms
    reaction_time_ms: int       # derived, validated
    stimulus_type: StimulusType  # app-defined enum of stimulus categories
    response_correct: bool
    session_id: UUID

    @validator("reaction_time_ms")
    def validate_reaction_time(cls, v, values):
        # Recompute the reaction time from the raw timestamps and reject mismatches
        computed = values["response_recorded_at"] - values["stimulus_presented_at"]
        if abs(v - computed) > 5:  # 5ms tolerance for float precision
            raise ValueError("Reaction time inconsistent with timestamps")
        return v
This validator alone caught three separate client-side timing bugs in development. The backend refused to accept logically inconsistent data before it could corrupt the leaderboard.
The Data Architecture: Two Databases, One Reason
Early in design we made a decision that looked over-engineered in week one and looked prescient in month two: separate hot and cold data into separate databases.
PostgreSQL handles everything durable: user profiles, session records, historical performance data, leaderboard snapshots, subscription state. It is the source of truth. We use it for anything that needs ACID guarantees — anything where losing a row is a product failure, not just a latency spike.
Redis handles everything live: active session state, real-time leaderboard rankings, stimulus queue state, rate-limiting counters, pub/sub for multiplayer event broadcasting. Redis Sorted Sets are natively suited to leaderboards — ZADD and ZRANK run in O(log N), ZRANGE in O(log N + M) for M returned entries, and all of them return results in under 1ms at our current user volumes.
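The read side is equally small. As a rough sketch (assuming a redis.asyncio client and the same key layout as the write path shown below; the helper names are ours for illustration), fetching the top of a session leaderboard and a single player's rank looks like this:
from uuid import UUID
from redis.asyncio import Redis

async def top_scores(redis: Redis, game_id: str, session_id: UUID, n: int = 10):
    # Highest scores first, scores attached: O(log N + n)
    key = f"leaderboard:{game_id}:session:{session_id}"
    return await redis.zrevrange(key, 0, n - 1, withscores=True)

async def player_rank(redis: Redis, game_id: str, session_id: UUID, user_id: UUID):
    # Position counted from the top of the board: O(log N)
    key = f"leaderboard:{game_id}:session:{session_id}"
    rank = await redis.zrevrank(key, str(user_id))
    return None if rank is None else rank + 1  # 1-based for display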
The specific decision that mattered most was never asking PostgreSQL for a leaderboard during an active game session. A SQL query with JOINs across a sessions table and a users table with ORDER BY and LIMIT, under concurrent write load, can spike to 80–120ms on a cold cache. In a cognitive game where the stimulus loop runs on a 2000ms cycle, that is 4–6% of the entire session window consumed by one database read.
Redis returns the same data in under 1ms. We write to Redis immediately on every score event via a background task, and batch-sync to PostgreSQL every 30 seconds. The leaderboard the user sees is real-time. The data that persists is eventually consistent by design, with the reconciliation happening outside the hot path.
from fastapi import BackgroundTasks
from redis.asyncio import Redis
from sqlalchemy.ext.asyncio import AsyncSession

async def record_game_event(
    event: ReactionRecord, background_tasks: BackgroundTasks, db: AsyncSession, redis: Redis
):
    # Critical path: Redis write only (~0.8ms)
    await redis.zadd(
        f"leaderboard:{event.game_id}:session:{event.session_id}",
        {str(event.user_id): compute_score(event)},
    )
    # Background task: PostgreSQL persistence (non-blocking to client)
    background_tasks.add_task(persist_event_to_postgres, event, db)
    return {"status": "recorded", "latency_ms": event.reaction_time_ms}
The client gets its acknowledgement in under 3ms. PostgreSQL writes happen asynchronously. The user never waits for disk.
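The 30-second reconciliation is a long-lived task that runs outside the request path. A minimal sketch of the idea, assuming a redis client created with decode_responses=True and an illustrative snapshot table (a real job would also need deduplication and retry handling):
import asyncio
import asyncpg
from redis.asyncio import Redis

async def sync_leaderboards(redis: Redis, pool: asyncpg.Pool, interval_s: int = 30):
    # Long-lived task: copy live Redis rankings into durable PostgreSQL snapshots
    while True:
        await asyncio.sleep(interval_s)
        async for key in redis.scan_iter(match="leaderboard:*"):
            rows = await redis.zrange(key, 0, -1, withscores=True)
            async with pool.acquire() as conn:
                await conn.executemany(
                    # Illustrative table, not the production schema
                    "INSERT INTO leaderboard_snapshots (board_key, user_id, score) VALUES ($1, $2, $3)",
                    [(key, member, score) for member, score in rows],
                )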
The Stimulus Delivery Problem
The hardest engineering problem we faced was not storage or computation. It was stimulus delivery latency variance — the 95th percentile response time was acceptable, but the 99th percentile was high enough to create measurable outliers in user reaction time data.
A user with a genuine 180ms reaction time should not record 220ms because the server took an extra 40ms on that specific request. If our cognitive measurement platform cannot distinguish between biological latency and infrastructure latency, the entire scientific premise of the product collapses.
The investigation led us through three layers of variance:
Layer 1 — Middleware overhead. Our initial FastAPI setup included logging middleware, authentication middleware, and CORS middleware running on every request. Even lightweight middleware adds 2–5ms per layer. We restructured: authentication is cached in Redis (JWT validation against a stored token fingerprint), CORS is handled at the reverse proxy level, and logging is async fire-and-forget. Stimulus delivery endpoints run with minimal middleware stacks.
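A sketch of what the cached authentication check looks like, with get_redis and verify_jwt standing in for the app's real dependency and JWT verification helpers (the names, key layout, and 15-minute cache window are illustrative, and the client is assumed to use decode_responses=True):
import hashlib
from fastapi import Depends
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from redis.asyncio import Redis

bearer = HTTPBearer()

async def fast_auth(
    creds: HTTPAuthorizationCredentials = Depends(bearer),
    redis: Redis = Depends(get_redis),  # hypothetical app dependency returning the shared client
) -> str:
    # Cache hit: a token fingerprint maps straight to a user id (sub-millisecond)
    fingerprint = hashlib.sha256(creds.credentials.encode()).hexdigest()
    user_id = await redis.get(f"auth:fp:{fingerprint}")
    if user_id:
        return user_id
    # Cache miss: full JWT signature verification, then cache the result
    user_id = verify_jwt(creds.credentials)  # hypothetical helper; raises on invalid tokens
    await redis.set(f"auth:fp:{fingerprint}", user_id, ex=900)
    return user_id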
Layer 2 — Database connection pool contention. Under concurrent game sessions, connection pool exhaustion was causing requests to queue while waiting for an available database connection. We switched from SQLAlchemy's default synchronous pooling to asyncpg with a dedicated async pool sized to our expected concurrency ceiling, with a fast-fail timeout rather than silent queuing.
# asyncpg pool config for stimulus endpoints
DATABASE_POOL = await asyncpg.create_pool(
    DATABASE_URL,
    min_size=5,
    max_size=20,
    max_inactive_connection_lifetime=300,
    command_timeout=8.0,  # fail fast, never let a slow query block the game loop
)
Layer 3 — JSON serialization on large response objects. Our early stimulus response objects included full user profile data on every game event response — defensive over-fetching from mobile development habits. Trimming the stimulus delivery response to the exact fields the client needed (next stimulus, timestamp, score delta) reduced serialization time from ~4ms to under 0.5ms per response.
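The slimmed response is just a small Pydantic model attached to the endpoint via response_model, so FastAPI serializes exactly these fields and nothing else. The field names here are illustrative, not the production schema:
from pydantic import BaseModel

class StimulusDeliveryResponse(BaseModel):
    # Only the fields the game loop needs; no embedded profile data
    next_stimulus_id: str  # identifier of the next stimulus to present
    issued_at_ms: int      # server timestamp, Unix ms
    score_delta: int       # score change from the previous response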
After these three fixes, p99 stimulus delivery latency dropped from 67ms to under 12ms in load testing.
The Adaptive Difficulty Engine
The product's core intellectual value is not the games themselves — it is the adaptive difficulty system that makes each session feel calibrated to the individual user's current cognitive state. A game that is too easy produces boredom. Too hard produces disengagement. The 1–2% difficulty window above current skill — what Csikszentmihalyi called the "flow channel" — is what the engine is always trying to find.
The difficulty engine runs as a separate service, deliberately decoupled from the game session loop. It consumes a stream of ReactionRecord events via a Redis pub/sub channel and maintains a per-user cognitive state model — a rolling window of performance across five parameters: reaction time, accuracy rate, streak length, session fatigue curve, and performance variance.
class CognitiveStateModel:
    user_id: UUID
    reaction_time_ema: float  # Exponential moving average, α=0.3
    accuracy_rate: float      # Last 20 stimuli
    streak: int
    fatigue_index: float      # Computed from session length + variance trend
    next_difficulty: DifficultyLevel

    def update(self, event: ReactionRecord) -> DifficultyLevel:
        self.reaction_time_ema = (
            0.3 * event.reaction_time_ms + 0.7 * self.reaction_time_ema
        )
        # Difficulty moves up when accuracy > 85% AND reaction time < 0.9x baseline
        # Difficulty moves down when accuracy < 65% OR fatigue_index > 0.7
        return self._compute_next_difficulty()
The EMA (exponential moving average) with α=0.3 was chosen after testing several values. Higher alpha (closer to 1.0) makes difficulty too reactive — a single bad response drops difficulty unnecessarily. Lower alpha (closer to 0.1) makes the system too slow to respond to genuine performance shifts. 0.3 tracks real cognitive state changes while smoothing over single-event noise. This is the same smoothing constant used in financial technical analysis for short-term EMA, applied here to neural performance data.
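Feeding the model is a thin consumer loop over the Redis pub/sub channel. A minimal sketch, assuming a redis.asyncio client, an illustrative channel name, and a hypothetical bootstrap_state helper that seeds a user's baseline before the first update:
from uuid import UUID
from redis.asyncio import Redis

STATE: dict[UUID, CognitiveStateModel] = {}  # per-user state, kept in memory for the sketch

async def consume_events(redis: Redis, channel: str = "reaction-events"):
    pubsub = redis.pubsub()
    await pubsub.subscribe(channel)
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscription confirmations
        event = ReactionRecord.parse_raw(message["data"])
        model = STATE.get(event.user_id)
        if model is None:
            model = bootstrap_state(event.user_id)  # hypothetical helper seeding the baseline EMA
            STATE[event.user_id] = model
        next_level = model.update(event)  # the new DifficultyLevel is pushed back to the session from here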
The LLM integration layer sits on top of this engine. When the cognitive state model identifies a user transitioning into a new difficulty bracket, it calls a prompt-generation service that uses the current cognitive profile to create contextually varied challenge descriptions — preventing the repetition effect that causes perceptual adaptation and reduced engagement in fixed-format cognitive training.
Docker and the Deployment Contract
The entire backend has run in Docker since day one. Not as a production convenience — as a development principle. If it runs in a container locally, it runs identically in production. No "works on my machine." No environment variable archaeology at 1am before a launch.
Our docker-compose.yml runs three services locally:
services:
  api:
    build: ./backend
    ports: ["8000:8000"]
    env_file: .env.local
    depends_on: [postgres, redis]
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  postgres:
    image: postgres:16-alpine
    volumes: ["pgdata:/var/lib/postgresql/data"]
    environment:
      POSTGRES_DB: choreos_dev
      POSTGRES_HOST_AUTH_METHOD: trust  # local development only
  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru

volumes:
  pgdata:
The maxmemory-policy allkeys-lru Redis setting is critical and non-obvious. Without it, Redis under memory pressure stops accepting writes and returns errors instead of evicting. With LRU eviction, the least-recently-used keys are silently dropped — acceptable for leaderboard cache entries and session state, not acceptable for anything that should be durable. This is why the two-database architecture matters: Redis can evict freely because PostgreSQL holds the truth.
Production on GCP uses Cloud Run for the API service — fully managed container execution with automatic scaling, pay-per-request billing at our current scale, and cold-start times under 800ms with our container image. Cloud SQL for PostgreSQL and Cloud Memorystore for Redis complete the stack. The entire infrastructure is reproducible from two Terraform files.
What We Got Wrong (The Honest Part)
No build log is honest without this section.
We underestimated WebSocket complexity. The first multiplayer mode shipped with REST polling — clients requesting leaderboard updates every 500ms. It worked. It also generated 40x the API request volume of our usage projections. We refactored to WebSocket connections for active game sessions, which required rethinking the entire session lifecycle management model. The lesson: if real-time matters to the product experience, design for WebSocket from the start. Retrofitting it is a full rewrite of the session layer.
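For the record, the replacement has a simple shape: a FastAPI WebSocket endpoint holds the session connection open and pushes updates as they are published, instead of being polled. A rough sketch, with the per-session channel name and payload format as assumptions:
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from redis.asyncio import Redis

app = FastAPI()
redis = Redis()  # connection details elided

@app.websocket("/ws/session/{session_id}")
async def session_socket(ws: WebSocket, session_id: str):
    await ws.accept()
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"session:{session_id}:updates")
    try:
        # Push each published update to the client the moment it arrives
        async for message in pubsub.listen():
            if message["type"] == "message":
                await ws.send_text(message["data"].decode())
    except (WebSocketDisconnect, RuntimeError):
        pass  # client closed the connection
    finally:
        await pubsub.unsubscribe()
        await pubsub.close()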
We over-indexed on premature optimization. The stimulus delivery latency work was necessary, but we spent three weeks on it before we had confirmed that users actually experienced the variance as a problem. Some of that time would have been better spent on the adaptive difficulty model, which had more visible impact on engagement. Optimize the hot path, but only after you know which path is actually hot in production.
The LLM prompt service was the wrong layer for challenge generation. Our first implementation called the LLM synchronously on each difficulty transition — adding 600–1200ms to the session event when a level change occurred. We refactored to a prefetching model: the prompt service generates the next 5 challenge variants in the background as soon as the current difficulty level is identified, so challenges are ready before they are needed. The session never waits for generation.
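A sketch of the prefetching shape, using a Redis list as the per-user challenge queue; generate_challenge_text stands in for the actual LLM call, and the key layout is illustrative:
import asyncio
from redis.asyncio import Redis

PREFETCH_DEPTH = 5  # challenge variants kept ready per user

async def prefetch_challenges(redis: Redis, user_id: str, difficulty: str):
    # Fill the queue as soon as a new difficulty bracket is detected
    key = f"challenges:{user_id}:{difficulty}"
    for _ in range(PREFETCH_DEPTH):
        variant = await generate_challenge_text(difficulty)  # hypothetical LLM call
        await redis.rpush(key, variant)

def schedule_prefetch(redis: Redis, user_id: str, difficulty: str):
    # Fire-and-forget: the session loop never awaits this task
    asyncio.create_task(prefetch_challenges(redis, user_id, difficulty))

async def next_challenge(redis: Redis, user_id: str, difficulty: str):
    # O(1) pop inside the game loop; returns None only if prefetch has fallen behind
    return await redis.lpop(f"challenges:{user_id}:{difficulty}")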
The Latency Profile in Production
After all the above changes, here is the actual performance profile under normal session load:
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Stimulus delivery | 4ms | 9ms | 14ms |
| Score recording | 3ms | 7ms | 11ms |
| Leaderboard read (Redis) | 0.8ms | 2ms | 4ms |
| Session initialization | 18ms | 34ms | 55ms |
| Difficulty adaptation | async | async | async |
| LLM challenge prefetch | background | background | background |
The numbers that matter are stimulus delivery and score recording — the two endpoints that run inside the active game loop. Everything else either runs once per session or in the background. The 14ms p99 on stimulus delivery means that in 99 out of 100 requests, the infrastructure contributes less than 14ms to what the client records as a user's reaction time.
For a platform measuring cognitive responses in the 150–350ms range, a 14ms infrastructure ceiling means we are measuring biology, not network topology. That was the goal from day one.
What Comes Next
The current architecture scales comfortably to the user volumes we are targeting in the first 12 months. The hard architectural limits we will hit beyond that are the PostgreSQL write throughput ceiling under sustained concurrent session recording (addressable with read replicas and write sharding) and the Redis memory ceiling under large-scale real-time leaderboard computation (addressable with Redis Cluster or a move to a dedicated time-series store like TimescaleDB for historical data).
The more interesting engineering frontier is not infrastructure scale — it is signal quality. As the platform collects longitudinal cognitive performance data across thousands of users, the adaptive difficulty model can move from per-user heuristics to a population-informed machine learning model that identifies cognitive patterns across demographics, training histories, and session contexts.
The backend was built to be fast. The next version will be built to be smart.
Choreos Labs backend stack: FastAPI · asyncpg · PostgreSQL 16 · Redis 7 · Docker · GCP Cloud Run · Cloud SQL · Cloud Memorystore · Kotlin (Android client)