// engineering

USSD Session Architecture: Building for Statelessness

The SS7 network keeps no application state between screens. Every user interaction arrives at your server as a fresh HTTP POST carrying nothing beyond a session ID, the dialled code, the phone number, and the raw input so far. Here's how production USSD systems manage state, handle timeouts, and guarantee transaction integrity.

Why USSD Is Stateless

The SS7 network maintains a dialogue identifier (the TCAP transaction) but stores nothing about your application state. When a user selects option "1" on a menu, your server receives a new HTTP POST with the user's cumulative input. The network doesn't know which menu the user is on, what they've selected previously, or what screen to show next.

The aggregator's webhook payload looks like this:

{
  "sessionId": "ATUid_a1b2c3d4e5f6",
  "serviceCode": "*644#",
  "phoneNumber": "+254712345678",
  "text": ""
}

On first dial, text is empty. After the user selects option 1, the next POST contains "text": "1". After they then select option 2, it becomes "text": "1*2". Each asterisk-delimited segment represents one user interaction.

Approach 1: String Concatenation

The simplest approach parses the text field to determine navigation depth:

text = ""         → Show main menu
text = "1"        → Show sub-menu for option 1
text = "1*2"      → Show sub-menu for option 1 > option 2
text = "1*2*500"  → Process: option 1, sub-option 2, amount 500

Split by asterisk. Count segments. Route accordingly. This works for simple menus but breaks on complex journeys:

  • Invalid input pollution: If the user enters "abc" on a numeric-only screen, the string becomes "1*abc". Every subsequent routing decision is based on a corrupted path
  • No backward navigation: The text string only appends. There's no way to represent "go back" without special handling
  • Branching explosion: A journey with 5 levels and 4 options per level has 4^5 = 1,024 distinct full-length paths to match, before counting the patterns for every intermediate screen
  • Dynamic content: When menu options come from an API (e.g., live match listings), you can't hardcode the expected text patterns at each level
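To make the split-count-route idea concrete, here is a minimal sketch. The menu wording and the `route` function are invented for illustration; only the `text` format comes from the payload above. Note how the free-form amount has to be validated by hand at the exact depth where it appears.

```python
def route(text: str) -> str:
    """Map the cumulative USSD text to a response by counting segments."""
    # An empty string means first dial; "".split("*") would yield [""].
    segments = text.split("*") if text else []

    if len(segments) == 0:
        return "CON Main menu\n1. Send money\n2. Check balance"
    if segments[0] == "2":
        return "END Balance lookup omitted in this sketch"
    if segments[0] != "1":
        return "END Invalid option"
    if len(segments) == 1:
        return "CON Send money\n1. To bank\n2. To mobile"
    if len(segments) == 2:
        return "CON Enter amount"

    # Depth 3: the third segment is free-form input and must be validated.
    amount = segments[2]
    if not amount.isdigit():
        return "END Invalid amount"
    return f"END Sending {amount}"
```

Even at three levels the router is already a chain of depth checks, which is the scaling problem the list above describes.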

Approach 2: Redis-Backed Session State

Production USSD platforms decouple state from the text string. The server maintains its own session store, using the sessionId as the cache key.

Why Redis? Relational databases are too slow. You have a 10-second budget to receive the webhook, load state, process logic, call external APIs, save state, and return the response. Redis delivers sub-millisecond reads and writes. A PostgreSQL query under load takes 5-50ms — acceptable for web apps, dangerous for USSD.

The session object:

{
  "menu_level": "match_selection",
  "temp_data": {
    "sport": "football",
    "match_id": "4521",
    "market": "1X2",
    "selection": null,
    "amount": null
  },
  "last_activity": 1713012345
}

Key design decisions:

  • TTL: Set to 180 seconds, matching the maximum USSD session duration. Redis automatically expires stale sessions. No manual cleanup required
  • Key format: ussd:session:{sessionId} for active sessions. The sessionId is unique per USSD dialogue
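  • Atomic operations: Use optimistic Redis transactions (WATCH plus MULTI/EXEC) or a Lua script to prevent race conditions when updating state. MULTI/EXEC on its own does not protect a read-modify-write cycle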
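A minimal sketch of the load/save cycle, assuming a client that exposes `get(key)` and `set(key, value, ex=seconds)` the way redis-py does. `InMemoryStore` is a hypothetical stand-in so the example runs without a Redis server, and the helper names are illustrative. Atomic read-modify-write updates are left out of this sketch.

```python
import json
import time
from typing import Optional

SESSION_TTL_SECONDS = 180  # matches the maximum session duration above


class InMemoryStore:
    """Stand-in exposing only the get/set subset used below."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)
        return True

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # emulate TTL expiry on read
            return None
        return value


def session_key(session_id: str) -> str:
    return f"ussd:session:{session_id}"


def save_session(store, session_id: str, state: dict) -> None:
    # Stamp the activity time and refresh the TTL on every write.
    state = {**state, "last_activity": int(time.time())}
    store.set(session_key(session_id), json.dumps(state), ex=SESSION_TTL_SECONDS)


def load_session(store, session_id: str) -> Optional[dict]:
    raw = store.get(session_key(session_id))
    return json.loads(raw) if raw is not None else None
```

With a real client you would pass the Redis connection in place of `InMemoryStore()`; `json.loads` accepts the bytes that redis-py returns by default.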

Approach 3: Finite State Machine (FSM) Routing

For complex journeys, model the USSD flow as a state machine. Each screen is a node. Each user input is a transition.

  • Selection nodes: Display a menu with numbered options. User input determines the next state
  • Input nodes: Collect a variable (amount, match ID, PIN). Validate input. Store in session. Transition to next state
  • API nodes: Call an external service (odds feed, balance check). Display results. Wait for user input or terminate
  • Terminal nodes: Display a confirmation message and end the session (END response)

A resolver class traverses the state machine: load current state from Redis, determine node type, process user input, validate, transition to next state, render the screen, return CON or END.

FSM routing is what separates prototype USSD code from production USSD engines. The journey is defined as configuration (which node connects to which), not as application code (giant if/else trees). Adding a new screen means adding a node definition, not rewriting the router.
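As a rough illustration of journey-as-configuration, the sketch below defines three nodes and a single `step` function that interprets them. The `Node` fields, the `FLOW` table, and the screen text are all hypothetical; API nodes are omitted to keep it short, and the session is a plain dict rather than a Redis record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    kind: str                                   # "selection", "input", or "terminal"
    prompt: str
    transitions: Dict[str, str] = field(default_factory=dict)  # selection: input -> node
    next_node: Optional[str] = None             # input: node after valid input
    store_as: Optional[str] = None              # input: key in temp_data
    validate: Optional[Callable[[str], bool]] = None


# The journey lives in data. Adding a screen means adding an entry here.
FLOW = {
    "main": Node("selection", "Main menu\n1. Place bet", transitions={"1": "stake"}),
    "stake": Node("input", "Enter stake", next_node="done",
                  store_as="amount", validate=str.isdigit),
    "done": Node("terminal", "Bet placed"),
}


def render(name: str) -> str:
    node = FLOW[name]
    prefix = "END" if node.kind == "terminal" else "CON"
    return f"{prefix} {node.prompt}"


def step(session: dict, user_input: Optional[str]) -> str:
    """Apply one user input to the session and return the screen to show."""
    current = session.setdefault("menu_level", "main")
    session.setdefault("temp_data", {})
    if user_input is None:                      # first dial: nothing to process
        return render(current)

    node = FLOW[current]
    target = None
    if node.kind == "selection":
        target = node.transitions.get(user_input)
    elif node.kind == "input":
        if node.validate is None or node.validate(user_input):
            session["temp_data"][node.store_as] = user_input
            target = node.next_node

    if target is None:                          # invalid input: stay on this screen
        return render(current)
    session["menu_level"] = target
    return render(target)
```

Because invalid input leaves `menu_level` untouched, a bad entry re-shows the same screen instead of corrupting the path, which is the failure mode of the string-parsing approach.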

Long Codes: Skipping the Menu

Power users don't want to navigate 4 screens to send money. Long codes let them encode the entire transaction in the initial dial string:

*737*1*500*08031234567#

Parsed: shortcode *737#, transfer type 1, amount 500, destination 08031234567. The USSD engine extracts all parameters from the initial dial, skips 3 menu screens, and jumps directly to the confirmation screen.

In time-sliced markets like Uganda, long codes reduce session duration from 30+ seconds to under 10. That's the difference between 2 billing windows and 1. Half the session cost for power users.

Implementation: parse the serviceCode field with a regex or structured parser. Extract embedded parameters. Pre-populate the session state. Skip to the appropriate FSM node.
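A minimal regex-based sketch for the dial string shown above. The four-field layout matches that one example only; the `confirm` node name and the shape of the returned dict are assumptions made for illustration.

```python
import re
from typing import Optional

# *<shortcode>*<transfer type>*<amount>*<destination>#
LONG_CODE = re.compile(
    r"^\*(?P<short>\d+)\*(?P<type>\d+)\*(?P<amount>\d+)\*(?P<dest>\d+)#$"
)


def parse_long_code(service_code: str) -> Optional[dict]:
    """Return pre-populated session state, or None for a plain shortcode."""
    m = LONG_CODE.match(service_code)
    if m is None:
        return None
    return {
        "shortcode": f"*{m['short']}#",
        "menu_level": "confirm",          # hypothetical FSM node to jump to
        "temp_data": {
            "transfer_type": m["type"],
            "amount": int(m["amount"]),
            # Keep the destination as a string so the leading zero survives.
            "destination": m["dest"],
        },
    }
```

A `None` result means the user dialled the bare shortcode, so the normal main menu applies.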

Session Resumption

USSD sessions drop: network congestion, gateway timeouts, a user accidentally pressing "End Call". The transaction is left half-complete. The user dials back in and gets the main menu, with all context lost.

The fix: decouple state from the ephemeral sessionId. Bind persistent state to the phoneNumber (MSISDN) instead.

  • On session start, check Redis for a stale incomplete session keyed by MSISDN
  • If found, display: "CON Resume your bet?\n1. Yes\n2. Start Over"
  • If the user selects "Yes", load the saved state and continue from where they dropped
  • If "Start Over", clear the stale state and show the main menu

Use a separate Redis key: ussd:resume:{phoneNumber} with a longer TTL (300-600 seconds). This survives across multiple session attempts while the active session key expires with the dialogue.
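The resume flow can be sketched as follows. `DictStore` is a hypothetical stand-in for a Redis client (it ignores expiry), and the function names and main-menu text are illustrative. The resume prompt and key format are taken from the text above.

```python
import json

RESUME_TTL_SECONDS = 600  # upper end of the 300-600 second range


class DictStore:
    """Stand-in for a Redis client; accepts but ignores the TTL."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ex=None):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def resume_key(phone_number: str) -> str:
    return f"ussd:resume:{phone_number}"


def checkpoint(store, phone_number: str, state: dict) -> None:
    """Save progress under the MSISDN so it outlives the sessionId."""
    store.set(resume_key(phone_number), json.dumps(state), ex=RESUME_TTL_SECONDS)


def on_session_start(store, phone_number: str):
    """Return (screen, pending_state); pending_state is None for a fresh start."""
    raw = store.get(resume_key(phone_number))
    if raw is None:
        return "CON Main menu\n1. Place bet", None
    return "CON Resume your bet?\n1. Yes\n2. Start Over", json.loads(raw)


def on_resume_choice(store, phone_number: str, choice: str, pending: dict) -> dict:
    """Return the state the new session should continue from."""
    if choice == "1":
        return pending
    store.delete(resume_key(phone_number))      # "Start Over" clears stale state
    return {"menu_level": "main", "temp_data": {}}
```

In practice `checkpoint` would run after each successful transition, and the resume key would be deleted once the transaction reaches a terminal node.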

The 4-5 Click Rule

Any transaction must complete within 4-5 user interactions. This isn't a UX preference — it's a protocol constraint:

  • 180-second total session: At 10-15 seconds per screen (user reads, decides, types, submits), 5 screens = 50-75 seconds. Safe margin
  • Abandonment curve: Completion rates drop ~15% per additional screen after the 4th interaction
  • Time-sliced billing: In Uganda, every additional screen potentially triggers another 20-second billing window
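The arithmetic behind these numbers is simple enough to check in a few lines. This sketch assumes a billing window is charged for every started 20-second block, which is an interpretation rather than a documented carrier rule; the function names are illustrative.

```python
import math

BILLING_WINDOW_SECONDS = 20      # time-sliced billing window from the text
SECONDS_PER_SCREEN = (10, 15)    # read, decide, type, submit


def billing_windows(session_seconds: float) -> int:
    """Number of started billing windows for a session of this length."""
    return math.ceil(session_seconds / BILLING_WINDOW_SECONDS)


def journey_seconds(screens: int) -> tuple:
    """Best-case and worst-case duration for a journey of N screens."""
    low, high = SECONDS_PER_SCREEN
    return screens * low, screens * high
```

For example, `journey_seconds(5)` gives `(50, 75)`, well inside the 180-second limit, while `billing_windows(35)` is 2 and `billing_windows(9)` is 1.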

A betting journey: (1) Main menu → (2) Match selection → (3) Market/odds → (4) Enter stake → (5) Confirm bet. Five clicks. Done.

Pagination

When you have more items than fit on one screen (match listings, transaction history), you need pagination. But USSD pagination is application-level, not protocol-level.

  • Maximum 5 items per screen: Each item + number + navigation = ~30 chars. 5 items + header + nav ≈ 155 chars. Tight but works
  • Navigation convention: 98. Next / 00. Back — these are industry-standard across African USSD services
  • Why not let the gateway paginate? Carrier gateways (especially Safaricom) inject their own pagination when you exceed the character limit. This overrides your navigation options and confuses users. Always paginate yourself before the gateway does

Store the current page offset in the session state. On "98", increment the offset and re-render with the next batch. On "00", decrement and re-render the previous batch.
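A small sketch of offset-based paging with the 98/00 convention. The function names are illustrative, and items are renumbered 1 to 5 on every page, so the selected item is `items[offset + choice - 1]`.

```python
PAGE_SIZE = 5


def render_page(items: list, offset: int) -> str:
    """Render one page of items plus whichever navigation options apply."""
    page = items[offset:offset + PAGE_SIZE]
    lines = [f"{i}. {item}" for i, item in enumerate(page, start=1)]
    if offset + PAGE_SIZE < len(items):
        lines.append("98. Next")
    if offset > 0:
        lines.append("00. Back")
    return "CON " + "\n".join(lines)


def next_offset(items: list, offset: int, user_input: str) -> int:
    """Return the new page offset; unchanged if the input is not navigation."""
    if user_input == "98" and offset + PAGE_SIZE < len(items):
        return offset + PAGE_SIZE
    if user_input == "00":
        return max(0, offset - PAGE_SIZE)   # never page back past the start
    return offset
```

Clamping in both directions matters because a user can type "98" on the last page even though the option is not displayed.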

Concurrency and Idempotency

USSD has a race condition problem. If a user taps "Confirm" and the network retries the request (gateway timeout, duplicate delivery), your server processes the same bet placement or money transfer twice.

  • Distributed Redis locking: Before processing a financial transaction, acquire a lock on ussd:lock:{phoneNumber} with a short TTL (5-10 seconds). If the lock exists, reject the duplicate request
  • Idempotency keys: Generate a unique transaction ID at the confirmation step. Store it in the session. If the same transaction ID arrives twice, return the cached result instead of re-processing
  • Atomic database operations: Use database transactions with unique constraints. A duplicate bet placement fails at the database level even if the application layer doesn't catch it
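The first two defences can be combined in one confirmation handler, sketched below. It assumes a store whose `set` accepts `nx` and `ex` the way redis-py's does (returning a falsy value when the key already exists); `DictStore` is a hypothetical stand-in, and the `ussd:result:{txn_id}` key and its 600-second TTL are invented for illustration.

```python
class DictStore:
    """Stand-in for a Redis client; honours nx, ignores expiry."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ex=None, nx=False):
        if nx and key in self.data:
            return None                  # key exists: SET NX fails
        self.data[key] = value
        return True

    def delete(self, key):
        self.data.pop(key, None)


def confirm_transaction(store, phone_number: str, txn_id: str, execute) -> str:
    """Run execute() at most once per txn_id and return the USSD response."""
    result_key = f"ussd:result:{txn_id}"
    cached = store.get(result_key)
    if cached is not None:
        return cached                    # duplicate delivery: replay the answer

    lock_key = f"ussd:lock:{phone_number}"
    if not store.set(lock_key, txn_id, nx=True, ex=10):
        return "END Your request is already being processed"

    try:
        result = execute()
        store.set(result_key, result, ex=600)
        return result
    finally:
        store.delete(lock_key)
```

Two caveats for a real deployment: a client without `decode_responses=True` returns the cached result as bytes, and a lock whose TTL has already expired should be released only by its owner, typically via a compare-and-delete script.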

The Fire-and-Forget Pattern

Some operations take too long for the 10-second response window: calling an external odds API, processing a payment, running a risk check. The fire-and-forget pattern handles this:

  1. Return an immediate CON response: "CON Loading odds...\n1. Continue"
  2. Spawn an async background task that calls the slow API
  3. Cache the result in Redis under a predictable key
  4. When the user presses "1", the next webhook handler reads the cached result and renders the screen

If the background task hasn't completed when the user presses "1", show a retry: "CON Still loading...\n1. Try again". This is preferable to a gateway timeout that silently kills the session.
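Here is one way the four steps and the retry screen might look, using a thread and a module-level dict as a stand-in for Redis. The `ussd:odds:{match_id}` key, the function names, and the "1. Bet" option are assumptions; a production system would more likely hand the work to a task queue.

```python
import threading

cache = {}                       # stand-in for Redis
cache_lock = threading.Lock()


def odds_key(match_id: str) -> str:
    return f"ussd:odds:{match_id}"   # predictable key the next request can find


def start_odds_fetch(match_id: str, fetch):
    """Kick off the slow call in the background and answer immediately."""
    def worker():
        result = fetch(match_id)     # the slow external call
        with cache_lock:
            cache[odds_key(match_id)] = result

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return "CON Loading odds...\n1. Continue", thread


def on_continue(match_id: str) -> str:
    """Handle the user pressing 1: show the result or offer a retry."""
    with cache_lock:
        result = cache.get(odds_key(match_id))
    if result is None:
        return "CON Still loading...\n1. Try again"
    return f"CON {result}\n1. Bet"
```

The webhook handler never waits on the thread, so its response time stays independent of the external API.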

Latency Monitoring

USSD latency monitoring requires different instrumentation than web applications. The metrics that matter:

  • P95/P99 response time: Must stay under 10 seconds. A P99 of 12 seconds means at least 1 in 100 requests times out at the gateway
  • Session completion rate: Percentage of sessions that reach a terminal (END) node vs abandoned/timed out. Target: >85%
  • Timeout rate by carrier: Safaricom's 30-second idle timeout catches more users than Airtel's 60-second window. Monitor per carrier
  • Redis latency: If your session store exceeds 5ms P99, you're losing budget that should go to business logic and API calls

Use histograms with geometrically spaced buckets (not averages) to track response times. Averages hide the tail latency that causes session drops.
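To show why buckets beat averages, here is a minimal histogram whose bucket bounds double at each step. The bounds (0.125 s to 16 s) and the class name are arbitrary choices for this sketch; metrics libraries offer the same idea with configurable buckets.

```python
import bisect

# Upper bounds in seconds, doubling each step: 0.125, 0.25, ... 16.0
BOUNDS = [0.125 * 2 ** i for i in range(8)]


class LatencyHistogram:
    def __init__(self):
        # One counter per bound plus a final overflow bucket.
        self.counts = [0] * (len(BOUNDS) + 1)
        self.total = 0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bound >= seconds.
        self.counts[bisect.bisect_left(BOUNDS, seconds)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        if self.total == 0:
            raise ValueError("no observations")
        target = q * self.total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return BOUNDS[i] if i < len(BOUNDS) else float("inf")
        return float("inf")
```

Feed it 98 requests at 0.3 s and 2 at 12 s: the mean is about 0.53 s and looks healthy, while the P99 lands in the 16-second bucket, past the 10-second gateway limit.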

// the engineering reality

USSD looks simple from the outside: text menus on a phone. The engineering underneath is a distributed systems problem: sub-second state management, cross-carrier timeout handling, idempotent financial transactions, and graceful degradation under network congestion. Every pattern described here exists because someone's production system failed without it.

// session management, solved

The USSD Fabric handles all session state persistence, timeout recovery, FSM routing, pagination, idempotency, and carrier-specific optimisation. Your code talks to one API. We handle the distributed systems underneath.
