Designing a Fault-Tolerant Multi-Dialing System for AI-Driven Voice Calls

Designing and implementing a low-latency, fault-tolerant multi-dialing system that orchestrates AI and human-assisted outbound and inbound calls at scale.

Role: Founding Engineer
Stack: Node.js, Redis, PostgreSQL, Vonage, Deepgram, GPT, AssemblyAI, Docker, EC2
Timeline: Founding Phase
Company: AI Calls

Context & Problem

AI Calls was built to automate outbound cold calls and handle inbound customer calls using AI. Users could upload leads via Excel or connect their CRM, configure AI agents, and launch calling campaigns where multiple phone numbers dialed leads in parallel. While calling a single number was straightforward, coordinating multiple numbers, campaigns, and AI conversations in real time introduced significant system design challenges.

The core challenge was multi-dialing orchestration under real-world constraints. A single user could assign up to four phone numbers to a campaign, and each lead could have multiple phone numbers. The system had to ensure that:

  • A lead is never called more than once at the same time
  • A phone number never dials a lead already being contacted by another number
  • When one call connects, all parallel calls are immediately cancelled
  • Campaigns can be paused and resumed without losing progress
  • Server crashes or restarts do not cause duplicate calls or infinite loops
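The first two invariants boil down to an atomic claim on each lead. A minimal in-memory sketch of that claim logic follows; the production system would hold this state in Redis (e.g. an atomic `SET NX` per lead) so it survives across processes, and the class and method names here are illustrative:

```javascript
// Minimal in-memory sketch of the "never double-dial a lead" invariant.
// A Set stands in for the shared store so the logic is self-contained.
class LeadLock {
  constructor() {
    this.active = new Set(); // lead IDs currently being dialed
  }

  // Atomically claim a lead for dialing; returns false if another
  // worker (phone number) already holds it.
  tryClaim(leadId) {
    if (this.active.has(leadId)) return false;
    this.active.add(leadId);
    return true;
  }

  // Release the lead once all of its parallel call attempts finish.
  release(leadId) {
    this.active.delete(leadId);
  }
}
```

Because every dial attempt must pass through `tryClaim` before touching the telephony provider, two numbers can never race onto the same lead.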

At peak usage, campaigns ran in parallel across accounts, handling thousands of calls per day. Any inconsistency in state management could result in duplicate calls, dropped leads, or stalled campaigns, making correctness and fault tolerance more important than raw throughput.

Key Engineering Decisions

Worker–Task Based Dialing Algorithm

Algorithm Design · Concurrency

I modeled each phone number as a worker and each call attempt as a task. When a campaign started, workers were created for each number and tasks were generated for each lead–number combination. Tasks were assigned in groups so that all parallel calls for a single lead were coordinated, while already-assigned tasks were skipped during scheduling. This ensured no lead or number was ever double-booked.
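The scheduling step above can be sketched as a small grouping function. This is a simplified model, not the production code: field names (`busy`, `state`, `activeLeads`) are illustrative, and persistence is omitted.

```javascript
// Assign the next lead to idle workers as a coordinated group:
// each idle worker (phone number) dials a different task for the
// same lead, and already-assigned tasks or active leads are skipped.
function assignNextGroup(workers, tasks, activeLeads) {
  const idle = workers.filter((w) => w.busy === false);
  if (idle.length === 0) return [];

  // First lead with pending tasks that is not already being dialed.
  const next = tasks.find(
    (t) => t.state === "pending" && !activeLeads.has(t.leadId)
  );
  if (!next) return [];

  // All pending tasks for that lead form one coordinated group.
  const group = tasks.filter(
    (t) => t.leadId === next.leadId && t.state === "pending"
  );

  const assigned = [];
  for (let i = 0; i < Math.min(idle.length, group.length); i++) {
    group[i].state = "assigned";
    idle[i].busy = true;
    assigned.push({ worker: idle[i], task: group[i] });
  }
  activeLeads.add(next.leadId);
  return assigned;
}
```

Grouping per lead is what makes cancellation cheap: when one call in the group connects, the other tasks in the same group are the exact set to cancel.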

State-Driven Call Outcomes and Retries

State Management · Reliability

Every call attempt stored its outcome (human answered, machine detected, unreachable, cancelled, etc.). This enabled flexible retry strategies, accurate campaign progression, and future workflows such as flagging problematic numbers. Once all tasks for a lead completed, workers automatically advanced to the next available lead.
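A retry policy driven by stored outcomes might look like the following sketch. The outcome strings and the `tries` field are illustrative stand-ins for whatever the attempt records actually contained:

```javascript
// Outcome-driven retry decision: a terminal outcome completes the
// lead, a retryable one re-queues it up to a budget, anything else
// marks the number/lead combination exhausted.
const RETRYABLE = new Set(["unreachable", "machine_detected"]);

function nextAction(attempt, maxRetries) {
  if (attempt.outcome === "human_answered") return "complete";
  if (attempt.outcome === "cancelled") return "complete"; // a sibling call connected
  if (RETRYABLE.has(attempt.outcome) && attempt.tries < maxRetries) return "retry";
  return "exhausted";
}
```

Keeping this as pure data (outcome in, action out) is also what enables the later workflows mentioned above, such as flagging numbers that repeatedly produce bad outcomes.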

Fault Tolerance via Persistent State and Recovery

Fault Tolerance · Resilience

Campaign state was persisted using a combination of database storage and Redis. In case of server crashes or restarts, cron-based recovery jobs resumed campaigns from the last known safe state, preventing infinite loops or duplicate calls.
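A recovery pass of this kind can be sketched as two pure functions: detect stalled campaigns, then resume each from its persisted cursor. Field names (`lastHeartbeat`, `cursor`, `resumeFrom`) are assumptions for illustration; the real system read this state from PostgreSQL and Redis.

```javascript
// Cron-style recovery: any campaign still marked "running" whose
// heartbeat is stale is assumed to have lost its process.
function findStalledCampaigns(campaigns, now, staleMs) {
  return campaigns.filter(
    (c) => c.status === "running" && now - c.lastHeartbeat > staleMs
  );
}

// Resume from the last persisted cursor rather than from scratch,
// so completed leads are never re-dialed and no loop restarts at zero.
function resume(campaign) {
  return { ...campaign, status: "running", resumeFrom: campaign.cursor };
}
```

Resuming from a cursor instead of replaying the whole campaign is what rules out both duplicate calls and infinite restart loops after a crash.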

Real-Time AI Call Pipeline

Real-Time Systems · Streaming

Because no end-to-end AI voice platforms were available at the time, I designed a custom streaming pipeline: Vonage streamed audio (640-byte buffers every 25ms) to Deepgram for speech-to-text, GPT generated responses, AssemblyAI handled text-to-speech, and audio was streamed back to Vonage in near real time.
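The flow of that pipeline can be shown with placeholder stage functions. This is only the shape of the wiring: the real services (Vonage, Deepgram, GPT, AssemblyAI) speak websockets and HTTP APIs, which are abstracted away here as injected async functions.

```javascript
// Illustrative pipeline wiring: telephony audio frames in, STT, LLM,
// TTS, synthesized audio back out. Each stage is awaited in turn so
// back-pressure propagates naturally through the loop.
async function handleCallAudio(audioFrames, stt, llm, tts, sendToCaller) {
  for await (const frame of audioFrames) {   // e.g. 640-byte PCM every 25 ms
    const transcript = await stt(frame);     // speech-to-text
    if (!transcript) continue;               // no final utterance yet
    const reply = await llm(transcript);     // generate a response
    const speech = await tts(reply);         // synthesize audio
    await sendToCaller(speech);              // stream back in near real time
  }
}
```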

Low-Latency Optimization via Dedicated AI Service

Performance · Latency

To reduce end-to-end latency to ~2 seconds, I introduced a dedicated microservice (‘brain’) responsible solely for AI audio streams. This service was resource-prioritized, used optimized prompts and models, and carefully managed audio buffers to avoid bottlenecks.
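One concrete piece of that buffer management can be sketched: synthesized speech arrives in arbitrary-sized chunks, but the telephony leg expects fixed 640-byte frames, so outbound audio has to be re-framed with any remainder carried into the next batch. This helper is a sketch under that assumption, not the service's actual code.

```javascript
// Re-frame arbitrary TTS chunks into fixed-size telephony frames
// (640 bytes per 25 ms frame). The trailing partial frame is returned
// as a remainder to be prepended to the next batch of chunks.
function reframe(chunks, frameSize = 640) {
  const all = Buffer.concat(chunks);
  const frames = [];
  for (let off = 0; off + frameSize <= all.length; off += frameSize) {
    frames.push(all.subarray(off, off + frameSize));
  }
  const remainder = all.subarray(frames.length * frameSize);
  return { frames, remainder };
}
```

Carrying the remainder forward instead of padding it keeps frame boundaries aligned, which avoids audible glitches when the stream resumes.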


Trade-offs & Constraints

The system introduced significant complexity: coordinating workers and tasks, maintaining consistent state across failures, and managing a real-time AI pipeline. There was also operational overhead in running multiple services and tuning AI latency. These trade-offs were accepted to guarantee correctness, avoid duplicate calls, and deliver a reliable experience under concurrent campaign execution.

Business Impact

Call Volume: 2k–3k calls/day

Handled concurrent outbound campaigns with up to four parallel calls per account

Latency: ~2 seconds

End-to-end AI response latency achieved through streaming and service isolation

Revenue: €1,000 MRR

Reached production usage with multiple contracts before acquisition

What I'd improve at 10× scale

At higher scale, I would introduce distributed worker coordination using a queue-based scheduler, stronger idempotency guarantees around call initiation, and regional deployment of the AI streaming service to further reduce latency. I would also explore newer end-to-end AI voice platforms to simplify parts of the pipeline while preserving orchestration control.
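The idempotency guarantee mentioned here could take the shape of a deduplicating guard around call initiation. A hypothetical sketch: in production the `seen` set would be an atomic insert into Redis or PostgreSQL rather than process memory, and `startCall` stands in for the provider API.

```javascript
// Idempotency guard: the same (campaign, lead, phone) attempt key can
// never trigger two provider calls, even if two schedulers race on it.
function makeDialer(startCall) {
  const seen = new Set();
  return async function dialOnce(campaignId, leadId, phone) {
    const key = `${campaignId}:${leadId}:${phone}`;
    if (seen.has(key)) return null; // duplicate request: no-op
    seen.add(key);
    return startCall(key);
  };
}
```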