Working draft — Sancto AI is expanding this with anonymized data from our last three voice deployments.

The three viable paths

  1. Full-stack vendor (Retell, Vapi, Bland, Synthflow). They give you a phone number, a builder, and an LLM behind it. Live in days.
  2. Component vendors (Twilio + Deepgram + OpenAI Realtime). You orchestrate. More control. More code.
  3. Hybrid. Vendor for telephony + STT, custom for LLM logic + tool calls. Our default.

Cost curves (per minute, talk time)

Path                                         Cost / min     Setup time
Retell / Vapi / Bland                        $0.18–$0.32    1–5 days
Twilio + Deepgram + OpenAI Realtime (DIY)    $0.10–$0.18    3–6 weeks
Hybrid (Twilio + your LLM)                   $0.12–$0.22    2–4 weeks

Crossover point: roughly 10,000 minutes/month. Below that, vendors win on TCO. Above that, building wins, sometimes dramatically: at the per-minute rates above, 5,000 minutes/day (about 150,000 minutes/month) costs roughly $27k–$48k/mo on a vendor vs $15k–$27k/mo DIY.
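To make the cost comparison easy to rerun with your own volume, here is a minimal sketch. The per-minute ranges come straight from the table; the function names, the `fixed_monthly_cost` parameter, and the $1,100 figure are assumptions for illustration only, not measured data.

```python
# Per-minute cost ranges in USD, copied from the table above.
RATES = {
    "vendor": (0.18, 0.32),   # Retell / Vapi / Bland
    "diy": (0.10, 0.18),      # Twilio + Deepgram + OpenAI Realtime
    "hybrid": (0.12, 0.22),   # Twilio + your LLM
}


def monthly_cost(path: str, minutes_per_month: float) -> tuple[float, float]:
    """Return the (low, high) monthly usage cost in USD for one path."""
    low, high = RATES[path]
    return (round(low * minutes_per_month, 2), round(high * minutes_per_month, 2))


def break_even_minutes(fixed_monthly_cost: float, vendor_rate: float, diy_rate: float) -> float:
    """Minutes/month at which DIY's fixed overhead is paid back by its lower rate."""
    if vendor_rate <= diy_rate:
        raise ValueError("DIY must be cheaper per minute for a break-even to exist")
    return fixed_monthly_cost / (vendor_rate - diy_rate)


if __name__ == "__main__":
    for path in RATES:
        print(path, monthly_cost(path, 25_000))
    # Hypothetical: $1,100/month of extra engineering and ops cost for DIY,
    # compared at the midpoint rates ($0.25 vs $0.14).
    print(round(break_even_minutes(1_100, 0.25, 0.14)))
```

The usage cost alone always favors DIY; the crossover only appears once you add a fixed overhead for building and running it. The $1,100 input was picked so the result lands near the 10,000-minute mark quoted above, so substitute your own engineering cost before drawing conclusions.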

Where vendors win

  • Speed to first customer call
  • Out-of-box: barge-in, interruption handling, voice variety
  • No telephony expertise required
  • SIP, transfers, IVR fallback — all handled

Where building wins

  • Per-minute cost at volume
  • Custom tool calls (CRM lookup mid-call, calendar booking with custom rules)
  • Data residency (vendors send audio to their own cloud, which your compliance rules may not allow)
  • Multi-language support with consistent quality across all languages

What kills voice projects before launch

  1. Latency. A response delay over 800 ms feels broken. Test in production-like conditions, not on localhost.
  2. Interruption handling. Humans interrupt. Your agent has to stop talking immediately and resume sensibly.
  3. Hallucinated bookings. The model confidently says "Tuesday at 3pm" when the calendar shows 4pm. Always read the tool's actual output back to the caller for confirmation.
  4. The 5% accent failure. 95% accuracy on accents sounds great until you remember 5% of your customers can't use the product.
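One way to guard against the hallucinated-booking failure is to build the confirmation sentence from the tool's structured result instead of trusting the model's wording. The sketch below assumes a hypothetical `BookingResult` shape for the calendar tool's return value; the function names are illustrative, and the string match in `reply_is_grounded` is deliberately naive.

```python
from dataclasses import dataclass
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


@dataclass(frozen=True)
class BookingResult:
    """Hypothetical shape of what a calendar tool returns."""
    start: datetime
    confirmed: bool


def spoken_time(dt: datetime) -> str:
    """Format a datetime the way the agent would say it, e.g. 'Tuesday at 4:00 PM'."""
    hour = dt.hour % 12 or 12
    suffix = "AM" if dt.hour < 12 else "PM"
    return f"{DAYS[dt.weekday()]} at {hour}:{dt.minute:02d} {suffix}"


def confirmation_line(result: BookingResult) -> str:
    """Build the confirmation from the tool output, never from model free text."""
    if not result.confirmed:
        return "I couldn't book that slot. Would you like to try another time?"
    return f"You're booked for {spoken_time(result.start)}. Is that right?"


def reply_is_grounded(model_reply: str, result: BookingResult) -> bool:
    """Naive check: the model's reply must contain the exact time the tool returned."""
    return spoken_time(result.start).lower() in model_reply.lower()


if __name__ == "__main__":
    result = BookingResult(start=datetime(2025, 1, 7, 16, 0), confirmed=True)
    draft = "Great, you're all set for Tuesday at 3:00 PM."
    if not reply_is_grounded(draft, result):
        # Discard the model's draft and speak the tool-derived line instead.
        print(confirmation_line(result))
```

A real deployment would need a more tolerant comparison (the model may say "4 pm" or "four o'clock"), but the design choice holds: the spoken time comes from the calendar, and the caller gets a chance to correct it.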

Our recommendation

Under 5k minutes/month, single language, simple flow: Retell or Vapi. Done in a week, move on.

5k–30k minutes/month, custom integrations needed: Hybrid. Telephony from vendor, brain from you.

30k+ minutes/month or strict data residency: Full DIY. It's a project, not a config, but the unit economics (or your compliance requirements) demand it.
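The thresholds above can be written down as a small decision helper. This is only a sketch of the rule of thumb: the function name and return strings are made up for illustration, and it models just volume and data residency, not the other qualifiers (single language, simple flow, custom integrations).

```python
def recommend_path(minutes_per_month: int, strict_data_residency: bool = False) -> str:
    """Map monthly call volume and residency needs to one of the three paths."""
    if minutes_per_month < 0:
        raise ValueError("minutes_per_month must be non-negative")
    # Strict residency forces full DIY regardless of volume.
    if strict_data_residency or minutes_per_month >= 30_000:
        return "full DIY"
    if minutes_per_month >= 5_000:
        return "hybrid"
    return "full-stack vendor"


if __name__ == "__main__":
    for volume in (2_000, 12_000, 45_000):
        print(volume, recommend_path(volume))
    print(2_000, recommend_path(2_000, strict_data_residency=True))
```

Treat the output as a starting point for discussion; a low-volume project with heavy custom integrations may still be better served by the hybrid path.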

Voice AI is the rare AI product where the LLM is the easy part. The other 80% — telephony, latency, interruptions, tool calling — is what eats your timeline.