The three stacks: how voice AI is splitting into incompatible architectures

Eighteen months ago you could draw the voice AI stack as one diagram. Today it splits into at least three, and the differences matter for anyone building, buying, or investing in this space.

Vertical integration

One thesis says the only way to hit the latency and quality bar is to own the whole stack: speech-to-text (STT), the language model, text-to-speech (TTS), and the orchestration between them. Cartesia and Deepgram are pushing in this direction from the model side. The bet is that the seams between components are where latency and errors hide.

Modular APIs

The opposite thesis says the stack should be composable, and that the winners will be the orchestration layers that let developers pick the best component at each layer. Vapi and Retell live here. The bet is that no single vendor will be best at everything, and that flexibility wins.

Carrier as platform

The third thesis is that whoever owns the phone number and the carrier relationship owns the customer. Twilio and Telnyx are extending up from the network layer into orchestration. The bet is that distribution beats architecture.

These are not just engineering choices. They imply different go-to-market motions, different margins, and different acquirers. Watch which one enterprise buyers reward.