Here's the thing about multi-agent systems in production: the demos make the LLM call look like the hard problem. It isn't.
Three months into a routing system for a bank, we hit a bug nobody had flagged. Agent A timed out at 32 seconds. Agent B had already started executing on A's partial output. The audit log now had two contradictory entries — and the customer's WhatsApp had already received a confirmation. Two records of truth. One of them lying to a customer. The fix wasn't a better prompt. It was: retry logic with idempotency keys, a reconciliation pass before any downstream action, and a hard gate so no customer-facing message ships until reconciliation confirms a single source of truth. The LLM call is the easy part. The orchestration + audit + reconciliation is the system.
— in user's voice profile