How We Achieved 116ms Response Time: AI Performance Benchmark

Speed matters in real-time interview assistance. At 500ms, answers feel delayed. At 200ms, they feel fast. At 116ms, they feel instant — answers appear before you've even finished processing the question yourself. Here's exactly how we achieved it.
Pipeline Optimization Strategies
1. Edge-First Audio Processing
Audio processing runs locally on your machine, not in the cloud. This eliminates ~80ms of network round-trip time. We use optimized WASM-compiled voice activity detection (VAD) and speech-to-text (STT) models for near-native performance.
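The first stage of local audio processing is deciding whether a frame contains speech at all. The actual VAD model is WASM-compiled and not shown here; as a minimal illustrative sketch (the `threshold` value is a made-up example, not a production setting), an energy-based detector looks like this:

```python
import math

def is_speech(frame, threshold=0.01):
    """Energy-based voice activity detection on one frame of audio.

    frame: list of float PCM samples in [-1.0, 1.0].
    Returns True when the frame's RMS energy exceeds the threshold,
    i.e. the frame likely contains speech rather than silence.
    """
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold
```

Running this on-device means a silent frame is discarded in microseconds, with no network hop at all.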
2. Streaming Everything
Nothing waits for full completion. Audio streams to STT, STT streams to LLM, LLM streams to overlay. Each component processes chunks as they arrive, not buffered batches.
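The chunk-by-chunk handoff can be sketched with generators, where each stage yields output as soon as each input chunk arrives instead of waiting for the full stream. The stage bodies below are toy stand-ins, not the real STT or LLM:

```python
def stt(audio_chunks):
    """Toy STT stage: emits one transcript token per audio chunk,
    as soon as that chunk arrives (no buffering of the full audio)."""
    for i, chunk in enumerate(audio_chunks):
        yield f"word{i}"

def llm(tokens):
    """Toy LLM stage: transforms each transcript token into an output
    token immediately, rather than waiting for the full transcript."""
    for tok in tokens:
        yield tok.upper()

def overlay(stream):
    """Toy overlay stage: collects tokens as they stream in."""
    return list(stream)

# The pipeline is a chain of lazy generators: each audio chunk flows
# all the way to the overlay before the next chunk is even read.
audio = iter([b"chunk-a", b"chunk-b", b"chunk-c"])
result = overlay(llm(stt(audio)))
```

Because every stage is lazy, end-to-end latency is governed by the first chunk, not the last one.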
3. Speculative Inference
As transcription streams in, we begin inference on partial input. If later words change our read of the question, we discard the partial output and restart inference; but for 90% of questions, the first few words indicate the topic accurately enough to start generating.
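The restart-on-mismatch loop can be sketched as follows. Both `topic_of` (a stand-in for topic detection on a partial transcript) and the injected `generate` callback are hypothetical names for illustration, not the actual AissenceAI internals:

```python
def topic_of(prefix):
    """Hypothetical topic guess from a partial transcript:
    here, just the first word, lowercased."""
    words = prefix.split()
    return words[0].lower() if words else None

def speculative_answer(transcript_chunks, generate):
    """Start generating as soon as a topic guess exists; whenever a new
    chunk changes the guess, discard the old output and regenerate."""
    seen = ""
    topic = None
    answer = None
    for chunk in transcript_chunks:
        seen += chunk
        new_topic = topic_of(seen)
        if new_topic != topic:
            # Speculation invalidated (or first guess): restart inference.
            topic = new_topic
            answer = generate(topic)
    return answer
```

In the common case the topic guess never changes, so the answer generated on the first few words is simply kept, and the restart path costs nothing.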
4. Model Selection & Quantization
We maintain multiple model variants: lightweight models for quick pattern detection, quantized models for fast inference, and full-precision models for complex questions. The router selects based on question complexity.
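A routing decision like this can be sketched with a simple dispatch function. The word-count cutoffs and the `pattern`/`quantized`/`full` variant names are illustrative assumptions, not the real routing heuristic:

```python
def route(question, models):
    """Pick a model variant by a rough complexity proxy (word count).

    models: mapping with keys "pattern", "quantized", "full".
    Short questions hit the lightweight pattern detector; mid-length
    ones the quantized model; long ones the full-precision model.
    """
    words = len(question.split())
    if words <= 4:
        return models["pattern"]
    if words <= 20:
        return models["quantized"]
    return models["full"]
```

The cheap proxy keeps routing itself off the critical path: it costs a string split, not a model call.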
Benchmark Comparison
| Tool | First Token Latency | Full Answer |
|---|---|---|
| AissenceAI | 116ms | 1.2s |
| Final Round AI | ~500ms | ~3-5s |
| LockedIn AI | ~300ms | ~2-4s |
| ChatGPT (manual) | ~1000ms | ~5-10s |
Why This Matters
In a live interview, you have about 3-5 seconds to start responding to a question. With 500ms+ tools, you get AI suggestions after you've already started talking (too late). At 116ms, suggestions arrive while you're still hearing the question — giving you time to plan your response. This is the core advantage of AissenceAI.