Best tool for the job
Sonnet 4.6 for synthesis, Opus 4.7 for premium escalation, Haiku 4.5 + Nova Pro for cheap lookups, Llama 4 Scout for long context, Qwen3 Max + DeepSeek V4 for Chinese work. Same stack, seven models, one router.
Route each question to the model that wins on quality-per-dollar (Bedrock SG for Western frontier, Alibaba HK for Chinese models), cache four layers deep, and skip the H20 cluster unless Compliance forces our hand.
3 k MAU at 200 msg/analyst peaks at ~3 500 tok/s output. One H20 server (8 GPU) sustains ~2 000 tok/s on DeepSeek V4 FP8. Two machines cover peak with ~15 % headroom and give failover; Laurent's 6× quote was sized for training + multi-model, not pure inference.
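A quick back-of-the-envelope check of that sizing, using only the rounded figures above (peak output rate and per-server throughput; everything behind the ~3 500 tok/s peak comes from the capacity model and is not re-derived here):

```python
import math

# Rounded inputs from the capacity model above (assumptions, not measurements).
PEAK_OUTPUT_TOK_S = 3_500   # aggregate output tokens/sec at peak (3 k MAU, 200 msg/analyst)
H20_SERVER_TOK_S = 2_000    # sustained DeepSeek V4 FP8 throughput per 8-GPU H20 server

servers = math.ceil(PEAK_OUTPUT_TOK_S / H20_SERVER_TOK_S)           # -> 2
headroom = servers * H20_SERVER_TOK_S / PEAK_OUTPUT_TOK_S - 1       # -> ~0.14
quote_utilisation = PEAK_OUTPUT_TOK_S / (6 * H20_SERVER_TOK_S)      # -> ~0.29

print(f"servers to cover peak: {servers}")
print(f"headroom with {servers} servers: {headroom:.0%}")           # ~15 % on these rounded numbers
print(f"peak utilisation of the 6-machine quote: {quote_utilisation:.0%}")
```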
Embedding · retrieval · prompt · semantic-answer. Bedrock prompt cache alone is the biggest single saver — ~90 % off input cost on warm sessions, no quality loss.
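A minimal sketch of how that prompt-cache layer is wired, assuming the Bedrock Converse API with a cachePoint marker after the stable prefix (system prompt + retrieved chunks); the region, model ID, and cache eligibility of Sonnet 4.6 are assumptions here, not confirmed configuration:

```python
import boto3

# Assumed region (ap-southeast-1 = Bedrock SG) and a placeholder model ID.
bedrock = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
MODEL_ID = "anthropic.claude-sonnet-4-6"  # placeholder; substitute the real Bedrock model ID

def answer(question: str, system_prompt: str, retrieved_chunks: list[str]) -> str:
    """Everything before the cachePoint (system prompt + retrieved notes) is the stable
    prefix that stays identical across a session, so warm turns read it from cache."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": system_prompt},
            {"text": "\n\n".join(retrieved_chunks)},
            {"cachePoint": {"type": "default"}},  # cache boundary: the prefix above is written/reused
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    usage = response["usage"]
    # cacheReadInputTokens > 0 on warm sessions is where the ~90 % input saving shows up.
    print(usage.get("cacheReadInputTokens", 0), usage.get("cacheWriteInputTokens", 0))
    return response["output"]["message"]["content"][0]["text"]
```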
A small classifier (regex + Gemini 3.1 Flash-Lite zero-shot) tags each query into one of seven buckets and dispatches in < 50 ms. Western frontier on Bedrock SG; Chinese models on Alibaba HK. Sonnet 4.6 takes the heavy synthesis half; the other six lanes pick up the niches.
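A sketch of that dispatch path, with illustrative bucket names and endpoint mapping (none of these identifiers are the real config); the regex fast path catches obviously-typed queries, and anything ambiguous falls through to the zero-shot classifier:

```python
import re

# Illustrative bucket -> (provider, model) map; lane names are assumptions, not the real config.
ROUTES = {
    "synthesis":    ("bedrock-sg",   "sonnet-4.6"),
    "premium":      ("bedrock-sg",   "opus-4.7"),
    "lookup":       ("bedrock-sg",   "haiku-4.5"),     # Nova Pro shares this lane
    "long_context": ("bedrock-sg",   "llama-4-scout"),
    "cn_general":   ("dashscope-hk", "qwen3-max"),
    "cn_reasoning": ("dashscope-hk", "deepseek-v4"),
    "other":        ("bedrock-sg",   "haiku-4.5"),
}

# Regex fast path: deterministic, microseconds, covers the obvious cases.
FAST_PATTERNS = [
    (re.compile(r"\b(price target|valuation|thesis|initiat\w* coverage)\b", re.I), "synthesis"),
    (re.compile(r"\b(eps|ticker|market cap|share count|close price)\b", re.I),     "lookup"),
    (re.compile(r"[\u4e00-\u9fff]"),                                               "cn_general"),
]

def route(query: str, classify_zero_shot=None) -> tuple[str, str]:
    """Return (provider, model). classify_zero_shot is the fallback LLM classifier
    (e.g. a Flash-Lite zero-shot prompt that returns one bucket name)."""
    for pattern, bucket in FAST_PATTERNS:
        if pattern.search(query):
            return ROUTES[bucket]
    if classify_zero_shot is not None:
        return ROUTES.get(classify_zero_shot(query), ROUTES["other"])
    return ROUTES["other"]
```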
Only relevant if Compliance ever blocks the cloud path. For pure inference at 3 k analysts at 200 msg/analyst on DeepSeek V4 FP8: one machine falls short of the ~3 500 tok/s peak; two cover it with headroom plus HA. Six is over-built by 3×.
Hit-rates conservative after 30 days warm-up. Real numbers tend higher because analysts ask variants of the same question all week.
hash(query) → vector. 7-day TTL. Catches duplicate phrasing across analysts.
hash(embedding + filters) → top-K doc IDs. 10-min TTL — fresh notes still surface.
Built into the API. ~7 k-token system + retrieved chunks stable across a session — cuts effective input cost ~90 % on warm sessions.
Cosine ≥ 0.96 → cached answer. 24-hr TTL · invalidated on new ingest in the relevant sector. Only serves on a current-context hit.
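A sketch of how the embedding, retrieval, and semantic-answer layers above could compose on a shared Redis cache (the prompt layer is native to Bedrock and not shown); the TTLs and the 0.96 gate follow the figures above, while key names and the entry format are illustrative:

```python
import hashlib
import json
import numpy as np
import redis

r = redis.Redis()              # assumed shared cache; key naming is illustrative

EMBED_TTL = 7 * 24 * 3600      # embedding layer: 7-day TTL
RETR_TTL  = 10 * 60            # retrieval layer: 10-minute TTL so fresh notes still surface
SIM_GATE  = 0.96               # cosine threshold for serving a cached answer

def _h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def get_embedding(query: str, embed_fn) -> np.ndarray:
    """Layer 1: hash(query) -> vector."""
    key = f"emb:{_h(query)}"
    if (hit := r.get(key)) is not None:
        return np.frombuffer(hit, dtype=np.float32)
    vec = embed_fn(query).astype(np.float32)
    r.setex(key, EMBED_TTL, vec.tobytes())
    return vec

def get_doc_ids(vec: np.ndarray, filters: dict, search_fn) -> list[str]:
    """Layer 2: hash(embedding + filters) -> top-K doc IDs."""
    key = f"retr:{_h(vec.tobytes().hex() + json.dumps(filters, sort_keys=True))}"
    if (hit := r.get(key)) is not None:
        return json.loads(hit)
    ids = search_fn(vec, filters)
    r.setex(key, RETR_TTL, json.dumps(ids))
    return ids

def semantic_answer_hit(vec: np.ndarray, doc_ids: list[str], candidates: list[dict]) -> str | None:
    """Layer 4: serve a cached answer only when the query is near-identical (cosine >= 0.96)
    AND it was generated from the same retrieved context (the 'current-context hit').
    candidates are prior entries for the same sector: {"vec", "doc_ids", "answer"};
    entries live 24 h and are deleted when new notes are ingested in that sector."""
    for entry in candidates:
        sim = float(np.dot(vec, entry["vec"])
                    / (np.linalg.norm(vec) * np.linalg.norm(entry["vec"])))
        if sim >= SIM_GATE and entry["doc_ids"] == doc_ids:
            return entry["answer"]
    return None
```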
Does Compliance accept research-note bodies traversing AWS Bedrock SG and Alibaba Cloud HK with no-retention contracts? If not, we flip to 2× H20 self-host on DeepSeek V4 — same RAG plumbing, $213 k/yr instead of $171 k/yr.
200-question golden set rated by senior analysts before launch. The router relies on offline benchmarks of which model wins which class — without that, "vibes" wins arguments and the cost story falls apart.
How fresh do retrieved notes need to be? "Within an hour" is trivial; sub-minute is hard. The answer shapes our cache invalidation and retrieval-cache TTL.
Citation-back to original notes is non-negotiable for sell-side Q&A. Prompt + tool layer should fail loudly on uncited answers — a hallucinated price target on a CLSA letterhead is a regulatory event.
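One way to make "fail loudly" concrete is a hard post-generation gate: the answer must cite at least one note that was actually retrieved for this question, and may not cite anything that wasn't. The [NOTE:...] marker format below is an assumption; whatever convention the prompt enforces, the check is the same:

```python
import re

CITATION_RE = re.compile(r"\[NOTE:([A-Za-z0-9_-]+)\]")   # assumed citation-marker format

class UncitedAnswerError(RuntimeError):
    """Raised instead of returning an answer that fails the citation gate."""

def enforce_citations(answer: str, retrieved_ids: set[str]) -> str:
    cited = set(CITATION_RE.findall(answer))
    if not cited:
        raise UncitedAnswerError("No citations in answer; refusing to return it.")
    unknown = cited - retrieved_ids
    if unknown:
        # Citing a note that was never retrieved is treated as a hallucination, not a warning.
        raise UncitedAnswerError(f"Answer cites notes outside the retrieved set: {sorted(unknown)}")
    return answer
```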
Cost figures use mid-2026 list prices from the Pricing tab. Western models priced at AWS Bedrock ap-southeast-1; Chinese models at Alibaba Cloud DashScope HK. Sonnet 4.6 number assumes 52 % traffic share, ~3-turn average sessions, Bedrock prompt-cache hit rate ~70 % on warm sessions. Self-host annualized cost prorates Laurent's 6-machine quote ($1.7 M capex / 5 yr + $296 k/yr opex) and rebuilds opex at sub-linear scaling — per-machine annual cost averages ~$130 k over a 5-year horizon as CAPEX amortizes. For the full architecture and depth, see Approach → Technical brief.
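For the record, the $213 k/yr self-host figure reconciles with a straight proration of Laurent's quote; the sub-linear opex rebuild described above lands in the same range. This reproduces the arithmetic under that proration assumption and is not the pricing model itself:

```python
# Straight proration of Laurent's 6-machine quote (approximation; the brief's own model
# rebuilds opex sub-linearly but arrives at roughly the same 2-machine figure).
CAPEX_6 = 1_700_000          # USD, 6 machines
AMORT_YRS = 5
OPEX_6 = 296_000             # USD/yr, 6 machines
CLOUD = 171_000              # USD/yr, cloud-routed stack

annual_6 = CAPEX_6 / AMORT_YRS + OPEX_6      # ~$636k/yr for six machines
annual_2 = annual_6 * 2 / 6                  # ~$212k/yr, quoted as $213k/yr

print(f"6-machine annualised: ${annual_6:,.0f}")
print(f"2-machine prorated:   ${annual_2:,.0f}")
print(f"self-host premium over cloud: ${annual_2 - CLOUD:,.0f}/yr")
```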