Best tool for the job
Sonnet 4.6 for synthesis, Opus 4.7 for premium escalation, Haiku 4.5 + Nova Pro for cheap lookups, Llama 4 Scout for long context, Qwen3 Max + DeepSeek V4 for Chinese work. Same stack, seven models, one router.
Route each question to the model that wins on quality-per-dollar (Bedrock SG for Western frontier, Alibaba HK for Chinese models), cache four layers deep, and skip the H20 cluster unless Compliance forces our hand.
3 k MAU at 200 msg/analyst peaks at ~3 500 tok/s output. One H20 server (8 GPU) sustains ~2 000 tok/s on DeepSeek V4 FP8. Two machines cover peak with ~15 % headroom and give failover; Laurent's 6× quote was sized for training + multi-model, not pure inference.
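A quick back-of-the-envelope check of that sizing, using only the rounded figures above (peak output rate and per-server throughput; everything behind the ~3 500 tok/s peak comes from the capacity model and is not re-derived here):

```python
import math

# Rounded inputs from the capacity model above (assumptions, not measurements).
PEAK_OUTPUT_TOK_S = 3_500   # aggregate output tokens/sec at peak (3 k MAU, 200 msg/analyst)
H20_SERVER_TOK_S = 2_000    # sustained DeepSeek V4 FP8 throughput per 8-GPU H20 server

servers = math.ceil(PEAK_OUTPUT_TOK_S / H20_SERVER_TOK_S)           # -> 2
headroom = servers * H20_SERVER_TOK_S / PEAK_OUTPUT_TOK_S - 1       # -> ~0.14
quote_utilisation = PEAK_OUTPUT_TOK_S / (6 * H20_SERVER_TOK_S)      # -> ~0.29

print(f"servers to cover peak: {servers}")
print(f"headroom with {servers} servers: {headroom:.0%}")           # ~15 % on these rounded numbers
print(f"peak utilisation of the 6-machine quote: {quote_utilisation:.0%}")
```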
Embedding · retrieval · prompt · semantic-answer. Bedrock prompt cache alone is the biggest single saver — ~90 % off input cost on warm sessions, no quality loss.
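A minimal sketch of how that prompt-cache layer is wired, assuming the Bedrock Converse API with a cachePoint marker after the stable prefix (system prompt + retrieved chunks); the region, model ID, and cache eligibility of Sonnet 4.6 are assumptions here, not confirmed configuration:

```python
import boto3

# Assumed region (ap-southeast-1 = Bedrock SG) and a placeholder model ID.
bedrock = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
MODEL_ID = "anthropic.claude-sonnet-4-6"  # placeholder; substitute the real Bedrock model ID

def answer(question: str, system_prompt: str, retrieved_chunks: list[str]) -> str:
    """Everything before the cachePoint (system prompt + retrieved notes) is the stable
    prefix that stays identical across a session, so warm turns read it from cache."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": system_prompt},
            {"text": "\n\n".join(retrieved_chunks)},
            {"cachePoint": {"type": "default"}},  # cache boundary: the prefix above is written/reused
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    usage = response["usage"]
    # cacheReadInputTokens > 0 on warm sessions is where the ~90 % input saving shows up.
    print(usage.get("cacheReadInputTokens", 0), usage.get("cacheWriteInputTokens", 0))
    return response["output"]["message"]["content"][0]["text"]
```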
A small classifier (regex + Gemini 3.1 Flash-Lite zero-shot) tags each query into one of seven buckets and dispatches in < 50 ms. Western frontier on Bedrock SG; Chinese models on Alibaba HK. Sonnet 4.6 takes the heavy synthesis half; the other six lanes pick up the niches.
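A sketch of that dispatch path, with illustrative bucket names and endpoint mapping (none of these identifiers are the real config); the regex fast path catches obviously-typed queries, and anything ambiguous falls through to the zero-shot classifier:

```python
import re

# Illustrative bucket -> (provider, model) map; lane names are assumptions, not the real config.
ROUTES = {
    "synthesis":    ("bedrock-sg",   "sonnet-4.6"),
    "premium":      ("bedrock-sg",   "opus-4.7"),
    "lookup":       ("bedrock-sg",   "haiku-4.5"),     # Nova Pro shares this lane
    "long_context": ("bedrock-sg",   "llama-4-scout"),
    "cn_general":   ("dashscope-hk", "qwen3-max"),
    "cn_reasoning": ("dashscope-hk", "deepseek-v4"),
    "other":        ("bedrock-sg",   "haiku-4.5"),
}

# Regex fast path: deterministic, microseconds, covers the obvious cases.
FAST_PATTERNS = [
    (re.compile(r"\b(price target|valuation|thesis|initiat\w* coverage)\b", re.I), "synthesis"),
    (re.compile(r"\b(eps|ticker|market cap|share count|close price)\b", re.I),     "lookup"),
    (re.compile(r"[\u4e00-\u9fff]"),                                               "cn_general"),
]

def route(query: str, classify_zero_shot=None) -> tuple[str, str]:
    """Return (provider, model). classify_zero_shot is the fallback LLM classifier
    (e.g. a Flash-Lite zero-shot prompt that returns one bucket name)."""
    for pattern, bucket in FAST_PATTERNS:
        if pattern.search(query):
            return ROUTES[bucket]
    if classify_zero_shot is not None:
        return ROUTES.get(classify_zero_shot(query), ROUTES["other"])
    return ROUTES["other"]
```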
Only relevant if Compliance ever blocks the cloud path. For pure inference at 3 k analysts at 200 msg/analyst on DeepSeek V4 FP8: one machine falls short of the ~3 500 tok/s peak; two cover it with headroom plus HA. Six is over-built by 3×.
Hit-rates conservative after 30 days warm-up. Real numbers tend higher because analysts ask variants of the same question all week.
hash(query) → vector. 7-day TTL. Catches duplicate phrasing across analysts.
hash(embedding + filters) → top-K doc IDs. 10-min TTL — fresh notes still surface.
Built into the API. ~7 k-token system + retrieved chunks stable across a session — cuts effective input cost ~90 % on warm sessions.
Cosine ≥ 0.96 → cached answer. 24-hr TTL · invalidated on new ingest in the relevant sector. Only serves on a current-context hit.
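A sketch of how the embedding, retrieval, and semantic-answer layers above could compose on a shared Redis cache (the prompt layer is native to Bedrock and not shown); the TTLs and the 0.96 gate follow the figures above, while key names and the entry format are illustrative:

```python
import hashlib
import json
import numpy as np
import redis

r = redis.Redis()              # assumed shared cache; key naming is illustrative

EMBED_TTL = 7 * 24 * 3600      # embedding layer: 7-day TTL
RETR_TTL  = 10 * 60            # retrieval layer: 10-minute TTL so fresh notes still surface
SIM_GATE  = 0.96               # cosine threshold for serving a cached answer

def _h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def get_embedding(query: str, embed_fn) -> np.ndarray:
    """Layer 1: hash(query) -> vector."""
    key = f"emb:{_h(query)}"
    if (hit := r.get(key)) is not None:
        return np.frombuffer(hit, dtype=np.float32)
    vec = embed_fn(query).astype(np.float32)
    r.setex(key, EMBED_TTL, vec.tobytes())
    return vec

def get_doc_ids(vec: np.ndarray, filters: dict, search_fn) -> list[str]:
    """Layer 2: hash(embedding + filters) -> top-K doc IDs."""
    key = f"retr:{_h(vec.tobytes().hex() + json.dumps(filters, sort_keys=True))}"
    if (hit := r.get(key)) is not None:
        return json.loads(hit)
    ids = search_fn(vec, filters)
    r.setex(key, RETR_TTL, json.dumps(ids))
    return ids

def semantic_answer_hit(vec: np.ndarray, doc_ids: list[str], candidates: list[dict]) -> str | None:
    """Layer 4: serve a cached answer only when the query is near-identical (cosine >= 0.96)
    AND it was generated from the same retrieved context (the 'current-context hit').
    candidates are prior entries for the same sector: {"vec", "doc_ids", "answer"};
    entries live 24 h and are deleted when new notes are ingested in that sector."""
    for entry in candidates:
        sim = float(np.dot(vec, entry["vec"])
                    / (np.linalg.norm(vec) * np.linalg.norm(entry["vec"])))
        if sim >= SIM_GATE and entry["doc_ids"] == doc_ids:
            return entry["answer"]
    return None
```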
Does Compliance accept research-note bodies traversing AWS Bedrock SG and Alibaba Cloud HK with no-retention contracts? If not, we flip to 2× H20 self-host on DeepSeek V4 — same RAG plumbing, $213 k/yr instead of $171 k/yr.
200-question golden set rated by senior analysts before launch. The router relies on offline benchmarks of which model wins which class — without that, "vibes" wins arguments and the cost story falls apart.
How fresh do retrieved notes need to be? "Within an hour" is trivial; sub-minute is hard. The answer shapes our cache invalidation and retrieval-cache TTL.
Citation-back to original notes is non-negotiable for sell-side Q&A. Prompt + tool layer should fail loudly on uncited answers — a hallucinated price target on a CLSA letterhead is a regulatory event.
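One way to make "fail loudly" concrete is a hard post-generation gate: the answer must cite at least one note that was actually retrieved for this question, and may not cite anything that wasn't. The [NOTE:...] marker format below is an assumption; whatever convention the prompt enforces, the check is the same:

```python
import re

CITATION_RE = re.compile(r"\[NOTE:([A-Za-z0-9_-]+)\]")   # assumed citation-marker format

class UncitedAnswerError(RuntimeError):
    """Raised instead of returning an answer that fails the citation gate."""

def enforce_citations(answer: str, retrieved_ids: set[str]) -> str:
    cited = set(CITATION_RE.findall(answer))
    if not cited:
        raise UncitedAnswerError("No citations in answer; refusing to return it.")
    unknown = cited - retrieved_ids
    if unknown:
        # Citing a note that was never retrieved is treated as a hallucination, not a warning.
        raise UncitedAnswerError(f"Answer cites notes outside the retrieved set: {sorted(unknown)}")
    return answer
```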
Cost figures use mid-2026 list prices from the Pricing tab. Western models priced at AWS Bedrock ap-southeast-1; Chinese models at Alibaba Cloud DashScope HK. Sonnet 4.6 number assumes 52 % traffic share, ~3-turn average sessions, Bedrock prompt-cache hit rate ~70 % on warm sessions. Self-host annualized cost prorates Laurent's 6-machine quote ($1.7 M capex / 5 yr + $296 k/yr opex) and rebuilds opex at sub-linear scaling — per-machine annual cost averages ~$130 k over a 5-year horizon as CAPEX amortizes. For the full architecture and depth, see Approach → Technical brief.
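For the record, the $213 k/yr self-host figure reconciles with a straight proration of Laurent's quote; the sub-linear opex rebuild described above lands in the same range. This reproduces the arithmetic under that proration assumption and is not the pricing model itself:

```python
# Straight proration of Laurent's 6-machine quote (approximation; the brief's own model
# rebuilds opex sub-linearly but arrives at roughly the same 2-machine figure).
CAPEX_6 = 1_700_000          # USD, 6 machines
AMORT_YRS = 5
OPEX_6 = 296_000             # USD/yr, 6 machines
CLOUD = 171_000              # USD/yr, cloud-routed stack

annual_6 = CAPEX_6 / AMORT_YRS + OPEX_6      # ~$636k/yr for six machines
annual_2 = annual_6 * 2 / 6                  # ~$212k/yr, quoted as $213k/yr

print(f"6-machine annualised: ${annual_6:,.0f}")
print(f"2-machine prorated:   ${annual_2:,.0f}")
print(f"self-host premium over cloud: ${annual_2 - CLOUD:,.0f}/yr")
```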