Research Q&A · 3 k analysts · Hong Kong

Best tool for every query.
Seven models. One stack.

Route each question to the model that wins on quality-per-dollar (Bedrock SG for Western frontier, Alibaba HK for Chinese models), cache four layers deep, and skip the H20 cluster unless Compliance forces our hand.

All-in monthly bill
$14,290
at 3 k MAU · 200 msg/analyst · ~$4.76 / user-mo · Bedrock + Alibaba + 4-layer cache
Target users
3 k
monthly active
Token volume
6.90 B
tokens / month · 70 % cacheable
Models in rotation
7
routed by query class
Vs 6× H20 quote
~4×
cheaper · 5-yr
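
The headline cards follow from three inputs: 3 k MAU, 200 msg/analyst, and the per-query token averages quoted further down this page. A quick sanity check:

```python
# Sanity-check the headline stats from this page's traffic assumptions.
MAU = 3_000                       # monthly active analysts
MSGS_PER_ANALYST = 200            # messages per analyst per month
IN_TOK, OUT_TOK = 10_500, 1_000   # avg input / output tokens per query
BILL = 14_290                     # all-in monthly bill, USD

msgs = MAU * MSGS_PER_ANALYST     # 600k messages / month
tokens = msgs * (IN_TOK + OUT_TOK)
per_user = BILL / MAU             # blended $ / user-month

print(f"{msgs:,} msgs · {tokens / 1e9:.2f} B tokens · ${per_user:.2f}/user-mo")
# → 600,000 msgs · 6.90 B tokens · $4.76/user-mo
```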
Three decisions, in order
Decision 1 · Routing

Best tool for the job

Sonnet 4.6 for synthesis, Opus 4.7 for premium escalation, Haiku 4.5 + Nova Pro for cheap lookups, Llama 4 Scout for long context, Qwen3 Max + DeepSeek V4 for Chinese work. Same stack, seven models, one router.

Decision 2 · Sizing

If forced on-prem: 2 H20s, not 6

3 k MAU at 200 msg/analyst peaks at ~3 500 tok/s output. One H20 server (8 GPU) sustains ~2 000 tok/s on DeepSeek V4 FP8. Two machines give HA plus ~20 % headroom; Laurent's 6× quote was sized for training + multi-model serving, not pure inference.
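
The sizing arithmetic, with an assumed business-hours traffic window and a burst factor chosen to reproduce the ~3 500 tok/s peak (both assumptions, not measurements):

```python
import math

# Back-of-envelope GPU sizing under this page's assumptions.
MSGS = 3_000 * 200            # 600k messages / month
OUT_TOK = 1_000               # output tokens per message
BIZ_SECONDS = 22 * 8 * 3600   # assume traffic lands in 22 workdays x 8 h
BURST = 3.7                   # assumed peak-to-average burst factor
PER_SERVER = 2_000            # sustained tok/s per 8-GPU H20 (page estimate)

avg_tps = MSGS * OUT_TOK / BIZ_SECONDS       # ~950 tok/s average
peak_tps = avg_tps * BURST                   # ~3,500 tok/s at peak
servers = math.ceil(peak_tps / PER_SERVER)   # capacity floor, before HA

print(f"avg {avg_tps:.0f} tok/s · peak {peak_tps:.0f} tok/s · {servers} server(s)")
```

Two servers clear the peak with room to spare, which is the "2, not 6" answer; the third-through-sixth machines buy training capacity, not inference capacity.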

Decision 3 · Caching

Cache four layers

Embedding · retrieval · prompt · semantic-answer. Bedrock prompt cache alone is the biggest single saver — ~90 % off input cost on warm sessions, no quality loss.

The router

Multi-model routing · best tool for every query class

A small classifier (regex + Gemini 3.1 Flash-Lite zero-shot) tags each query into one of seven buckets and dispatches in < 50 ms. Western frontier on Bedrock SG; Chinese models on Alibaba HK. Sonnet 4.6 takes the heavy synthesis half; the other six lanes pick up the niches.

Query router
Classifies on language, complexity, tool-need, and cost-tier in < 50 ms. Falls back to Sonnet 4.6 on ambiguity.
600 k messages / month
Avg query: 10.5 k input tokens (system + tools + 8 retrieved chunks + history) · 1 k output tokens. ~70 % cache-eligible.
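
A sketch of the two-stage dispatch: cheap regex rules first, then an LLM zero-shot fallback. Bucket names mirror the seven lanes on this page; the rules themselves are illustrative, not the production rule set.

```python
import re

# Stage 1: regex fast-path over the seven lanes. Order matters: the
# CJK rule fires before keyword rules so Mandarin queries stay in the
# Chinese lanes. Patterns are illustrative stand-ins.
RULES = [
    (re.compile(r"[\u4e00-\u9fff]"), "qwen3-max"),                    # CJK text
    (re.compile(r"\b(translate|翻译)\b", re.I), "deepseek-v4"),
    (re.compile(r"\b(latest|target price|last note)\b", re.I), "haiku-4.5"),
    (re.compile(r"\b(json|schema|extract fields)\b", re.I), "nova-pro"),
    (re.compile(r"\b(across the archive|all reports)\b", re.I), "llama-4-scout"),
    (re.compile(r"\b(dcf|audit|challenge the model)\b", re.I), "opus-4.7"),
]

def route(query: str, llm_classify=None) -> str:
    for pattern, lane in RULES:
        if pattern.search(query):
            return lane
    if llm_classify is not None:   # stage 2: zero-shot classifier on ambiguity
        return llm_classify(query)
    return "sonnet-4.6"            # default lane

print(route("Show me CLSA's last note on China property"))  # → haiku-4.5
```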
52%
Claude Sonnet 4.6 Bedrock · default · synthesis
Long-context summaries, multi-note comparisons, English research synthesis. Bedrock prompt cache cuts input ~90 % across a session.
$8,330
/ mo · cached
2%
Claude Opus 4.7 Bedrock · premium escalation
DCF reviews, target-price challenges, audit-grade reasoning. Reserved for the queries where being wrong is a regulatory event.
$1,596
/ mo
15%
Claude Haiku 4.5 Bedrock · cheap path · lookups
"What's the latest target on TCEHY?" · "Show me CLSA's last note on China property" · fast factual lookups, summarization, classification.
$800
/ mo
10%
Qwen3 Max Alibaba · Mainland CN research
Native Chinese fluency, strong on A-share / CSRC regulator filings. The pick when the source text is Mandarin and quality matters.
$864
/ mo
8%
a
Amazon Nova Pro Bedrock · structured outputs
JSON-schema outputs, MCP tool chains (Bloomberg, CapIQ, internal models). Cheap structured calls where the model just has to fill in fields correctly.
$346
/ mo
8%
L
Llama 4 Scout Bedrock · long context · 10 M
10 M-token context window. Whole-report scans, "where did we mention X across the archive?" — cheapest input tokens in the rotation.
$86
/ mo
5%
DeepSeek V4 Alibaba · CN ↔ EN cheap heavy
EN ↔ ZH translation of mainland filings · cheap heavy reasoning · open-weights fallback if compliance ever forces self-host.
$74
/ mo
Sonnet handles half the volume and absorbs ~69 % of LLM cost — which is exactly why prompt caching is the highest-leverage thing we ship. The cheap-path lanes (Haiku + Nova + Llama 4 + DeepSeek) pick up 36 % of queries at ~11 % of generation cost.
Generation total $12,096 / mo
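
The lane shares and dollar figures above roll up as follows (values copied from this page):

```python
# Per-lane generation costs: (traffic share, USD / month).
lanes = {
    "sonnet-4.6":    (0.52, 8_330),
    "opus-4.7":      (0.02, 1_596),
    "haiku-4.5":     (0.15,   800),
    "qwen3-max":     (0.10,   864),
    "nova-pro":      (0.08,   346),
    "llama-4-scout": (0.08,    86),
    "deepseek-v4":   (0.05,    74),
}
total = sum(cost for _, cost in lanes.values())      # $12,096 generation total
sonnet_share = lanes["sonnet-4.6"][1] / total        # ~69 % of LLM spend
cheap = ["haiku-4.5", "nova-pro", "llama-4-scout", "deepseek-v4"]
cheap_q = sum(lanes[k][0] for k in cheap)            # 36 % of queries
cheap_c = sum(lanes[k][1] for k in cheap) / total    # ~11 % of spend

print(f"${total:,} · Sonnet {sonnet_share:.0%} · "
      f"cheap lanes {cheap_q:.0%} of queries at {cheap_c:.0%} of cost")
```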
The fall-back plan

How many H20 machines do we actually need?

Only relevant if Compliance ever blocks the cloud path. For pure inference at 3 k analysts × 200 msg/analyst on DeepSeek V4 FP8: one machine runs at its limit at peak; two gives comfortable headroom + HA. Six is over-built by 3×.

1
machine
8 GPU · 768 GB VRAM
tight
peak utilization ~100 %
Sustained throughput ~2 000 tok/s
Annual all-in cost ~$285 k / yr
5-yr TCO ~$1.43 M
No HA. One reboot or hardware failure = service down. Acceptable for an internal pilot, not for a production analyst tool.
2
machines
16 GPU · 1.5 TB VRAM
right size
peak utilization ~80 %
Sustained throughput ~4 000 tok/s
Annual all-in cost ~$213 k / yr
5-yr TCO ~$1.07 M
Active-active HA. ~20 % headroom for traffic growth or model swaps. If on-prem becomes mandatory, this is the answer.
6
machines
48 GPU · 4.5 TB VRAM
over-built
peak utilization ~30 %
Sustained throughput ~12 000 tok/s
Annual all-in cost ~$640 k / yr
5-yr TCO ~$3.20 M
Laurent's quoted scale. Makes sense if we're also doing fine-tunes, embedding training, or hosting 3+ models simultaneously. For pure inference at our volume, no.
Cloud (recommended)
$171 k / yr
$14.3 k / mo
1× H20 self-host
$285 k / yr
1.7× more
2× H20 self-host
$213 k / yr
~1.2× more
6× H20 (Laurent)
$640 k / yr
4× more
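
The multipliers above are just annual costs over the cloud baseline; the "~4×" headline rounds up from 640/171 ≈ 3.7:

```python
# Annual-cost comparison of the four options (USD / yr, from this page).
CLOUD = 171_000   # recommended cloud path
SELF_HOST = {"1x H20": 285_000, "2x H20": 213_000, "6x H20 (Laurent)": 640_000}

ratios = {name: round(cost / CLOUD, 1) for name, cost in SELF_HOST.items()}
for name, r in ratios.items():
    print(f"{name}: {r}x the cloud bill")
```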
How a question becomes an answer

The RAG pipeline · 6 stages, 1 reranker, router-selected generator

end-to-end · ~3.5 s p50 · ~6 s p95
1 · Question
User query
+ last 6 turns
2 · Embed
BGE-M3 · self-host
1024-dim · ~30 ms · L40S in HK
3 · Retrieve
OpenSearch · ap-southeast-1
hybrid: dense + BM25 · top-50
4 · Rerank
Cohere Rerank 3.5
top-50 → top-8 · ~120 ms
5 · Route
Pick model
language · complexity · tools
6 · Generate + tools
Router picks the model
prompt-cached · MCP tools · streamed
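
Wired together, the six stages reduce to one short function. The stage bodies below are stand-ins for the services in the diagram (BGE-M3, OpenSearch, Cohere Rerank, the classifier, the routed generator):

```python
# Skeleton of the six-stage pipeline; each stub stands in for a real service.
def embed(q):        return [0.0] * 1024                      # 2 · BGE-M3, 1024-dim
def retrieve(vec):   return [f"doc{i}" for i in range(50)]    # 3 · hybrid top-50
def rerank(q, docs): return docs[:8]                          # 4 · top-50 -> top-8
def route(q):        return "sonnet-4.6"                      # 5 · classifier
def generate(q, chunks, model):                               # 6 · routed generator
    return f"[{model}] answer grounded in {len(chunks)} chunks"

def answer(query: str) -> str:
    vec    = embed(query)
    docs   = retrieve(vec)
    chunks = rerank(query, docs)
    model  = route(query)
    return generate(query, chunks, model)

print(answer("What drove TCEHY's Q2 margin?"))
# → [sonnet-4.6] answer grounded in 8 chunks
```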
The cost lever

Caching · 4 layers, in order of impact

Hit rates are conservative estimates for a system 30 days past warm-up; real numbers tend higher because analysts ask variants of the same question all week.

1

Embedding cache · Redis

hash(query) → vector. 7-day TTL. Catches duplicate phrasing across analysts.

35%
2

Retrieval cache · Redis

hash(embedding + filters) → top-K doc IDs. 10-min TTL — fresh notes still surface.

25%
3

Bedrock prompt cache Biggest saver

Built into the API. ~7 k-token system + retrieved chunks stable across a session — cuts effective input cost ~90 % on warm sessions.

90%
4

Semantic answer cache

Cosine ≥ 0.96 → cached answer. 24-hr TTL · invalidated on new ingest in the relevant sector. Only serves on a current-context hit.

15%
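
The first two layers are plain key-value lookups. A minimal sketch, with dicts standing in for Redis and the hash keys and TTLs as described above:

```python
import hashlib
import time

# Layers 1-2 of the cache stack; dicts stand in for Redis here.
embed_cache, retr_cache = {}, {}

def cached(cache, key, ttl, compute):
    """Return (value, hit): serve if fresh, else compute and store."""
    entry = cache.get(key)
    if entry and time.time() - entry[1] < ttl:
        return entry[0], True
    value = compute()
    cache[key] = (value, time.time())
    return value, False

def get_embedding(query, embed_fn):
    key = hashlib.sha256(query.encode()).hexdigest()   # hash(query) -> vector
    return cached(embed_cache, key, 7 * 86_400, lambda: embed_fn(query))

def get_topk(vec, filters, search_fn):
    key = hashlib.sha256(repr((vec, filters)).encode()).hexdigest()
    return cached(retr_cache, key, 600, lambda: search_fn(vec, filters))

vec, hit1 = get_embedding("China property outlook", lambda q: [0.1, 0.2])
_,   hit2 = get_embedding("China property outlook", lambda q: [0.1, 0.2])
print(hit1, hit2)   # → False True : identical phrasing hits the cache
```

Layer 3 lives inside the Bedrock API rather than in our code, and layer 4 replaces the exact-hash key with a cosine-similarity lookup over answer embeddings (threshold 0.96, as above).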
The bill

Where the $14,290 goes · monthly @ 3 k MAU

Monthly all-in
$14,290 / month
Cost share: Sonnet 58 % · Opus · Rerank · Qwen · Haiku · cheap lanes · AWS infra
Sonnet 4.6 · Bedrock $8,330
Opus 4.7 · Bedrock $1,596
Cohere Rerank 3.5 · Bedrock $1,200
Qwen3 Max · Alibaba $864
Haiku 4.5 · Bedrock $800
Nova Pro · Bedrock $346
Llama 4 Scout · Bedrock $86
DeepSeek V4 · Alibaba $74
ECS Fargate + ALB $300
Aurora + ElastiCache $200
OpenSearch · S3 · embed $240
CloudWatch + X-Ray $250
Per analyst / mo
$4.76
Per query (avg)
$0.024
Per analyst / yr
$57.15
Before we ship

Four questions for the steering meeting

Compliance

Does Compliance accept research-note bodies traversing AWS Bedrock SG and Alibaba Cloud HK with no-retention contracts? If not, we flip to 2× H20 self-host on DeepSeek V4 — same RAG plumbing, $213 k/yr instead of $171 k/yr.

Eval rubric

200-question golden set rated by senior analysts before launch. The router relies on offline benchmarks of which model wins which class — without that, "vibes" wins arguments and the cost story falls apart.

Note freshness

How fresh do retrieved notes need to be? "Within an hour" is trivial; sub-minute is hard. The answer shapes our cache invalidation and retrieval-cache TTL.

Citations

Citation-back to original notes is non-negotiable for sell-side Q&A. Prompt + tool layer should fail loudly on uncited answers — a hallucinated price target on a CLSA letterhead is a regulatory event.
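
A fail-loudly guardrail of this shape can be a few lines. The citation-marker format and the sentence splitter below are illustrative assumptions, not the production contract:

```python
import re

# Reject any answer sentence that states a figure without a [doc-id]
# citation marker. Marker format and splitting are illustrative.
CITE = re.compile(r"\[[A-Za-z0-9_-]+\]")
NUMERIC = re.compile(r"\d")

def uncited_claims(answer: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if NUMERIC.search(s) and not CITE.search(s)]

ok  = "TP raised to HK$420 [clsa-2026-07-tcehy]."
bad = "TP raised to HK$420."
print(uncited_claims(ok), uncited_claims(bad))
# → [] ['TP raised to HK$420.']
```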

Cost figures use mid-2026 list prices from the Pricing tab. Western models priced at AWS Bedrock ap-southeast-1; Chinese models at Alibaba Cloud DashScope HK. Sonnet 4.6 number assumes 52 % traffic share, ~3-turn average sessions, Bedrock prompt-cache hit rate ~70 % on warm sessions. Self-host annualized cost prorates Laurent's 6-machine quote ($1.7 M capex / 5 yr + $296 k/yr opex) and rebuilds opex at sub-linear scaling — per-machine annual cost averages ~$130 k over a 5-year horizon as CAPEX amortizes. For the full architecture and depth, see Approach → Technical brief.