CLSA Research Q&A · 3 k analysts · Hong Kong · Recommendation

Cloud-first, multi-model, MCP-native.
Skip the GPUs.

At 3 k analysts and ~600 k messages a month (200 / analyst), the right answer is not a server room — it's a router. Send the query to the model that wins on quality-per-dollar (Bedrock SG for Western frontier, Alibaba HK for Chinese models), retrieve from a 300 k-report OpenSearch index, cache four layers deep, and let MCP servers expose CLSA Research, Bloomberg, and our own valuation models as tools.

$57.17
per analyst · per year · all-in.  ~0.2 % of one Bloomberg seat.
Monthly bill
$14,292
Annual bill
$171 k
vs Perplexity Finance
3.1× cheaper
Corpus
300 k
research reports indexed
Active users
3 k
monthly · 2× today's 1.5 k
Messages
600 k
/ month · 200 per analyst
Tokens
6.90 B
/ month · ~70 % cache-eligible
Models in rotation
7
routed by query class
Latency target
p50 3.5 s
first token < 800 ms
Assumptions: 3 k analysts × 200 msg / month → 600 k msg · $14,292 / mo · $171 k / yr

Why these defaults: 50 k registered users, 1.5 k MAU today. The 3 k baseline is a 2× target — headroom for the assistant becoming a daily habit, not a one-time sizing for today's MAU.

5-year TCO · the only chart that matters

Three paths. The cheap one is also the smartest one.

Same RAG plumbing, same MCP tools, same UX. The only thing that changes is where the model weights live and who writes the cheque. Bars are 5-year all-in: capex amortized + opex + cloud APIs that remain after self-host.

Cloud headroom vs 6× H20
+$2.34 M
5-year saving · spend it on data scientists or stronger evals
Cloud + Router (recommended): $857 k / 5 yr · $171 k / year
2× H20 hybrid (defensible if compliance forces): $1.07 M / 5 yr · $213 k / year
6× H20 on-prem (Laurent's quote: $1.7 M capex + $296 k/yr opex): $3.20 M / 5 yr · $640 k / year

Cloud at 600 k msg / mo includes 7 Bedrock + Alibaba models, Cohere Rerank 3, Cohere embeddings, OpenSearch, Aurora, ElastiCache, CloudWatch.

2× H20 hybrid self-hosts DeepSeek V4 / Qwen3 MoE for the cheap-path 50 % of traffic; premium queries still hit Sonnet / Opus on Bedrock.

6× H20 assumes 70 % of traffic served on-prem; remaining premium queries still hit cloud. ~30 % utilization even at peak — still over-built for inference alone.

The three live options

Pick once. Make the second-cheapest one your plan B.

Option A · Recommended

Cloud-first router

Sonnet 4.6 default · Opus 4.7 · Haiku 4.5 · Nova Pro · Llama 4 Scout (Bedrock) + Qwen3 Max · DeepSeek V4 (Alibaba) · 4-layer cache.

$14.3 k
/ month
$171 k / yr · $857 k / 5 yr · $4.76 / user-mo
  • Best quality on every query class — frontier models win on multi-doc finance.
  • Switch model in a YAML — no hardware to depreciate.
  • Bedrock ap-southeast-1 + Alibaba HK · < 60 ms from any analyst seat.
  • No-retention enterprise contracts on both providers.
Option B · Plan B

2× H20 hybrid

Self-host DeepSeek V4 (or Qwen3 MoE) on 16 GPUs · cheap path on-prem · premium queries on cloud.

$17.8 k
/ month
$213 k / yr · $1.07 M / 5 yr · $5.92 / user-mo
  • Active-active HA · ~2 000 tok/s sustained on DeepSeek V4 FP8.
  • Quality regresses 5 – 10 % on premium queries — keep cloud for those.
  • Real value: data sovereignty if Compliance flags cloud egress.
  • Scale path: add a 3rd machine when MAU clears 6 k.
Option C · Over-built

6× H20 on-prem

48 GPUs · 4.6 TB VRAM · sized for training + multi-model. ~2× larger than inference demand at our volume.

$53.3 k
/ month avg
$640 k / yr · $3.20 M / 5 yr · $17.78 / user-mo
  • Peak utilization ~30 % on inference alone — most of the metal sits idle even at the daily spike.
  • H20 is compute-cut silicon: memory bandwidth on par with H100 (slightly higher, per the table below) but ~6 – 7× fewer FLOPS.
  • $1 M+ capex over-spend vs Option B with no measurable user-side gain.
  • Only rational if we're also doing fine-tunes / training in-house.
The full stack · top to bottom

Eleven layers. Each one is a build-or-buy decision.

Buy the layers where the market is competitive (models, embeddings, rerankers, vector DB). Build the layers that are the product (router, memory, MCP tools, eval harness). Skip the layers that don't move quality (fine-tuning, custom inference servers).

Layer 1
Frontend
Streaming chat UI · structured artifacts · client-rendered tables and charts
Tables and charts render client-side from tool-call JSON, not markdown — guaranteed correctness.
$200/mo
CloudFront + ALB · ap-southeast-1
Layer 2
Auth + sessions
SSO via existing CLSA SAML · session state in Postgres
Tier and entitlements come from upstream — no auth invented in this stack.
~$0
existing infra
Layer 3
Memory
Conversational + user-profile + episodic — three stores, one assembler
Aurora PostgreSQL for user profile + watchlist · OpenSearch Serverless for episodic memory · ElastiCache Redis for last-N turns.
$200/mo
Aurora + ElastiCache
Layer 4
Backend
Router → Memory → RAG → Generate · in-house orchestrator, no framework lock-in
Streamed token-by-token. Tool calls run in parallel. Per-session process isolation, cancellable.
build
~3 eng-weeks
Layer 5
Router
Heuristics + Gemini 3.1 Flash-Lite classifier · 7 model lanes
< 50 ms classification. Falls back to Sonnet 4.6 on ambiguity. Hot-reloadable rules.
$30/mo
Flash classifier
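The Layer-5 contract fits on a page. A minimal sketch, assuming illustrative lane and model names and inlining rules that production would hot-reload from YAML: cheap heuristics first, the Flash-Lite classifier only on the residue, Sonnet on ambiguity.

import re

# Lane table: production hot-reloads this from YAML; inlined for the sketch.
LANES = {
    "default":      "claude-sonnet-4.6",   # multi-doc synthesis
    "cheap":        "claude-haiku-4.5",    # fast factual lookups
    "cn_research":  "qwen3-max",           # native-ZH filings
    "structured":   "nova-pro",            # JSON-schema tool chains
    "long_context": "llama-4-scout",       # whole-report scans
    "cn_en_heavy":  "deepseek-v4",         # translation / cheap reasoning
    "premium":      "claude-opus-4.7",     # audit-grade escalation
}

def route(query: str, wants_json: bool = False, doc_tokens: int = 0) -> str:
    """Pick a lane in < 50 ms; anything ambiguous falls to the default lane."""
    if wants_json:
        return "structured"
    if doc_tokens > 200_000:                      # whole-report scan
        return "long_context"
    if re.search(r"[\u4e00-\u9fff]", query):      # CJK characters present
        return "cn_research"
    if len(query.split()) < 8 and "?" in query:   # short factual lookup
        return "cheap"
    # Heuristics inconclusive: ask the Flash-Lite classifier (not shown);
    # on low confidence or timeout, fall back to "default" (Sonnet).
    return "default"

print(LANES[route("腾讯的目标价是多少?")])          # -> qwen3-max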
Layer 6
Retrieval
Hybrid: dense (Cohere embed-v3) + BM25 + metadata filters · top-50 → Cohere Rerank 3 → top-8
All via Bedrock SG. BM25 catches ticker / analyst-name lookups that pure semantics miss. Reranker is the quality lever.
$700/mo
Bedrock Cohere · OpenSearch
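What one hybrid request looks like. A sketch assuming hypothetical index and field names (embedding, body, sector): both arms run in a single OpenSearch query, filters apply before scoring, and the top-50 goes on to Rerank 3.

def hybrid_query(query_text: str, query_vec: list[float], sectors: list[str]) -> dict:
    """One OpenSearch request: dense k-NN + BM25, filtered, top-50 out."""
    return {
        "size": 50,
        "query": {
            "bool": {
                "should": [
                    # dense arm: Cohere embed-v3 vector
                    {"knn": {"embedding": {"vector": query_vec, "k": 50}}},
                    # sparse arm: BM25 catches tickers and analyst names
                    {"match": {"body": {"query": query_text}}},
                ],
                "filter": [                  # entitlement + metadata, pre-scoring
                    {"terms": {"sector": sectors}},
                    {"term": {"doc_type": "research_note"}},
                ],
            }
        },
        "_source": ["report_id", "page", "para", "section_id"],
    }

# The 50 hits go to Cohere Rerank 3; only the top-8 reach the prompt.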
Layer 7
Embeddings
Cohere embed-v3 multilingual via Bedrock · 1024-dim · EN ↔ ZH solid
$750 one-time bulk for the 7.5 B-token corpus · $2 / mo incremental + queries · stays inside AWS.
$5/mo
Bedrock embed
Layer 8
Generation
7 lanes · Bedrock SG + Alibaba HK · ranked by quality-per-dollar on a CLSA golden eval
Sonnet 4.6 · Opus 4.7 · Haiku 4.5 · Nova Pro · Llama 4 Scout (Bedrock) · Qwen3 Max · DeepSeek V4 (Alibaba).
$12.1 k/mo
Bedrock + Alibaba · post-cache
Layer 9
Tools (MCP)
5 MCP servers — Research, Market Data, News, Models, User
One protocol, all clients. Bloomberg / Refinitiv proxied; CLSA models native; user prefs sandboxed.
build
~6 eng-weeks
Layer 10
Storage
OpenSearch Serverless (vectors) · Aurora PostgreSQL (transactional) · S3 (reports) · ElastiCache (hot cache)
All in ap-southeast-1. Cross-region S3 replication to ap-northeast-1 off-hours for DR.
$330/mo
OpenSearch + S3
Layer 11
Observability
Per-query trace · token cost · cache hit · model picked · eval score (golden re-run nightly)
If we can't see what the router picked and why, we can't tune it. This is the discipline lever.
$250/mo
CloudWatch + X-Ray + Managed Grafana
A question becomes an answer · 8 stages

Read it left-to-right. p50 ≈ 3.5 s end-to-end.

first token < 800 ms · streaming · cancellable
1 · Question
User query
+ last 6 turns · user profile loaded
2 · Route
Classify · 1 of 7 lanes
language · complexity · tools · cost-tier
3 · Embed
Cohere embed-v3 · Bedrock
~25 ms · 35 % ElastiCache hit
4 · Retrieve
OpenSearch hybrid
dense + BM25 + filters → top-50
5 · Rerank
Cohere Rerank 3
top-50 → top-8 · ~120 ms
6 · Memory
Assemble context
short-term + user + episodic
7 · Generate
Routed model + MCP tools
8 · Render
Stream → artifacts
text + tables + Vega-Lite charts + citations
Embedding · the foundation

Embed once. Stay inside AWS.

The corpus is 300 k reports × ~50 pages × ~500 words ≈ 7.5 B tokens. We do this once. Cohere embed-v3 multilingual is hosted natively on Bedrock — no data leaves AWS, no extra vendor, multilingual EN ↔ ZH is solid for HK + Mainland coverage. Best-in-class for finance specifically would be Voyage-3 (top of MTEB Finance), but it's outside AWS — keep it as a future swap if eval shows the gap matters.

Embedder Where it runs Multilingual $ / 1 M tok Bulk cost Yearly Δ
Cohere embed-v3 multilingual Bedrock SG 100+ langs · solid ZH $0.10 $750 $25
Amazon Titan Text v2 Bedrock SG EN-strong · ok ZH $0.02 $150 $5
BGE-M3 (self-host on EC2 g6e) EC2 SG · custom Multilingual SOTA · open $0 marginal ~$120 GPU-hrs $300/mo HW
Qwen3-Embed (Alibaba DashScope) Alibaba HK Best ZH · solid EN $0.07 $525 $18
Voyage-3-large (finance-tuned) Voyage direct API · outside AWS EN ↔ ZH $0.18 $1,350 $45
Chunking · the unglamorous lever

Hierarchical, parent-child: we embed at three granularities — paragraph (~500 tok), section (~2 k tok), report-summary (~250 tok). Retrieval hits at the paragraph level, but we send the parent section to the LLM so it has flow.
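In code, parent-child is just a parent_id carried on every paragraph chunk. A sketch with assumed field names; the report-summary level is omitted for brevity.

from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str      # section this paragraph belongs to ("" at top level)
    level: str          # "paragraph" | "section" (summary level omitted here)
    text: str
    report_id: str
    page: int
    para: int

def chunk_report(report_id: str, sections: list[dict]) -> list[Chunk]:
    """Embed paragraphs; keep parent_id so a retrieval hit can be swapped
    for its full section before the prompt is assembled."""
    chunks = []
    for s, section in enumerate(sections):
        sec_id = f"{report_id}/sec{s}"
        chunks.append(Chunk(sec_id, "", "section", section["text"],
                            report_id, section["page"], 0))
        for p, para in enumerate(section["paragraphs"]):     # ~500 tok each
            chunks.append(Chunk(f"{sec_id}/p{p}", sec_id, "paragraph",
                                para["text"], report_id, para["page"], p))
    return chunks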

Citations as first-class data: each chunk carries report_id · page · para. The model is forbidden from answering without one — refusal is preferable to a hallucinated target price on a CLSA letterhead.

Quantization: scalar quant (int8) on the 1024-dim vectors → ~15 GB raw (15 M chunks × 1 KB each), ~22 GB with index and metadata overhead. Fits in a single OpenSearch Serverless collection — no sharding, no replica gymnastics.

Memory · short, long, episodic

Three stores. One assembler.

Frameworks like Letta and Mem0 are fine, but for finance Q&A the retrieval ranking is the product — we want explicit control over what hits the prompt. Three stores, one in-house assembler, ~600 lines of code.

Short-term · conversational
Last 6 turns, in-memory
Held in the LiveView process; rolls off after 6 turns or 30 min idle. ~3 k tokens of context, never embedded.
~3 k tok / msg · 0 ms · in-process
Long-term · user profile
Postgres · per-analyst record
Watchlist (tickers), sector coverage, role (sales vs research), language preference, tier. Loaded into prompt header — query rewriting honors the profile (e.g., "show CLSA's view on Tencent" already knows this user follows TCEHY).
~500 tok / msg · < 5 ms · single SQL
Long-term · episodic
Past conversations · embedded + ranked
Each completed session is summarized + embedded by the same Cohere embed-v3 model. "What did we conclude on Tencent last week?" hits a separate OpenSearch collection, scoped to user_id. Top-2 episodic hits land in the prompt only when their cosine clears 0.78.
~1 k tok when relevant · ~30 ms · OpenSearch scoped
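The whole assembler is a token-budgeted merge of the three stores. A sketch, assuming hypothetical store interfaces (fetch_profile, last_turns, search_episodic):

EPISODIC_MIN_COSINE = 0.78    # the recall bar from the episodic store above

def assemble_context(user_id: str, session, stores) -> str:
    """Merge the three stores into one prompt-ready context block."""
    parts = []
    # 1 · profile: single SQL row, ~500 tok, always present
    profile = stores.postgres.fetch_profile(user_id)   # watchlist · role · lang
    parts.append(f"[profile] {profile}")
    # 2 · short-term: last 6 turns, held in-process, never embedded
    parts += [f"[turn] {t}" for t in session.last_turns(6)]
    # 3 · episodic: top-2 past-session summaries, user-scoped, above the bar
    hits = stores.opensearch.search_episodic(user_id, session.query, k=2)
    parts += [f"[recall] {h.text}" for h in hits
              if h.cosine >= EPISODIC_MIN_COSINE]
    return "\n".join(parts)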
Tools · MCP-first

Five MCP servers. Same tools, every model.

MCP (Model Context Protocol) is the lingua franca for tool definitions. Anthropic, OpenAI, Google all consume it natively. We expose CLSA's data + models + news once, behind a single auth/entitlement gateway, and every router lane gets the same capabilities — no per-vendor adapters, no tool-list drift, no privilege escalation.

Servers
5
one per data domain
Tools exposed
22
across the 5 servers
Calls / day
~80 k
~4 tool calls per query avg
Latency p50
28 ms
gateway → backend round-trip
Auth model
SAML
CLSA SSO + per-tool entitlement
MCP / 1 · Research

clsa-research

~45 % calls
  • search_research_notes(query, filters) RAG entry point — hits OpenSearch hybrid index
  • get_note(note_id) Full note body + metadata + citation anchors
  • get_target_history(ticker) CLSA target-price evolution over time
  • list_recent_notes(sector, days) Fresh-content scan, sector-scoped
Connects: OpenSearch · S3 (note bodies) · Aurora (metadata)
Owned by: Research desk · platform on AI team
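Declaring a tool once, for every lane, looks like this with the MCP Python SDK's FastMCP helper. The server and tool names are from this page; the schema and the search helper are illustrative stand-ins, not the production server.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clsa-research")

def opensearch_hybrid_search(query, sector=None, days=365):
    return []           # stand-in for the Layer-6 hybrid query

@mcp.tool()
def search_research_notes(query: str, sector: str | None = None,
                          days: int = 365) -> list[dict]:
    """RAG entry point: hybrid search over the CLSA research index.
    Every chunk carries report_id / page / para so the model can cite."""
    hits = opensearch_hybrid_search(query, sector=sector, days=days)
    return [{"report_id": h.report_id, "page": h.page,
             "para": h.para, "text": h.text} for h in hits]

if __name__ == "__main__":
    mcp.run()           # one spec; Sonnet, Qwen, DeepSeek all consume it as-is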
MCP / 2 · Market data

clsa-marketdata

~25 % calls
  • get_quote(ticker) Live + intraday · 15-min cache for non-RT seats
  • get_chart_data(ticker, period, indicators) OHLCV for Vega-Lite spec rendering
  • get_financials(ticker, statement, period) IS / BS / CF · annual + quarterly
  • get_estimates(ticker) Consensus + CLSA forecast side-by-side
  • get_filings(ticker, type) 10-K, 10-Q, 6-K, prospectus
Connects: Bloomberg B-Pipe · Refinitiv · SEC EDGAR
Owned by: Market Data team · entitlement-gated per ticker
MCP / 3 · News & sentiment

clsa-news

~12 % calls
  • search_news(query, filters, since) Full-text news search · sector / region / source-aware
  • get_breaking(tickers, sectors) Last 60 min stories · ranked by impact score
  • get_sentiment(ticker, period) Rolling sentiment (FinBERT) · positive / negative / neutral counts
  • summarize_news(ticker, days) Daily / weekly digest with citation anchors
Connects: Bloomberg News · Reuters · SCMP / Caixin (HK + Mainland)
Owned by: News desk · 3-second freshness SLA
MCP / 4 · Valuation models

clsa-models

~10 % calls
  • run_dcf(ticker, assumptions) CLSA-branded DCF with override-able WACC / growth
  • peer_comparison(ticker, peers, metrics) Comp table · P/E, EV/EBITDA, P/B, growth
  • run_screener(criteria) Multi-stock filter · "show me HK tech < 15× P/E with positive cash"
  • scenario(ticker, deltas) Bear / base / bull on key drivers
Connects: Quants library (Python) · Aurora (cached results)
Owned by: Equity Research / Quants team
MCP / 5 · User context

clsa-user

~8 % calls
  • get_watchlist() Tickers the analyst follows · sector tagged
  • add_watchlist(ticker) Write — requires explicit user confirm in chat
  • set_preference(key, value) Language, default region, output verbosity
  • get_recent_history(days) Past Q&A summaries for episodic recall
Connects: Aurora · ElastiCache (sessions)
Owned by: AI platform team · read-only by default
Why MCP
One protocol. Define the tool once; every model lane (Sonnet, Opus, Haiku, Nova Pro, Llama 4, Qwen3, DeepSeek) consumes the same spec via native MCP support — no per-vendor adapter code.
Why split into 5
Different teams own different surfaces. Research desk owns notes; Market Data owns Bloomberg proxy; News desk owns the wire feed; Quants own the valuation models. MCP draws clean ownership boundaries — and clean audit trails.
Why now
The protocol stuck. Anthropic native, OpenAI compat, Google compat as of 2026. Building on MCP means future agents — and any model we add to the router — inherit our entire toolkit free.
The router · 7 lanes

Sonnet handles half the volume. The other six pick up the niches.

Each lane has a measurable specialty on the CLSA golden eval (200 questions, senior-analyst-rated). Traffic share is set by that benchmark and re-tuned monthly. Western frontier + open-weights served via AWS Bedrock ap-southeast-1; Chinese models served via Alibaba Cloud DashScope (HK / SG). Both providers support prompt caching — the Sonnet lane is where it compounds hardest.

Lane Model Provider Share Strength $ / mo $ / query
Default · synthesis Claude Sonnet 4.6 Bedrock SG 52 % Multi-doc EN synthesis · agentic flows · prompt-cache amplifier $8,330 $0.027
Cheap path · lookups Claude Haiku 4.5 Bedrock SG 15 % Fast factual lookups · "where did we mention X?" · summarization $800 $0.009
Mainland CN research Qwen3 Max Alibaba HK 10 % Native ZH · A-share / CSRC filings · prompt cache supported $864 $0.014
Structured · numeric Amazon Nova Pro Bedrock SG 8 % JSON-schema outputs · MCP tool chains · cheap structured calls $346 $0.007
Long-context lookups Llama 4 Scout Bedrock SG 8 % 10 M context window · whole-report scans · cheapest input tokens $86 $0.002
CN ↔ EN · cheap heavy DeepSeek V4 Alibaba HK 5 % Translation · cheap reasoning · 90 % off-peak discount window $74 $0.002
Premium escalation Claude Opus 4.7 Bedrock SG 2 % DCF reviews · target-price challenges · audit-grade reasoning $1,596 $0.133
Generation total: $12,096 / mo post-cache · ≈ $0.020 / query average. Sonnet handles half the volume but absorbs ~69 % of the LLM bill — which is exactly why the prompt-cache lever (next section) is the highest-leverage thing we ship.
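The lane economics tie out from the token math in the pricing footnote (10.5 k in + 1 k out per query, 70 % cache-eligible when warm). A worked check on the Sonnet lane; cached-read and input rates are from the cache section, the $15/M output rate is an assumption at Anthropic list.

IN_TOK, OUT_TOK, CACHE_HIT = 10_500, 1_000, 0.70     # token math, footnote
P_IN, P_CACHED, P_OUT = 3.00, 0.30, 15.00            # Sonnet $ per 1M tokens
                                                     # (output rate assumed)
per_query = (IN_TOK * (1 - CACHE_HIT) * P_IN         # cold input
             + IN_TOK * CACHE_HIT * P_CACHED         # cached reads at 90% off
             + OUT_TOK * P_OUT) / 1_000_000          # output dominates
print(f"${per_query:.3f} / query")                   # -> $0.027, as in the table

lane_month = per_query * 600_000 * 0.52              # 52% lane share
print(f"${lane_month:,.0f} / mo")                    # -> ~$8.3k of the $12.1k total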
Three questions the CEO will ask · answer them now

Expand the mix? · Gemma 4? · Caching everywhere?

The lineup we just drew is one defensible answer — these are the three pushbacks it has to survive. Numbers are for our 600 k msg/mo baseline (200 / analyst) at mid-2026 list pricing.

Q1 · Provider mix · Phase 3

Expand to Claude + DeepSeek + OpenAI + Gemini + MiniMax?

Each one wins a niche on our golden eval. GPT-5 is best on numeric / DCF challenges. Gemini 3.1 Flash beats Llama 4 Scout on multimodal long-context (charts inside PDFs). MiniMax M2 edges Sonnet on multi-tool agentic loops. But every additional provider is a contract, ZDR addendum, billing surface, and outage path.

Lever Δ vs current
Premium-query quality + 3 – 5 %
Multimodal long-context unlocks PDFs
Vendor count 2 → 5
Egress paths to manage + 3
$ delta / yr ~ flat
Verdict: Phase 1 ships Bedrock + Alibaba (current). Trigger the expansion in Phase 3 only if the golden eval shows a >5 % quality gap — or if Compliance signs off on multimodal PDFs going to Vertex.
Q2 · "Free" weights Myth-bust

Is Gemma 4 actually free? Should we self-host it?

The weights are free under Google's Gemma license. The deployment is not. A single L40S GPU runs Gemma 4 27B at FP8 — and that L40S costs ~$1,360 / mo on EC2 before EBS, networking, vLLM ops, monitoring, or HA.

Cheap-path lane · 48 k queries / mo $ / mo vs Bedrock
Llama 4 Scout · Bedrock (current) $86 1 ×
Gemma 4 27B · g6e.4xlarge · 1 × L40S $1,500 17 ×
Gemma 4 27B · 2 × L40S (HA + headroom) $3,000 35 ×
Gemma 4 27B · on owned H20s (sunk cost) $0 * 0 × (* marginal cost only; the capex is already spent)

And the quality argument is also weak: Gemma 4 27B trails Llama 4 Scout (109 B MoE) and DeepSeek V4 on most evals — there are stronger open-weights options if we ever do self-host.

Verdict: Self-hosting open weights on rented GPUs is 17 – 35 × more expensive than calling the same class of model on Bedrock. Gemma 4 only fits when we already own the GPUs (Phase 3 H20 hybrid) — and even then Llama 4 or DeepSeek V4 are stronger picks. Tell the CEO: "free weights" is real; "free deployment" is not.
Q3 · Cache support · Confirmed

Does prompt caching work everywhere we'd send traffic?

Yes — but the discount and the API differ. Bedrock matches Anthropic-direct on Claude at full parity. OpenAI runs automatic prefix caching. Gemini wants explicit cache management. Self-hosting on vLLM gets a free in-process KV cache — no list-price discount, but the FLOPs savings are real.

Provider · model Mechanism % off cached tokens
Bedrock · Claude Explicit cache_control blocks 90 %
Bedrock · Nova Pro Explicit blocks ~ 75 %
Bedrock · Llama 4 Implicit prefix · automatic ~ 50 %
OpenAI direct · GPT-5 Auto prefix · zero config 50 %
Vertex · Gemini 3.1 Flash Context Caching API ~ 75 %
Alibaba · Qwen / DeepSeek Implicit prefix ~ 50 %
Self-host vLLM In-process KV cache no $-discount
Verdict: Cache works everywhere meaningful — but our Sonnet-heavy mix lands on the best cache deal in the market. That's not coincidence; it's why Sonnet is the default lane.
If we break the AWS-only rule once

Best single swap: Gemini 3.1 Flash via Vertex into the long-context lane. Beats Llama 4 Scout on multimodal (PDF charts), ties on text — at similar cost. One extra vendor; opens multimodal.

If we go full multi-provider

The 5-vendor lineup (Claude + GPT-5 + Gemini + DeepSeek + MiniMax) tops the eval ~3 – 5 %. Roughly the same $. Operationally ~3 × the load (auth, billing, ZDR, on-call). Not a Phase 1 bet.

If the CEO insists on Gemma 4

Tell them: we'll deploy it the day we own the metal. On rented AWS GPUs it's 17 – 35 × the Bedrock equivalent. The maths only flips when the GPUs are paid-for.

The cost lever · 4 caches

Each layer chops a different cost. Stack them.

1

Embedding cache · ElastiCache

hash(query) → vector. 7-day TTL. Catches duplicate phrasing across analysts asking variants of the same thing.

35 %
hit rate
saves $3 / mo on embedding · really saves the 25 ms of latency
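Layer 1 above is ~10 lines of Redis. A sketch, assuming a placeholder ElastiCache endpoint and key scheme:

import hashlib, json
import redis

r = redis.Redis(host="research-cache.internal")      # placeholder endpoint

def cached_embed(query: str, embed_fn) -> list[float]:
    key = "emb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if (hit := r.get(key)) is not None:
        return json.loads(hit)                       # ~35% of calls end here
    vec = embed_fn(query)                            # Cohere embed-v3 on Bedrock
    r.setex(key, 7 * 86_400, json.dumps(vec))        # 7-day TTL
    return vec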
2

Retrieval cache · ElastiCache

hash(embedding + filters) → top-K doc IDs. 10-min TTL — keeps fresh notes surfacing fast.

25 %
hit rate
saves on OpenSearch + Bedrock Rerank round-trips · ~150 ms p50
3

Bedrock + Alibaba prompt cache · Biggest saver

Bedrock supports prompt caching for Claude (Sonnet / Opus / Haiku) + Nova at full Anthropic-parity discounts — $0.30/M cached vs $3/M uncached on Sonnet. Alibaba DashScope offers comparable cached-read pricing for Qwen3 Max and DeepSeek. ~70 % of our 10.5 k input tokens hit cache on warm sessions.

90 %
input savings
saves ~$3.5 k / mo · zero quality cost · just a header to enable
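Enabling it is one explicit cache_control block on the static prefix (the Q3 table above: explicit blocks for Claude via Bedrock). A sketch of the Bedrock InvokeModel request body, with placeholder prompt variables:

SYSTEM_PROMPT, chunks, question = "...", "...", "..."     # placeholders

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1500,
    "system": [{
        "type": "text",
        "text": SYSTEM_PROMPT + chunks,               # the ~10.5k-tok static prefix
        "cache_control": {"type": "ephemeral"},       # re-read at $0.30/M, not $3/M
    }],
    "messages": [{"role": "user", "content": question}],  # only this stays cold
}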
4

Semantic answer cache

Cosine ≥ 0.96 → cached answer. 24-hr TTL · invalidated on new ingest in the relevant sector. Only serves on a current-context hit, otherwise bypassed.

15 %
hit rate
saves the full inference cost · ~$900 / mo · be conservative on TTL
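Layer 4 in sketch form, with an in-memory list standing in for the real store: the 0.96 cosine bar, the current-context guard, and sector-scoped invalidation.

import math, time

CACHE: list[dict] = []    # stand-in: {vec, ctx, answer, expires, sector}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(query_vec: list[float], context_hash: str):
    now = time.time()
    for e in CACHE:
        if (e["expires"] > now and e["ctx"] == context_hash   # current-context hit
                and cosine(query_vec, e["vec"]) >= 0.96):     # near-duplicate query
            return e["answer"]                                # ~15% end here
    return None                                               # bypass: full pipeline

def invalidate(sector: str) -> None:       # fired on new ingest in that sector
    CACHE[:] = [e for e in CACHE if e["sector"] != sector]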
Visualization · structured artifacts

Tables and charts come from tools, not markdown.

Asking an LLM to draw an ASCII table is asking for off-by-one errors. Numbers come from get_quote / get_financials as JSON; the model wraps them in prose. The frontend renders proper tables and Vega-Lite charts from the structured tool outputs. Streaming is text + artifacts interleaved.

Pattern A · Table
Tool returns rows
{
  "headers": ["Ticker", "Px", "TP", "Δ%"],
  "rows": [
    ["TCEHY", 412.5, 480, "+16%"],
    ["BABA", 81.2, 110, "+35%"]
  ]
}
Frontend renders an actual <table> with tnum, sortable, exportable to Excel.
Pattern B · Chart
Vega-Lite spec
{
  "mark": "line",
  "data": {"name": "get_chart_data(TCEHY, 1Y)"},
  "encoding": {
    "x": {"field": "date", "type": "temporal"},
    "y": {"field": "close", "type": "quantitative"}
  }
}
Vega-Lite is renderer-agnostic. The model picks the chart type; the data comes from the tool, not the model.
Pattern C · Citation
Inline footnote
"...CLSA reiterated O-PF
[¹note_id=84319, p.4]"
Every claim sourced. Hover footnote → preview the source paragraph. Refusal-to-answer is the default if no citation is available.
The H20 question · the one the room actually wants answered

Six H20s buys 6× the cost for ~1× the user-side gain.

Verdict
Don't buy.
unless Compliance forces — then buy two.
What H20 actually is
China-export silicon · memory-preserved, compute-cut.
H20 H100 H200
BF16 TFLOPS 148 989 989
VRAM (GB) 96 80 141
HBM bw (TB/s) 4.0 3.4 4.8
Decode (memory-bound) competitive baseline +30 %
Prefill (compute-bound) ~6× slower baseline baseline
Memory-bandwidth parity is why H20 inference looks okay on benchmarks. Long-context prefill is where the compute cut hurts — and 50-page reports are exactly that workload.
How many users it carries
8-GPU H20 server · DeepSeek V4 FP8 · realistic numbers.
Sustained output ~2,000 tok/s
Concurrent streaming users (80 tok/s/user) ~25
Daily messages capacity ~70 k
We see ~27 k msg/day at 600 k/mo (200 / analyst), peaking ~5×. 3 servers cover peak with HA + ~20 % headroom. 6 servers would run at ~30 % utilization even at the daily peak — far less the rest of the day.
When H20 becomes rational
Three trigger conditions. Need at least one.
  • Compliance mandates on-prem. CSRC / SFC / internal counsel says research bodies cannot leave HK. Buy 2; run open weights for everything; cloud for nothing sensitive.
  • Fine-tuning becomes a product line. We want to train a CLSA-tuned 70 B model on internal notes. Then 4 – 6 servers earns its keep.
  • Scale jumps 5×. 15 k MAU + agentic tool loops + multi-modal inputs. At that point cloud bills also rise to where hybrid breaks even.
None of these are true today. Re-evaluate quarterly.
1
machine
8 GPU · 768 GB VRAM
tight
Just covers peak. Single point of failure. Pilot only.
~$285 k / yr 5-yr equiv · 1 × throughput
2
machines
16 GPU · 1.5 TB VRAM
right size
Active-active HA. ~20 % headroom for traffic growth or model swaps. If on-prem becomes mandatory, this is the answer.
~$213 k / yr · 2 × throughput · saves $1.05 M capex vs Laurent's quote
6
machines
48 GPU · 4.6 TB VRAM
over-built
Sized for fine-tuning + multi-model + training. For pure inference at our volume: ~30 % peak utilization.
~$640 k / yr · 6 × throughput we don't need
The monthly bill · where every dollar lands

$14,292 a month. Sonnet is ~60 % of it.

Bedrock SG · Alibaba HK · prompt-cache · 70 % warm
Monthly all-in · 600 k msg · 200 / analyst
$14,292 / month
Sonnet 4.6 · Bedrock $8,330
Opus 4.7 · Bedrock $1,596
Cohere Rerank 3 · Bedrock $1,200
Qwen3 Max · Alibaba $864
Haiku 4.5 · Bedrock $800
Nova Pro · Bedrock $346
DeepSeek V4 · Alibaba $74
Llama 4 Scout · Bedrock $86
ECS Fargate + ALB $300
Aurora + ElastiCache $200
OpenSearch · S3 · embed $240
CloudWatch + X-Ray + AMG $250
Sensitivity · how the bill scales with usage per analyst
token-cost models scale linearly · AWS infra ($990 / mo) is fixed
Light · 100 msg / analyst
$30.55
$7,640 / mo · 300 k msg · 3.45 B tok ~0.10 % of one Bloomberg seat
Baseline · 200 msg / analyst ★
$57.15
$14,290 / mo · 600 k msg · 6.90 B tok ~0.19 % of one Bloomberg seat
Heavy · 300 msg / analyst
$83.70
$20,930 / mo · 900 k msg · 10.35 B tok ~0.28 % of one Bloomberg seat
Per query (avg) $0.024
Per analyst · month $4.76
Per analyst · year $57.17
vs 6× H20 over 5 yr save $2.34 M
Sensitivity · what if we're wrong about volume

Cloud wins from 100 to 300 msg / analyst. Hybrid only when MAU triples.

Scenario MAU msg / mo Cloud / yr 2× H20 / yr 6× H20 / yr Best path
Light · 100 msg / analyst 3 k 300 k $92 k $213 k $640 k Cloud
Baseline · 200 msg / analyst 3 k 600 k $171 k $213 k $640 k Cloud
Heavy · 300 msg / analyst 3 k 900 k $251 k $330 k $640 k Cloud
Roll-out to whole firm (200 / user) 10 k 2 M $544 k $420 k $720 k 2× H20 hybrid
External-client agent (300 / user) 30 k 9 M $2.41 M $700 k $890 k 4× H20 hybrid
Compliance forces on-prem · today 3 k 600 k $213 k $640 k 2× H20 hybrid
Cloud-API cost scales near-linearly with messages (output tokens dominate). Self-host has a fixed-cost floor — past ~1.5 M msg/mo the curves cross. At 200 / analyst we're 2.5× below that. Re-decide every 6 months.
Quality × Cost · why the router exists

Each model wins a region. Routing is just respecting that.

[Quality × cost map · x: cheaper → more expensive · y: higher quality · points: Opus 4.7, GPT-5.5, Sonnet 4.6, GPT-5 mini, Qwen3 Max, Haiku 4.5, Nova Pro, DeepSeek V4, Llama 4 Scout, Gemma 4 27B, plus self-host points Gemma 4 on L40S and DeepSeek V4 on H20]

Frontier Claude (Opus, Sonnet) holds the upper-right via Bedrock SG. They cost more and deliver better answers on complex multi-doc finance work — used sparingly, cached aggressively. GPT-5.5 sits at the same tier (priciest, top-quality on numeric reasoning) but lives outside Bedrock — included for reference, not in the Phase-1 router.

Mid-tier specialists hold their own corners — Qwen3 Max on Alibaba (Mainland filings), Haiku 4.5 + Nova Pro on Bedrock (fast factual lookups, JSON-schema tool calls). GPT-5 mini slots near them on cost-quality but, again, outside Bedrock.

Llama 4 Scout sits in the lower-left: cheap, fast, 10 M context. Gemma 4 27B sits below it — slightly weaker on multi-doc reasoning, similar cost via OpenRouter / hosted endpoints. Reference point, not a router lane today. DeepSeek V4 covers cheap CN-EN heavy reasoning via Alibaba.

Self-host points land strictly south-east of the cloud equivalent — same weights, more cost, similar quality. Gemma 4 on a rented L40S is the most expensive way to be mid-tier. That's the trade we'd be buying.

Compliance · risk · BCM

What can break this plan. How we cover each break.

Data residency

All Western-model traffic stays inside Bedrock ap-southeast-1 — AWS contractually does not retain or train on inference data. Chinese-model traffic goes to Alibaba HK / SG with the same enterprise no-retention terms. Vector store + episodic memory live in OpenSearch ap-southeast-1 and never leave the region.

Citation discipline

A hallucinated CLSA target price is a regulatory event. The orchestrator refuses to answer without an MCP-backed citation. Output schema validates the citation field. Eval harness asserts citation-presence on the golden set.
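The gate itself is small. A sketch, assuming the inline-footnote format from the artifacts section and a set of note IDs actually retrieved this turn:

import re

CITE = re.compile(r"\[¹?note_id=(\d+), p\.(\d+)\]")    # format from Pattern C

def gate(answer: str, retrieved_note_ids: set[str]) -> str:
    """Refusal beats hallucination: no resolvable citation, no answer."""
    cites = CITE.findall(answer)
    if not cites:
        return "I can't cite a CLSA source for this, so I won't answer."
    for note_id, _page in cites:
        if note_id not in retrieved_note_ids:   # must point at real context
            return "A citation failed to resolve to retrieved research, so I won't answer."
    return answer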

Provider outage

Each lane has a fallback. If Bedrock SG flakes on Sonnet → cross-region inference to ap-northeast-1 (Tokyo) at parity pricing. If Bedrock-wide outage → degrade to Alibaba (Qwen3 Max) with a banner. DeepSeek V4 stays as the open-weights last resort, callable on the future H20 box if compliance ever forces full self-host.

Entitlement leakage

Every retrieved chunk passes through the entitlement layer before context assembly. A junior analyst's prompt cannot pull a senior-only note, even if the embedding is "close." Filter happens pre-rerank — not post.
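Pre-rerank filtering is a one-pass drop, not a ranking tweak. A sketch with an assumed tier model and field names:

TIER_RANK = {"junior": 0, "senior": 1, "restricted": 2}

def entitlement_filter(chunks: list[dict], user_tier: str) -> list[dict]:
    """Drop anything above the caller's tier BEFORE reranking or prompting."""
    rank = TIER_RANK[user_tier]
    return [c for c in chunks if TIER_RANK[c["min_tier"]] <= rank]

# Pipeline order: top-50 -> entitlement_filter -> Rerank 3 -> top-8.
# Filtering after rerank would still send restricted text to the reranker
# and could leave fewer than 8 permitted chunks for the prompt.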

Eval drift

Golden eval (200 senior-rated questions) runs nightly. Each model's score is tracked over time; a regression triggers a router-rule recompute. Without this discipline, "vibes" wins every argument and the cost story collapses.

Note freshness

New notes appear in retrieval within 5 min of publish — embed pipeline is event-driven on the publish webhook. Retrieval-cache TTL is 10 min, so freshness has a tight SLA.
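The 5-minute SLA is one event-driven path: publish webhook → chunk → embed → index → invalidate. A sketch composing the helpers from earlier sketches (chunk_report, cached_embed, invalidate), with stand-ins for the rest:

def fetch_note(note_id: str) -> dict:      # stand-in: S3 body + metadata
    return {"report_id": note_id, "sections": [], "sector": "internet"}

def bedrock_embed(text: str) -> list[float]:
    return [0.0] * 1024                    # stand-in: Cohere embed-v3 call

def opensearch_bulk_index(pairs) -> None:  # stand-in: bulk index API
    pass

def on_publish(event: dict) -> None:
    """Publish webhook -> searchable in well under the 5-min SLA."""
    note = fetch_note(event["note_id"])
    chunks = chunk_report(note["report_id"], note["sections"])       # chunking sketch
    vectors = [cached_embed(c.text, bedrock_embed) for c in chunks]  # cache sketch
    opensearch_bulk_index(zip(chunks, vectors))
    invalidate(note["sector"])             # semantic-answer-cache sketch
    # retrieval cache TTL is 10 min, so worst-case staleness stays bounded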

Roadmap · how we get there

Four phases, four months. Hybrid decision lives in phase 3.

Phase 0 · M0–1
Foundation
Embed corpus with Cohere embed-v3 (Bedrock) · stand up OpenSearch in ap-southeast-1 · basic RAG with Sonnet 4.6 · golden eval v1 (200 Q) · pilot 50 analysts.
Phase 1 · M1–3
Production
Multi-model router (Bedrock + Alibaba) · prompt-cache enabled · 3 MCP servers (Research + Market Data + News) · structured artifacts + citations · roll to 1 k users.
Phase 2 · M3–6
Scale
4-layer cache · episodic memory · Models + User MCP · semantic answer cache · roll to all 3 k analysts · CN-language polish (Qwen + DeepSeek).
Phase 3 · M6–12
Optimize · decide
Measured re-decision: hybrid 2× H20 if compliance / scale demands · embedding fine-tune on CLSA query log · evaluate fine-tuned 70 B for the cheap-path lane.
For the steering meeting

Six questions to settle before phase 1 ships.

Compliance sign-off

Does Compliance accept research-note bodies traversing AWS Bedrock SG + Alibaba HK with no-retention contracts? If yes — Phase 1 ships cloud. If no — Phase 1 still ships cloud for non-sensitive flows; 2× H20 procurement starts in parallel for the sensitive subset.

Eval ownership

Who owns the 200-Q golden set and the nightly score? Without a named owner the router devolves to "vibes" inside a quarter.

MCP team boundaries

Research desk, Market Data, News desk, and Quants team each own one MCP server (User MCP stays with the AI platform team). Are those teams staffed for this, or does the AI team carry all five through Phase 2?

Visibility scope

Should chat history be visible to a manager / mentor for analyst training? Yes / no answers shape episodic memory's privacy boundary.

Budget envelope

$171 k/yr is the recommended path at 200 msg/analyst. Is the firm comfortable above $250 k/yr if usage spikes to 300/analyst (or doubles to 400), or do we put a router-level budget cap (escalation lane disabled past $X)?

External clients

Is this analyst-only, or do we eventually expose this to institutional buy-side clients? "Yes" pushes us into Phase 3 (10 k+ MAU) and changes the GPU answer.

Cost figures use mid-2026 list prices from the Pricing tab. All Western models priced at AWS Bedrock ap-southeast-1 (Singapore) list rates; Bedrock matches Anthropic-direct on Claude (Sonnet / Opus / Haiku) at full prompt-cache parity, and Nova / Llama / Mistral are first-party Bedrock pricing. Chinese models priced at Alibaba Cloud DashScope (HK) list rates with prompt-cache discount applied where supported. Token math: 10.5 k input + 1 k output per query (system + 8 reranked chunks + 6-turn history + structured output budget), 70 % cache eligibility on warm sessions. Self-host annualized cost prorates Laurent's 6× quote ($1.7 M capex / 5 yr + $296 k/yr opex) to per-machine and rebuilds opex at sub-linear scaling. H20 throughput numbers assume DeepSeek V4 671 B / 37 B-active FP8 with vLLM-class serving. Cross-region inference (Bedrock CRI) treated at parity. Pricing dashboard: /strategy and /gpu-machines.