CLSA Research Q&A · 3 k analysts · Hong Kong · Recommendation

Cloud-first, multi-model, MCP-native.
Skip the GPUs.

At 3 k analysts and ~600 k messages a month (200 / analyst), the right answer is not a server room — it's a router. Send the query to the model that wins on quality-per-dollar (Bedrock SG for Western frontier, Alibaba HK for Chinese models), retrieve from a 300 k-report OpenSearch index, cache four layers deep, and let MCP servers expose CLSA Research, Bloomberg, and our own valuation models as tools.

$57.17
per analyst · per year · all-in.  ~0.2 % of one Bloomberg seat.
Monthly bill
$14,292
Annual bill
$171 k
vs Perplexity Finance
3.1× cheaper
Corpus
300 k
research reports indexed
Active users
3 k
monthly · 2× today's 1.5 k
Messages
600 k
/ month · 200 per analyst
Tokens
6.90 B
/ month · ~70 % cache-eligible
Models in rotation
7
routed by query class
Latency target
p50 3.5 s
first token < 800 ms
Assumptions: 3 k analysts × 200 msg / month → 600 k msg · $14,292 / mo · $171 k / yr

Why these defaults: 50 k registered users, 1.5 k MAU today. The 3 k baseline is a 2× target — headroom for the assistant becoming a daily habit, not a one-time sizing for today's MAU.

5-year TCO · the only chart that matters

Three paths. The cheap one is also the smartest one.

Same RAG plumbing, same MCP tools, same UX. The only thing that changes is where the model weights live and who writes the cheque. Bars are 5-year all-in: capex amortized + opex + cloud APIs that remain after self-host.

Cloud headroom vs 6× H20
+$2.34 M
5-year saving · spend it on data scientists or stronger evals
Cloud + Router (recommended): $857 k / 5 yr · $171 k / year
2× H20 hybrid (defensible if compliance forces): $1.07 M / 5 yr · $213 k / year
6× H20 on-prem (Laurent's quote: $1.7 M capex + $296 k/yr opex): $3.20 M / 5 yr · $640 k / year

Cloud at 600 k msg / mo includes 7 Bedrock + Alibaba models, Cohere Rerank 3, Cohere embeddings, OpenSearch, Aurora, ElastiCache, CloudWatch.

2× H20 hybrid self-hosts DeepSeek V4 / Qwen3 MoE for the cheap-path 50 % of traffic; premium queries still hit Sonnet / Opus on Bedrock.

6× H20 assumes 70 % of traffic served on-prem; remaining premium queries still hit cloud. ~30 % utilization even at peak — still over-built for inference alone.

The three live options

Pick once. Make the second-cheapest one your plan B.

Option A · Recommended

Cloud-first router

Sonnet 4.6 default · Opus 4.7 · Haiku 4.5 · Nova Pro · Llama 4 Scout (Bedrock) + Qwen3 Max · DeepSeek V4 (Alibaba) · 4-layer cache.

$14.3 k
/ month
$171 k / yr · $857 k / 5 yr · $4.76 / user-mo
  • Best quality on every query class — frontier models win on multi-doc finance.
  • Switch model in a YAML — no hardware to depreciate.
  • Bedrock ap-southeast-1 + Alibaba HK · < 60 ms from any analyst seat.
  • No-retention enterprise contracts on both providers.
Option B · Plan B

2× H20 hybrid

Self-host DeepSeek V4 (or Qwen3 MoE) on 16 GPUs · cheap path on-prem · premium queries on cloud.

$17.8 k
/ month
$213 k / yr · $1.07 M / 5 yr · $5.92 / user-mo
  • Active-active HA · ~2 000 tok/s sustained on DeepSeek V4 FP8.
  • Quality regresses 5 – 10 % on premium queries — keep cloud for those.
  • Real value: data sovereignty if Compliance flags cloud egress.
  • Scale path: add a 3rd machine when MAU clears 6 k.
Option C · Over-built

6× H20 on-prem

48 GPUs · 4.6 TB VRAM · sized for training + multi-model. ~2× larger than inference demand at our volume.

$53.3 k
/ month avg
$640 k / yr · $3.20 M / 5 yr · $17.78 / user-mo
  • Peak utilization ~30 % on inference alone — most of the metal sits idle even at the daily spike.
  • H20 is compute-cut silicon: memory bandwidth on par with H100 (slightly higher, per the table below) but ~6 – 7× fewer FLOPS.
  • $1 M+ capex over-spend vs Option B with no measurable user-side gain.
  • Only rational if we're also doing fine-tunes / training in-house.
The full stack · top to bottom

Eleven layers. Each one is a build-or-buy decision.

Buy the layers where the market is competitive (models, embeddings, rerankers, vector DB). Build the layers that are the product (router, memory, MCP tools, eval harness). Skip the layers that don't move quality (fine-tuning, custom inference servers).

Layer 1
Frontend
Streaming chat UI · structured artifacts · client-rendered tables and charts
Tables and charts render client-side from tool-call JSON, not markdown — guaranteed correctness.
$200/mo
CloudFront + ALB · ap-southeast-1
Layer 2
Auth + sessions
SSO via existing CLSA SAML · session state in Postgres
Tier and entitlements come from upstream — no auth invented in this stack.
~$0
existing infra
Layer 3
Memory
Conversational + user-profile + episodic — three stores, one assembler
Aurora PostgreSQL for user profile + watchlist · OpenSearch Serverless for episodic memory · ElastiCache Redis for last-N turns.
$200/mo
Aurora + ElastiCache
Layer 4
Backend
Router → Memory → RAG → Generate · in-house orchestrator, no framework lock-in
Streamed token-by-token. Tool calls run in parallel. Per-session process isolation, cancellable.
build
~3 eng-weeks
Layer 5
Router
Heuristics + Gemini 3.1 Flash-Lite classifier · 7 model lanes
< 50 ms classification. Falls back to Sonnet 4.6 on ambiguity. Hot-reloadable rules.
$30/mo
Flash classifier
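The Layer-5 contract fits on a page. A minimal sketch, assuming illustrative lane and model names and inlining rules that production would hot-reload from YAML: cheap heuristics first, the Flash-Lite classifier only on the residue, Sonnet on ambiguity.

import re

# Lane table: production hot-reloads this from YAML; inlined for the sketch.
LANES = {
    "default":      "claude-sonnet-4.6",   # multi-doc synthesis
    "cheap":        "claude-haiku-4.5",    # fast factual lookups
    "cn_research":  "qwen3-max",           # native-ZH filings
    "structured":   "nova-pro",            # JSON-schema tool chains
    "long_context": "llama-4-scout",       # whole-report scans
    "cn_en_heavy":  "deepseek-v4",         # translation / cheap reasoning
    "premium":      "claude-opus-4.7",     # audit-grade escalation
}

def route(query: str, wants_json: bool = False, doc_tokens: int = 0) -> str:
    """Pick a lane in < 50 ms; anything ambiguous falls to the default lane."""
    if wants_json:
        return "structured"
    if doc_tokens > 200_000:                      # whole-report scan
        return "long_context"
    if re.search(r"[\u4e00-\u9fff]", query):      # CJK characters present
        return "cn_research"
    if len(query.split()) < 8 and "?" in query:   # short factual lookup
        return "cheap"
    # Heuristics inconclusive: ask the Flash-Lite classifier (not shown);
    # on low confidence or timeout, fall back to "default" (Sonnet).
    return "default"

print(LANES[route("腾讯的目标价是多少?")])          # -> qwen3-max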
Layer 6
Retrieval
Hybrid: dense (Cohere embed-v3) + BM25 + metadata filters · top-50 → Cohere Rerank 3 → top-8
All via Bedrock SG. BM25 catches ticker / analyst-name lookups that pure semantics miss. Reranker is the quality lever.
$700/mo
Bedrock Cohere · OpenSearch
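What one hybrid request looks like. A sketch assuming hypothetical index and field names (embedding, body, sector): both arms run in a single OpenSearch query, filters apply before scoring, and the top-50 goes on to Rerank 3.

def hybrid_query(query_text: str, query_vec: list[float], sectors: list[str]) -> dict:
    """One OpenSearch request: dense k-NN + BM25, filtered, top-50 out."""
    return {
        "size": 50,
        "query": {
            "bool": {
                "should": [
                    # dense arm: Cohere embed-v3 vector
                    {"knn": {"embedding": {"vector": query_vec, "k": 50}}},
                    # sparse arm: BM25 catches tickers and analyst names
                    {"match": {"body": {"query": query_text}}},
                ],
                "filter": [                  # entitlement + metadata, pre-scoring
                    {"terms": {"sector": sectors}},
                    {"term": {"doc_type": "research_note"}},
                ],
            }
        },
        "_source": ["report_id", "page", "para", "section_id"],
    }

# The 50 hits go to Cohere Rerank 3; only the top-8 reach the prompt.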
Layer 7
Embeddings
Cohere embed-v3 multilingual via Bedrock · 1024-dim · EN ↔ ZH solid
$750 one-time bulk for the 7.5 B-token corpus · $2 / mo incremental + queries · stays inside AWS.
$5/mo
Bedrock embed
Layer 8
Generation
7 lanes · Bedrock SG + Alibaba HK · ranked by quality-per-dollar on a CLSA golden eval
Sonnet 4.6 · Opus 4.7 · Haiku 4.5 · Nova Pro · Llama 4 Scout (Bedrock) · Qwen3 Max · DeepSeek V4 (Alibaba).
$12.1 k/mo
Bedrock + Alibaba · post-cache
Layer 9
Tools (MCP)
5 MCP servers — Research, Market Data, News, Models, User
One protocol, all clients. Bloomberg / Refinitiv proxied; CLSA models native; user prefs sandboxed.
build
~6 eng-weeks
Layer 10
Storage
OpenSearch Serverless (vectors) · Aurora PostgreSQL (transactional) · S3 (reports) · ElastiCache (hot cache)
All in ap-southeast-1. Cross-region S3 replication to ap-northeast-1 off-hours for DR.
$330/mo
OpenSearch + S3
Layer 11
Observability
Per-query trace · token cost · cache hit · model picked · eval score (golden re-run nightly)
If we can't see what the router picked and why, we can't tune it. This is the discipline lever.
$250/mo
CloudWatch + X-Ray + Managed Grafana
A question becomes an answer · 8 stages

Read it left-to-right. p50 ≈ 3.5 s end-to-end.

first token < 800 ms · streaming · cancellable
1 · Question
User query
+ last 6 turns · user profile loaded
2 · Route
Classify · 1 of 7 lanes
language · complexity · tools · cost-tier
3 · Embed
Cohere embed-v3 · Bedrock
~25 ms · 35 % ElastiCache hit
4 · Retrieve
OpenSearch hybrid
dense + BM25 + filters → top-50
5 · Rerank
Cohere Rerank 3
top-50 → top-8 · ~120 ms
6 · Memory
Assemble context
short-term + user + episodic
7 · Generate
Routed model + MCP tools
8 · Render
Stream → artifacts
text + tables + Vega-Lite charts + citations
Embedding · the foundation

Embed once. Stay inside AWS.

The corpus is 300 k reports × ~50 pages × ~500 words ≈ 7.5 B tokens. We do this once. Cohere embed-v3 multilingual is hosted natively on Bedrock — no data leaves AWS, no extra vendor, multilingual EN ↔ ZH is solid for HK + Mainland coverage. Best-in-class for finance specifically would be Voyage-3 (top of MTEB Finance), but it's outside AWS — keep it as a future swap if eval shows the gap matters.

Embedder Where it runs Multilingual $ / 1 M tok Bulk cost Yearly Δ
Cohere embed-v3 multilingual Bedrock SG 100+ langs · solid ZH $0.10 $750 $25
Amazon Titan Text v2 Bedrock SG EN-strong · ok ZH $0.02 $150 $5
BGE-M3 (self-host on EC2 g6e) EC2 SG · custom Multilingual SOTA · open $0 marginal ~$120 GPU-hrs $300/mo HW
Qwen3-Embed (Alibaba DashScope) Alibaba HK Best ZH · solid EN $0.07 $525 $18
Voyage-3-large (finance-tuned) Voyage direct API · outside AWS EN ↔ ZH $0.18 $1,350 $45
Chunking · the unglamorous lever

Hierarchical, parent-child: we embed at three granularities — paragraph (~500 tok), section (~2 k tok), report-summary (~250 tok). Retrieval hits at the paragraph level, but we send the parent section to the LLM so it has flow.
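In code, parent-child is just a parent_id carried on every paragraph chunk. A sketch with assumed field names; the report-summary level is omitted for brevity.

from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str      # section this paragraph belongs to ("" at top level)
    level: str          # "paragraph" | "section" (summary level omitted here)
    text: str
    report_id: str
    page: int
    para: int

def chunk_report(report_id: str, sections: list[dict]) -> list[Chunk]:
    """Embed paragraphs; keep parent_id so a retrieval hit can be swapped
    for its full section before the prompt is assembled."""
    chunks = []
    for s, section in enumerate(sections):
        sec_id = f"{report_id}/sec{s}"
        chunks.append(Chunk(sec_id, "", "section", section["text"],
                            report_id, section["page"], 0))
        for p, para in enumerate(section["paragraphs"]):     # ~500 tok each
            chunks.append(Chunk(f"{sec_id}/p{p}", sec_id, "paragraph",
                                para["text"], report_id, para["page"], p))
    return chunks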

Citations as first-class data: each chunk carries report_id · page · para. The model is forbidden from answering without one — refusal is preferable to a hallucinated target price on a CLSA letterhead.

Quantization: scalar quant (int8) on the 1024-dim vectors → ~15 GB raw (15 M chunks × 1 KB each), ~22 GB with index and metadata overhead. Fits in a single OpenSearch Serverless collection — no sharding, no replica gymnastics.

Memory · short, long, episodic

Three stores. One assembler.

Frameworks like Letta and Mem0 are fine, but for finance Q&A the retrieval ranking is the product — we want explicit control over what hits the prompt. Three stores, one in-house assembler, ~600 lines of code.

Short-term · conversational
Last 6 turns, in-memory
Held in the LiveView process; rolls off after 6 turns or 30 min idle. ~3 k tokens of context, never embedded.
~3 k tok / msg · 0 ms · in-process
Long-term · user profile
Postgres · per-analyst record
Watchlist (tickers), sector coverage, role (sales vs research), language preference, tier. Loaded into prompt header — query rewriting honors the profile (e.g., "show CLSA's view on Tencent" already knows this user follows TCEHY).
~500 tok / msg · < 5 ms · single SQL
Long-term · episodic
Past conversations · embedded + ranked
Each completed session is summarized + embedded by the same Cohere embed-v3 model. "What did we conclude on Tencent last week?" hits a separate OpenSearch collection, scoped to user_id. Top-2 episodic hits land in the prompt only when their cosine clears 0.78.
~1 k tok when relevant · ~30 ms · OpenSearch scoped
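The whole assembler is a token-budgeted merge of the three stores. A sketch, assuming hypothetical store interfaces (fetch_profile, last_turns, search_episodic):

EPISODIC_MIN_COSINE = 0.78    # the recall bar from the episodic store above

def assemble_context(user_id: str, session, stores) -> str:
    """Merge the three stores into one prompt-ready context block."""
    parts = []
    # 1 · profile: single SQL row, ~500 tok, always present
    profile = stores.postgres.fetch_profile(user_id)   # watchlist · role · lang
    parts.append(f"[profile] {profile}")
    # 2 · short-term: last 6 turns, held in-process, never embedded
    parts += [f"[turn] {t}" for t in session.last_turns(6)]
    # 3 · episodic: top-2 past-session summaries, user-scoped, above the bar
    hits = stores.opensearch.search_episodic(user_id, session.query, k=2)
    parts += [f"[recall] {h.text}" for h in hits
              if h.cosine >= EPISODIC_MIN_COSINE]
    return "\n".join(parts)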
Tools · MCP-first

Five MCP servers. Same tools, every model.

MCP (Model Context Protocol) is the lingua franca for tool definitions. Anthropic, OpenAI, Google all consume it natively. We expose CLSA's data + models + news once, behind a single auth/entitlement gateway, and every router lane gets the same capabilities — no per-vendor adapters, no tool-list drift, no privilege escalation.

Servers
5
one per data domain
Tools exposed
22
across the 5 servers
Calls / day
~80 k
~4 tool calls per query avg
Latency p50
28 ms
gateway → backend round-trip
Auth model
SAML
CLSA SSO + per-tool entitlement
MCP / 1 · Research

clsa-research

~45 % calls
  • search_research_notes(query, filters) RAG entry point — hits OpenSearch hybrid index
  • get_note(note_id) Full note body + metadata + citation anchors
  • get_target_history(ticker) CLSA target-price evolution over time
  • list_recent_notes(sector, days) Fresh-content scan, sector-scoped
Connects: OpenSearch · S3 (note bodies) · Aurora (metadata)
Owned by: Research desk · platform on AI team
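Declaring a tool once, for every lane, looks like this with the MCP Python SDK's FastMCP helper. The server and tool names are from this page; the schema and the search helper are illustrative stand-ins, not the production server.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clsa-research")

def opensearch_hybrid_search(query, sector=None, days=365):
    return []           # stand-in for the Layer-6 hybrid query

@mcp.tool()
def search_research_notes(query: str, sector: str | None = None,
                          days: int = 365) -> list[dict]:
    """RAG entry point: hybrid search over the CLSA research index.
    Every chunk carries report_id / page / para so the model can cite."""
    hits = opensearch_hybrid_search(query, sector=sector, days=days)
    return [{"report_id": h.report_id, "page": h.page,
             "para": h.para, "text": h.text} for h in hits]

if __name__ == "__main__":
    mcp.run()           # one spec; Sonnet, Qwen, DeepSeek all consume it as-is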
MCP / 2 · Market data

clsa-marketdata

~25 % calls
  • get_quote(ticker) Live + intraday · 15-min cache for non-RT seats
  • get_chart_data(ticker, period, indicators) OHLCV for Vega-Lite spec rendering
  • get_financials(ticker, statement, period) IS / BS / CF · annual + quarterly
  • get_estimates(ticker) Consensus + CLSA forecast side-by-side
  • get_filings(ticker, type) 10-K, 10-Q, 6-K, prospectus
Connects: Bloomberg B-Pipe · Refinitiv · SEC EDGAR
Owned by: Market Data team · entitlement-gated per ticker
MCP / 3 · News & sentiment

clsa-news

~12 % calls
  • search_news(query, filters, since) Full-text news search · sector / region / source-aware
  • get_breaking(tickers, sectors) Last 60 min stories · ranked by impact score
  • get_sentiment(ticker, period) Rolling sentiment (FinBERT) · positive / negative / neutral counts
  • summarize_news(ticker, days) Daily / weekly digest with citation anchors
Connects: Bloomberg News · Reuters · SCMP / Caixin (HK + Mainland)
Owned by: News desk · 3-second freshness SLA
MCP / 4 · Valuation models

clsa-models

~10 % calls
  • run_dcf(ticker, assumptions) CLSA-branded DCF with override-able WACC / growth
  • peer_comparison(ticker, peers, metrics) Comp table · P/E, EV/EBITDA, P/B, growth
  • run_screener(criteria) Multi-stock filter · "show me HK tech < 15× P/E with positive cash"
  • scenario(ticker, deltas) Bear / base / bull on key drivers
Connects: Quants library (Python) · Aurora (cached results)
Owned by: Equity Research / Quants team
MCP / 5 · User context

clsa-user

~8 % calls
  • get_watchlist() Tickers the analyst follows · sector tagged
  • add_watchlist(ticker) Write — requires explicit user confirm in chat
  • set_preference(key, value) Language, default region, output verbosity
  • get_recent_history(days) Past Q&A summaries for episodic recall
Connects: Aurora · ElastiCache (sessions)
Owned by: AI platform team · read-only by default
Why MCP
One protocol. Define the tool once; every model lane (Sonnet, Opus, Haiku, Nova Pro, Llama 4, Qwen3, DeepSeek) consumes the same spec via native MCP support — no per-vendor adapter code.
Why split into 5
Different teams own different surfaces. Research desk owns notes; Market Data owns Bloomberg proxy; News desk owns the wire feed; Quants own the valuation models. MCP draws clean ownership boundaries — and clean audit trails.
Why now
The protocol stuck. Anthropic native, OpenAI compat, Google compat as of 2026. Building on MCP means future agents — and any model we add to the router — inherit our entire toolkit free.
The router · 7 lanes

Sonnet handles half the volume. The other six pick up the niches.

Each lane has a measurable specialty on the CLSA golden eval (200 questions, senior-analyst-rated). Traffic share is set by that benchmark and re-tuned monthly. Western frontier + open-weights served via AWS Bedrock ap-southeast-1; Chinese models served via Alibaba Cloud DashScope (HK / SG). Both providers support prompt caching — the Sonnet lane is where it compounds hardest.

Lane Model Provider Share Strength $ / mo $ / query
Default · synthesis Claude Sonnet 4.6 Bedrock SG 52 % Multi-doc EN synthesis · agentic flows · prompt-cache amplifier $8,330 $0.027
Cheap path · lookups Claude Haiku 4.5 Bedrock SG 15 % Fast factual lookups · "where did we mention X?" · summarization $800 $0.009
Mainland CN research Qwen3 Max Alibaba HK 10 % Native ZH · A-share / CSRC filings · prompt cache supported $864 $0.014
Structured · numeric Amazon Nova Pro Bedrock SG 8 % JSON-schema outputs · MCP tool chains · cheap structured calls $346 $0.007
Long-context lookups Llama 4 Scout Bedrock SG 8 % 10 M context window · whole-report scans · cheapest input tokens $86 $0.002
CN ↔ EN · cheap heavy DeepSeek V4 Alibaba HK 5 % Translation · cheap reasoning · 90 % off-peak discount window $74 $0.002
Premium escalation Claude Opus 4.7 Bedrock SG 2 % DCF reviews · target-price challenges · audit-grade reasoning $1,596 $0.133
Generation total: $12,096 / mo post-cache · ≈ $0.020 / query average. Sonnet handles half the volume but absorbs ~69 % of the LLM bill — which is exactly why the prompt-cache lever (next section) is the highest-leverage thing we ship.
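The lane economics tie out from the token math in the pricing footnote (10.5 k in + 1 k out per query, 70 % cache-eligible when warm). A worked check on the Sonnet lane; cached-read and input rates are from the cache section, the $15/M output rate is an assumption at Anthropic list.

IN_TOK, OUT_TOK, CACHE_HIT = 10_500, 1_000, 0.70     # token math, footnote
P_IN, P_CACHED, P_OUT = 3.00, 0.30, 15.00            # Sonnet $ per 1M tokens
                                                     # (output rate assumed)
per_query = (IN_TOK * (1 - CACHE_HIT) * P_IN         # cold input
             + IN_TOK * CACHE_HIT * P_CACHED         # cached reads at 90% off
             + OUT_TOK * P_OUT) / 1_000_000          # output dominates
print(f"${per_query:.3f} / query")                   # -> $0.027, as in the table

lane_month = per_query * 600_000 * 0.52              # 52% lane share
print(f"${lane_month:,.0f} / mo")                    # -> ~$8.3k of the $12.1k total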
Three questions the CEO will ask · answer them now

Expand the mix? · Gemma 4? · Caching everywhere?

The lineup we just drew is one defensible answer — these are the three pushbacks it has to survive. Numbers are for our 600 k msg/mo baseline (200 / analyst) at mid-2026 list pricing.

Q1 · Provider mix · Phase 3

Expand to Claude + DeepSeek + OpenAI + Gemini + MiniMax?

Each one wins a niche on our golden eval. GPT-5 is best on numeric / DCF challenges. Gemini 3.1 Flash beats Llama 4 Scout on multimodal long-context (charts inside PDFs). MiniMax M2 edges Sonnet on multi-tool agentic loops. But every additional provider is a contract, ZDR addendum, billing surface, and outage path.

Lever Δ vs current
Premium-query quality + 3 – 5 %
Multimodal long-context unlocks PDFs
Vendor count 2 → 5
Egress paths to manage + 3
$ delta / yr ~ flat
Verdict: Phase 1 ships Bedrock + Alibaba (current). Trigger the expansion in Phase 3 only if the golden eval shows a >5 % quality gap — or if Compliance signs off on multimodal PDFs going to Vertex.
Q2 · "Free" weights Myth-bust

Is Gemma 4 actually free? Should we self-host it?

The weights are free under Google's Gemma license. The deployment is not. A single L40S GPU runs Gemma 4 27B at FP8 — and that L40S costs ~$1,360 / mo on EC2 before EBS, networking, vLLM ops, monitoring, or HA.

Cheap-path lane · 48 k queries / mo $ / mo vs Bedrock
Llama 4 Scout · Bedrock (current) $86 1 ×
Gemma 4 27B · g6e.4xlarge · 1 × L40S $1,500 17 ×
Gemma 4 27B · 2 × L40S (HA + headroom) $3,000 35 ×
Gemma 4 27B · on owned H20s (sunk cost) $0 * 0 × (* marginal cost only; the capex is already spent)

And the quality argument is also weak: Gemma 4 27B trails Llama 4 Scout (109 B MoE) and DeepSeek V4 on most evals — there are stronger open-weights options if we ever do self-host.

Verdict: Self-hosting open weights on rented GPUs is 17 – 35 × more expensive than calling the same class of model on Bedrock. Gemma 4 only fits when we already own the GPUs (Phase 3 H20 hybrid) — and even then Llama 4 or DeepSeek V4 are stronger picks. Tell the CEO: "free weights" is real; "free deployment" is not.
Q3 · Cache support · Confirmed

Does prompt caching work everywhere we'd send traffic?

Yes — but the discount and the API differ. Bedrock matches Anthropic-direct on Claude at full parity. OpenAI runs automatic prefix caching. Gemini wants explicit cache management. Self-hosting on vLLM gets a free in-process KV cache — no list-price discount, but the FLOPs savings are real.

Provider · model Mechanism % off cached tokens
Bedrock · Claude Explicit cache_control blocks 90 %
Bedrock · Nova Pro Explicit blocks ~ 75 %
Bedrock · Llama 4 Implicit prefix · automatic ~ 50 %
OpenAI direct · GPT-5 Auto prefix · zero config 50 %
Vertex · Gemini 3.1 Flash Context Caching API ~ 75 %
Alibaba · Qwen / DeepSeek Implicit prefix ~ 50 %
Self-host vLLM In-process KV cache no $-discount
Verdict: Cache works everywhere meaningful — but our Sonnet-heavy mix lands on the best cache deal in the market. That's not coincidence; it's why Sonnet is the default lane.
If we break the AWS-only rule once

Best single swap: Gemini 3.1 Flash via Vertex into the long-context lane. Beats Llama 4 Scout on multimodal (PDF charts), ties on text — at similar cost. One extra vendor; opens multimodal.

If we go full multi-provider

The 5-vendor lineup (Claude + GPT-5 + Gemini + DeepSeek + MiniMax) tops the eval ~3 – 5 %. Roughly the same $. Operationally ~3 × the load (auth, billing, ZDR, on-call). Not a Phase 1 bet.

If the CEO insists on Gemma 4

Tell them: we'll deploy it the day we own the metal. On rented AWS GPUs it's 17 – 35 × the Bedrock equivalent. The maths only flips when the GPUs are paid-for.

The cost lever · 4 caches

Each layer chops a different cost. Stack them.

1

Embedding cache · ElastiCache

hash(query) → vector. 7-day TTL. Catches duplicate phrasing across analysts asking variants of the same thing.

35 %
hit rate
saves $3 / mo on embedding · really saves the 25 ms of latency
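Layer 1 above is ~10 lines of Redis. A sketch, assuming a placeholder ElastiCache endpoint and key scheme:

import hashlib, json
import redis

r = redis.Redis(host="research-cache.internal")      # placeholder endpoint

def cached_embed(query: str, embed_fn) -> list[float]:
    key = "emb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if (hit := r.get(key)) is not None:
        return json.loads(hit)                       # ~35% of calls end here
    vec = embed_fn(query)                            # Cohere embed-v3 on Bedrock
    r.setex(key, 7 * 86_400, json.dumps(vec))        # 7-day TTL
    return vec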
2

Retrieval cache · ElastiCache

hash(embedding + filters) → top-K doc IDs. 10-min TTL — keeps fresh notes surfacing fast.

25 %
hit rate
saves on OpenSearch + Bedrock Rerank round-trips · ~150 ms p50
3

Bedrock + Alibaba prompt cache · Biggest saver

Bedrock supports prompt caching for Claude (Sonnet / Opus / Haiku) + Nova at full Anthropic-parity discounts — $0.30/M cached vs $3/M uncached on Sonnet. Alibaba DashScope offers comparable cached-read pricing for Qwen3 Max and DeepSeek. ~70 % of our 10.5 k input tokens hit cache on warm sessions.

90 %
input savings
saves ~$3.5 k / mo · zero quality cost · just a header to enable
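Enabling it is one explicit cache_control block on the static prefix (the Q3 table above: explicit blocks for Claude via Bedrock). A sketch of the Bedrock InvokeModel request body, with placeholder prompt variables:

SYSTEM_PROMPT, chunks, question = "...", "...", "..."     # placeholders

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1500,
    "system": [{
        "type": "text",
        "text": SYSTEM_PROMPT + chunks,               # the ~10.5k-tok static prefix
        "cache_control": {"type": "ephemeral"},       # re-read at $0.30/M, not $3/M
    }],
    "messages": [{"role": "user", "content": question}],  # only this stays cold
}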
4

Semantic answer cache

Cosine ≥ 0.96 → cached answer. 24-hr TTL · invalidated on new ingest in the relevant sector. Only serves on a current-context hit, otherwise bypassed.

15 %
hit rate
saves the full inference cost · ~$900 / mo · be conservative on TTL
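Layer 4 in sketch form, with an in-memory list standing in for the real store: the 0.96 cosine bar, the current-context guard, and sector-scoped invalidation.

import math, time

CACHE: list[dict] = []    # stand-in: {vec, ctx, answer, expires, sector}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(query_vec: list[float], context_hash: str):
    now = time.time()
    for e in CACHE:
        if (e["expires"] > now and e["ctx"] == context_hash   # current-context hit
                and cosine(query_vec, e["vec"]) >= 0.96):     # near-duplicate query
            return e["answer"]                                # ~15% end here
    return None                                               # bypass: full pipeline

def invalidate(sector: str) -> None:       # fired on new ingest in that sector
    CACHE[:] = [e for e in CACHE if e["sector"] != sector]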
Visualization · structured artifacts

Tables and charts come from tools, not markdown.

Asking an LLM to draw an ASCII table is asking for off-by-one errors. Numbers come from get_quote / get_financials as JSON; the model wraps them in prose. The frontend renders proper tables and Vega-Lite charts from the structured tool outputs. Streaming is text + artifacts interleaved.

Pattern A · Table
Tool returns rows
{
  "headers": ["Ticker", "Px", "TP", "Δ%"],
  "rows": [
    ["TCEHY", 412.5, 480, "+16%"],
    ["BABA", 81.2, 110, "+35%"]
  ]
}
Frontend renders an actual <table> with tnum, sortable, exportable to Excel.
Pattern B · Chart
Vega-Lite spec
{
  "mark": "line",
  "data": {"name": "get_chart_data(TCEHY, 1Y)"},
  "encoding": {
    "x": {"field": "date", "type": "temporal"},
    "y": {"field": "close", "type": "quantitative"}
  }
}
Vega-Lite is renderer-agnostic. The model picks the chart type; the data comes from the tool, not the model.
Pattern C · Citation
Inline footnote
"...CLSA reiterated O-PF
[¹note_id=84319, p.4]"
Every claim sourced. Hover footnote → preview the source paragraph. Refusal-to-answer is the default if no citation is available.
The H20 question · the one the room actually wants answered

Six H20s buys 6× the cost for ~1× the user-side gain.

Verdict
Don't buy.
unless Compliance forces — then buy two.
What H20 actually is
China-export silicon · memory-preserved, compute-cut.
H20 H100 H200
BF16 TFLOPS 148 989 989
VRAM (GB) 96 80 141
HBM bw (TB/s) 4.0 3.4 4.8
Decode (memory-bound) competitive baseline +30 %
Prefill (compute-bound) ~6× slower baseline baseline
Memory-bandwidth parity is why H20 inference looks okay on benchmarks. Long-context prefill is where the compute cut hurts — and 50-page reports are exactly that workload.
How many users it carries
8-GPU H20 server · DeepSeek V4 FP8 · realistic numbers.
Sustained output ~2,000 tok/s
Concurrent streaming users (80 tok/s/user) ~25
Daily messages capacity ~70 k
We see ~27 k msg/day at 600 k/mo (200 / analyst), peaking ~5×. 3 servers cover peak with HA + ~20 % headroom. 6 servers would run at ~30 % utilization even at the daily peak — far less the rest of the day.
When H20 becomes rational
Three trigger conditions. Need at least one.
  • Compliance mandates on-prem. CSRC / SFC / internal counsel says research bodies cannot leave HK. Buy 2; run open weights for everything; cloud for nothing sensitive.
  • Fine-tuning becomes a product line. We want to train a CLSA-tuned 70 B model on internal notes. Then 4 – 6 servers earns its keep.
  • Scale jumps 5×. 15 k MAU + agentic tool loops + multi-modal inputs. At that point cloud bills also rise to where hybrid breaks even.
None of these are true today. Re-evaluate quarterly.
1
machine
8 GPU · 768 GB VRAM
tight
Just covers peak. Single point of failure. Pilot only.
~$285 k / yr 5-yr equiv · 1 × throughput
2
machines
16 GPU · 1.5 TB VRAM
right size
Active-active HA. ~20 % headroom for traffic growth or model swaps. If on-prem becomes mandatory, this is the answer.
~$213 k / yr · 2 × throughput · saves $1.05 M capex vs Laurent's quote
6
machines
48 GPU · 4.6 TB VRAM
over-built
Sized for fine-tuning + multi-model + training. For pure inference at our volume: ~30 % peak utilization.
~$640 k / yr · 6 × throughput we don't need
The monthly bill · where every dollar lands

$14,292 a month. Sonnet is ~60 % of it.

Bedrock SG · Alibaba HK · prompt-cache · 70 % warm
Monthly all-in · 600 k msg · 200 / analyst
$14,292 / month
Sonnet 4.6 · Bedrock $8,330
Opus 4.7 · Bedrock $1,596
Cohere Rerank 3 · Bedrock $1,200
Qwen3 Max · Alibaba $864
Haiku 4.5 · Bedrock $800
Nova Pro · Bedrock $346
DeepSeek V4 · Alibaba $74
Llama 4 Scout · Bedrock $86
ECS Fargate + ALB $300
Aurora + ElastiCache $200
OpenSearch · S3 · embed $240
CloudWatch + X-Ray + AMG $250
Sensitivity · how the bill scales with usage per analyst
token-cost models scale linearly · AWS infra ($990 / mo) is fixed
Light · 100 msg / analyst
$30.55
$7,640 / mo · 300 k msg · 3.45 B tok ~0.10 % of one Bloomberg seat
Baseline · 200 msg / analyst ★
$57.15
$14,290 / mo · 600 k msg · 6.90 B tok ~0.19 % of one Bloomberg seat
Heavy · 300 msg / analyst
$83.70
$20,930 / mo · 900 k msg · 10.35 B tok ~0.28 % of one Bloomberg seat
Per query (avg) $0.024
Per analyst · month $4.76
Per analyst · year $57.17
vs 6× H20 over 5 yr save $2.34 M
Sensitivity · what if we're wrong about volume

Cloud wins from 100 to 300 msg / analyst. Hybrid only when MAU triples.

Scenario MAU msg / mo Cloud / yr 2× H20 / yr 6× H20 / yr Best path
Light · 100 msg / analyst 3 k 300 k $92 k $213 k $640 k Cloud
Baseline · 200 msg / analyst 3 k 600 k $171 k $213 k $640 k Cloud
Heavy · 300 msg / analyst 3 k 900 k $251 k $330 k $640 k Cloud
Roll-out to whole firm (200 / user) 10 k 2 M $544 k $420 k $720 k 2× H20 hybrid
External-client agent (300 / user) 30 k 9 M $2.41 M $700 k $890 k 4× H20 hybrid
Compliance forces on-prem · today 3 k 600 k $213 k $640 k 2× H20 hybrid
Cloud-API cost scales near-linearly with messages (output tokens dominate). Self-host has a fixed-cost floor — past ~1.5 M msg/mo the curves cross. At 200 / analyst we're 2.5× below that. Re-decide every 6 months.
Quality × Cost · why the router exists

Each model wins a region. Routing is just respecting that.

[Quality × cost map · x: cheaper → more expensive · y: higher quality · points: Opus 4.7, GPT-5.5, Sonnet 4.6, GPT-5 mini, Qwen3 Max, Haiku 4.5, Nova Pro, DeepSeek V4, Llama 4 Scout, Gemma 4 27B, plus self-host points Gemma 4 on L40S and DeepSeek V4 on H20]

Frontier Claude (Opus, Sonnet) holds the upper-right via Bedrock SG. They cost more and deliver better answers on complex multi-doc finance work — used sparingly, cached aggressively. GPT-5.5 sits at the same tier (priciest, top-quality on numeric reasoning) but lives outside Bedrock — included for reference, not in the Phase-1 router.

Mid-tier specialists hold their own corners — Qwen3 Max on Alibaba (Mainland filings), Haiku 4.5 + Nova Pro on Bedrock (fast factual lookups, JSON-schema tool calls). GPT-5 mini slots near them on cost-quality but, again, outside Bedrock.

Llama 4 Scout sits in the lower-left: cheap, fast, 10 M context. Gemma 4 27B sits below it — slightly weaker on multi-doc reasoning, similar cost via OpenRouter / hosted endpoints. Reference point, not a router lane today. DeepSeek V4 covers cheap CN-EN heavy reasoning via Alibaba.

Self-host points land strictly south-east of the cloud equivalent — same weights, more cost, similar quality. Gemma 4 on a rented L40S is the most expensive way to be mid-tier. That's the trade we'd be buying.

Compliance · risk · BCM

What can break this plan. How we cover each break.

Data residency

All Western-model traffic stays inside Bedrock ap-southeast-1 — AWS contractually does not retain or train on inference data. Chinese-model traffic goes to Alibaba HK / SG with the same enterprise no-retention terms. Vector store + episodic memory live in OpenSearch ap-southeast-1 and never leave the region.

Citation discipline

A hallucinated CLSA target price is a regulatory event. The orchestrator refuses to answer without an MCP-backed citation. Output schema validates the citation field. Eval harness asserts citation-presence on the golden set.
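The gate itself is small. A sketch, assuming the inline-footnote format from the artifacts section and a set of note IDs actually retrieved this turn:

import re

CITE = re.compile(r"\[¹?note_id=(\d+), p\.(\d+)\]")    # format from Pattern C

def gate(answer: str, retrieved_note_ids: set[str]) -> str:
    """Refusal beats hallucination: no resolvable citation, no answer."""
    cites = CITE.findall(answer)
    if not cites:
        return "I can't cite a CLSA source for this, so I won't answer."
    for note_id, _page in cites:
        if note_id not in retrieved_note_ids:   # must point at real context
            return "A citation failed to resolve to retrieved research, so I won't answer."
    return answer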

Provider outage

Each lane has a fallback. If Bedrock SG flakes on Sonnet → cross-region inference to ap-northeast-1 (Tokyo) at parity pricing. If Bedrock-wide outage → degrade to Alibaba (Qwen3 Max) with a banner. DeepSeek V4 stays as the open-weights last resort, callable on the future H20 box if compliance ever forces full self-host.

Entitlement leakage

Every retrieved chunk passes through the entitlement layer before context assembly. A junior analyst's prompt cannot pull a senior-only note, even if the embedding is "close." Filter happens pre-rerank — not post.
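Pre-rerank filtering is a one-pass drop, not a ranking tweak. A sketch with an assumed tier model and field names:

TIER_RANK = {"junior": 0, "senior": 1, "restricted": 2}

def entitlement_filter(chunks: list[dict], user_tier: str) -> list[dict]:
    """Drop anything above the caller's tier BEFORE reranking or prompting."""
    rank = TIER_RANK[user_tier]
    return [c for c in chunks if TIER_RANK[c["min_tier"]] <= rank]

# Pipeline order: top-50 -> entitlement_filter -> Rerank 3 -> top-8.
# Filtering after rerank would still send restricted text to the reranker
# and could leave fewer than 8 permitted chunks for the prompt.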

Eval drift

Golden eval (200 senior-rated questions) runs nightly. Each model's score is tracked over time; a regression triggers a router-rule recompute. Without this discipline, "vibes" wins every argument and the cost story collapses.

Note freshness

New notes appear in retrieval within 5 min of publish — embed pipeline is event-driven on the publish webhook. Retrieval-cache TTL is 10 min, so freshness has a tight SLA.
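The 5-minute SLA is one event-driven path: publish webhook → chunk → embed → index → invalidate. A sketch composing the helpers from earlier sketches (chunk_report, cached_embed, invalidate), with stand-ins for the rest:

def fetch_note(note_id: str) -> dict:      # stand-in: S3 body + metadata
    return {"report_id": note_id, "sections": [], "sector": "internet"}

def bedrock_embed(text: str) -> list[float]:
    return [0.0] * 1024                    # stand-in: Cohere embed-v3 call

def opensearch_bulk_index(pairs) -> None:  # stand-in: bulk index API
    pass

def on_publish(event: dict) -> None:
    """Publish webhook -> searchable in well under the 5-min SLA."""
    note = fetch_note(event["note_id"])
    chunks = chunk_report(note["report_id"], note["sections"])       # chunking sketch
    vectors = [cached_embed(c.text, bedrock_embed) for c in chunks]  # cache sketch
    opensearch_bulk_index(zip(chunks, vectors))
    invalidate(note["sector"])             # semantic-answer-cache sketch
    # retrieval cache TTL is 10 min, so worst-case staleness stays bounded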

Roadmap · how we get there

Four phases, four months. Hybrid decision lives in phase 3.

Phase 0 · M0–1
Foundation
Embed corpus with Cohere embed-v3 (Bedrock) · stand up OpenSearch in ap-southeast-1 · basic RAG with Sonnet 4.6 · golden eval v1 (200 Q) · pilot 50 analysts.
Phase 1 · M1–3
Production
Multi-model router (Bedrock + Alibaba) · prompt-cache enabled · 3 MCP servers (Research + Market Data + News) · structured artifacts + citations · roll to 1 k users.
Phase 2 · M3–6
Scale
4-layer cache · episodic memory · Models + User MCP · semantic answer cache · roll to all 3 k analysts · CN-language polish (Qwen + DeepSeek).
Phase 3 · M6–12
Optimize · decide
Measured re-decision: hybrid 2× H20 if compliance / scale demands · embedding fine-tune on CLSA query log · evaluate fine-tuned 70 B for the cheap-path lane.
For the steering meeting

Six questions to settle before phase 1 ships.

Compliance sign-off

Does Compliance accept research-note bodies traversing AWS Bedrock SG + Alibaba HK with no-retention contracts? If yes — Phase 1 ships cloud. If no — Phase 1 still ships cloud for non-sensitive flows; 2× H20 procurement starts in parallel for the sensitive subset.

Eval ownership

Who owns the 200-Q golden set and the nightly score? Without a named owner the router devolves to "vibes" inside a quarter.

MCP team boundaries

Research desk, Market Data, News desk, and Quants team each own one MCP server (User MCP stays with the AI platform team). Are those teams staffed for this, or does the AI team carry all five through Phase 2?

Visibility scope

Should chat history be visible to a manager / mentor for analyst training? Yes / no answers shape episodic memory's privacy boundary.

Budget envelope

$171 k/yr is the recommended path at 200 msg/analyst. Is the firm comfortable above $250 k/yr if usage spikes to 300/analyst (or doubles to 400), or do we put a router-level budget cap (escalation lane disabled past $X)?

External clients

Is this analyst-only, or do we eventually expose this to institutional buy-side clients? "Yes" pushes us into Phase 3 (10 k+ MAU) and changes the GPU answer.

Cost figures use mid-2026 list prices from the Pricing tab. All Western models priced at AWS Bedrock ap-southeast-1 (Singapore) list rates; Bedrock matches Anthropic-direct on Claude (Sonnet / Opus / Haiku) at full prompt-cache parity, and Nova / Llama / Mistral are first-party Bedrock pricing. Chinese models priced at Alibaba Cloud DashScope (HK) list rates with prompt-cache discount applied where supported. Token math: 10.5 k input + 1 k output per query (system + 8 reranked chunks + 6-turn history + structured output budget), 70 % cache eligibility on warm sessions. Self-host annualized cost prorates Laurent's 6× quote ($1.7 M capex / 5 yr + $296 k/yr opex) to per-machine and rebuilds opex at sub-linear scaling. H20 throughput numbers assume DeepSeek V4 671 B / 37 B-active FP8 with vLLM-class serving. Cross-region inference (Bedrock CRI) treated at parity. Pricing dashboard: /strategy and /gpu-machines.