Executive brief · CLSA Research Q&A · 3 k analysts · Hong Kong · Recommendation

Best model for the job.
Skip the GPUs.

We can give 3 k analysts a Perplexity-Finance-grade research assistant for ~0.2 % of one Bloomberg seat per analyst — by routing each question to the best model on AWS Bedrock or Alibaba Cloud, instead of buying $1.7 M of GPUs that would sit at 30 % utilization.

Bedrock (SG) · Alibaba (HK) · No GPUs bought
$57.17
per analyst · per year · all-in.  ~0.2 % of one Bloomberg seat.
Monthly bill
$14,292
Annual bill
$171 k
vs Perplexity Finance
3.1× cheaper
Corpus
300 k
research reports indexed
Active users
3 k
monthly · 2× today's 1.5 k
Messages
600 k
/ month · 200 per analyst
Tokens
6.90 B
/ month · ~70 % cache-eligible
Per analyst / yr
$57.17
~0.2 % of one Bloomberg seat
5-yr saving vs 6× H20
$2.34 M
cloud vs $640 k/yr quote
Assumptions: 3 k analysts × 200 msg / month = 600 k msg · $14,292 / mo · $171 k / yr

Why these defaults: 50 k registered users, 1.5 k MAU today. The 3 k baseline is a 2× target — headroom for the assistant becoming a daily habit, not a one-time sizing for today's MAU.
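The headline unit economics follow directly from these assumptions; a quick arithmetic check, using only figures quoted on this page:

```python
# Sanity-check of the headline figures, using only numbers quoted above.
analysts = 3_000            # baseline MAU (2x today's 1.5 k)
msgs_per_analyst = 200      # messages per analyst per month
monthly_bill = 14_292       # USD, all-in

messages = analysts * msgs_per_analyst    # messages per month
annual = monthly_bill * 12                # annual bill, USD
per_analyst_yr = annual / analysts        # all-in cost per analyst per year
per_query = monthly_bill / messages       # average cost per query

print(messages)                     # 600000
print(annual)                       # 171504 -> "$171 k"
print(round(per_analyst_yr, 2))     # 57.17
print(round(per_query, 3))          # 0.024
```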

The decision · in one picture

Three options. One answer.

Same product. Same RAG plumbing. Same UX. The only thing that changes is where the model weights live and who writes the cheque.

Recommended
$171k
/ year · cloud-first router

AWS Bedrock + Alibaba Cloud

  • Best model on every query class
  • No hardware to buy or maintain
  • Switch models in a config file as new ones ship
  • Zero capex · zero hardware risk
Plan B
$326k
/ year · 2× H20 hybrid

Buy 2 H20 servers

  • Only if Compliance forbids cloud egress
  • ~$570 k capex · 5-yr TCO $1.07 M
  • Runs open-weights on-prem · cloud still used for premium queries
  • Quality regresses 5–10 % on multi-doc work
$685k
/ year · 6× H20 quote

Buy 6 H20 servers

  • $1.7 M capex · 5-yr TCO $3.20 M
  • 3× over-sized for inference at our volume
  • Lower quality on premium queries vs frontier cloud
  • $2.34 M lost over 5 years vs cloud-first
5-year total cost · the only chart that matters

The cheap option is also the smartest.

Cloud + router
recommended
$857 k
$171 k
/ year avg
2× H20 hybrid
if compliance forces
$1.07 M
$213 k
/ year avg
6× H20 (Laurent's quote)
over-built · $1.7 M capex
$3.20 M
$640 k
/ year avg
Where every dollar goes

$14,292 a month. Sonnet absorbs ~60 %.

$14,292
/ month
Claude Sonnet 4.6 · default lane · 58.3 % · $8,330
Claude Opus 4.7 · premium escalation · 11.2 % · $1,596
Cohere Rerank 3 · retrieval reranker · 8.4 % · $1,200
Qwen3 Max · Mainland CN research · 6.0 % · $864
Claude Haiku 4.5 · cheap fast lookups · 5.6 % · $800
Nova Pro + Llama 4 Scout + DeepSeek V4 · 3.5 % · $506
AWS infra · OpenSearch + Aurora + ECS + obs · 6.9 % · $990
Per analyst / mo
$4.76
Per query (avg)
$0.024
Per analyst / yr
$57.17
Tokens · monthly
6.90 B
Anatomy of an answer · end to end

Four steps. From question to cited answer.

Step 1 · Find
Retrieve the source pages
OpenSearch hybrid retrieval (dense + BM25) over the 300 k-report archive, reranked by Cohere to the top 8 most-relevant pages.
→ 8 grounded chunks · ~145 ms
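The fusion step can be sketched with reciprocal rank fusion (RRF) — one common way to merge a dense and a BM25 ranking before reranking. The fusion method and the toy page ids here are assumptions; only "dense + BM25, reranked to top 8" is stated above:

```python
# Sketch of the hybrid-retrieval step: fuse dense and BM25 rankings with
# reciprocal rank fusion, then keep the top candidates for the reranker.
# Toy page ids stand in for pages from the 300 k-report archive.

def rrf_fuse(rankings, k=60):
    """RRF: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["p12", "p7", "p3", "p44"]   # from the vector index
bm25_hits  = ["p7", "p3", "p9", "p12"]    # from the keyword index

# Hand the fused top-8 to the Cohere reranker for final ordering.
candidates = rrf_fuse([dense_hits, bm25_hits])[:8]
print(candidates)
```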
Step 2 · Route
Pick the right model
A Gemini 3.1 Flash-Lite classifier tags the question by language, complexity, and tool-need — Sonnet for synthesis, Qwen for Mainland CN, Haiku for fast lookups.
→ 1 of 7 lanes · < 50 ms
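Mechanically, the routing step reduces to a tag → lane lookup. A minimal sketch — the tag names are illustrative, the seven lane models are the ones in the cost breakdown on this page:

```python
# Sketch of the lane router: the classifier emits a tag, a static table
# maps it to a model lane. Tag taxonomy is an assumption for illustration.
LANES = {
    "synthesis":    "claude-sonnet-4.6",   # default lane
    "premium":      "claude-opus-4.7",     # premium escalation
    "cn_research":  "qwen3-max",           # Mainland CN research
    "fast_lookup":  "claude-haiku-4.5",    # cheap fast lookups
    "bulk":         "nova-pro",
    "long_context": "llama-4-scout",
    "translation":  "deepseek-v4",
}

def route(tag: str) -> str:
    # Unknown tags fall back to the default (Sonnet) lane.
    return LANES.get(tag, LANES["synthesis"])

print(route("cn_research"))  # qwen3-max
print(route("mystery"))      # claude-sonnet-4.6
```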
Step 3 · Answer
Generate with live data
The model streams its response, calls MCP tools (Bloomberg, Refinitiv, CLSA models) for live numbers, and emits structured tables and charts as JSON artifacts.
→ streamed text + tool outputs · ~3 s
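On the wire, a structured artifact might look like the event below. The field names and the MCP tool name are assumptions for illustration, not the real schema:

```python
import json

# Illustrative shape of one streamed event carrying a tool result as a
# structured table artifact. Field names and tool name are hypothetical.
event = {
    "type": "artifact",
    "tool": "market-data/quote",   # hypothetical MCP tool
    "artifact": {
        "kind": "table",
        "columns": ["ticker", "last", "chg_pct"],
        "rows": [["0005.HK", 71.20, 0.8]],
    },
}

wire = json.dumps(event)           # what the client receives in the stream
decoded = json.loads(wire)
print(decoded["artifact"]["kind"])  # table
```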
Step 4 · Cite
Anchor every claim
Each fact links back to the source note + page anchor. No citation, no answer — that's policy, not a feature. A hallucinated CLSA target would be a regulatory event.
→ refusal default · auditable trail
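The "no citation, no answer" policy is a hard gate, not a prompt instruction; a minimal sketch, with the claim/anchor shapes assumed for illustration:

```python
# Sketch of the citation gate: every claim in a drafted answer must carry
# a source anchor, otherwise the system refuses. Shapes are illustrative.

def enforce_citations(claims):
    """Release the answer only if every claim is anchored; else refuse."""
    if all(c.get("source") and c.get("page") for c in claims):
        return "ANSWER"
    return "REFUSE"  # refusal is the default; the gap is logged for audit

draft = [
    {"text": "FY25 revenue grew 14%", "source": "CLSA-2026-0113", "page": 4},
    {"text": "Target price HK$92", "source": None, "page": None},
]
print(enforce_citations(draft))  # REFUSE
```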
The plan · 12 months · four phases

Pilot in 60 days. Ship to 3 k by month 6. Re-decide GPUs in month 9.

Phase 1 · Month 0–1
Foundation
Embed the 300 k-report corpus · stand up retrieval · wire Sonnet 4.6 · build the 200-question golden eval · pilot 50 senior analysts.
Phase 2 · Month 1–3 · current
Production
Multi-model router live (Sonnet + Haiku + Qwen + 4 more) · prompt-cache enabled · Research + Market-Data + News MCP servers · roll to 1 k users.
Phase 3 · Month 3–6
Scale to 3 k
All 4 caches stacked · episodic memory · Models + User MCP · CN-language polish · roll to all 3 k analysts.
Phase 4 · Month 6–12
Re-decide
Measured re-decision: hybrid 2× H20 only if compliance / scale demands · embedding fine-tune on real query log · evaluate next-gen models as they ship.
If you're wondering

Eight questions asked. Eight answered.

Cost, risk, alternatives, the GPU thing, the Gemma thing, what happens during an outage.

Why are we not buying GPUs?

At 3 k analysts averaging 200 messages each, we expect ~600 k messages a month. Cloud APIs (Bedrock + Alibaba) cost ~$171 k a year for that workload. Six H20 servers cost ~$640 k a year (capex amortized + opex) for the same load — and they sit at 30 % utilization most of the day.

We'd be paying nearly 4× more for measurably worse quality on premium queries (open-weights models still trail Sonnet 4.6 / Opus 4.7 on multi-doc finance reasoning) and giving up the ability to swap to better models as they ship.

What if Compliance says research notes can't leave Hong Kong?

We have a Plan B already drawn: 2 H20 servers (not 6) running open-weights models on-prem for sensitive queries, with cloud still used for premium queries that pass legal review. That's $213 k/yr — half the cost of the 6-machine quote, with ~20 % headroom for growth.

We need Compliance to give a clear ruling before Phase 1 ships. The default plan assumes they accept Bedrock SG + Alibaba HK with no-retention contracts.

Why not just use ChatGPT Enterprise?

ChatGPT Enterprise gives analysts a great chat UX, but it doesn't know about CLSA's 300 k-report archive, our valuation models, our target prices, or our entitlement rules. This system is the answer to "Q&A on our own data" — that's the thing ChatGPT can't do.

GPT-5 itself can be one of the models in our router (we considered it; it's not on AWS Bedrock so we'd add a vendor) — but it can't be the whole product.

Can we self-host Gemma 4 / open-weights to save money?

The weights are free; the deployment isn't. Gemma 4 27B on a single AWS L40S GPU runs ~$1,500/mo. The same workload via Bedrock Llama 4 Scout costs ~$86/mo. That's 17× more expensive — and quality is also lower than the cloud model.

Self-hosting only makes sense when the GPUs are already paid-for (the 2× H20 hybrid scenario). On rented cloud GPUs, it's strictly worse.

Is AWS Bedrock as good as calling Anthropic / OpenAI directly?

For Claude: yes, parity. Bedrock Singapore matches Anthropic-direct pricing for Sonnet / Opus / Haiku, including the full 90 %-off prompt-cache discount. For Amazon Nova and Llama 4: same provider (AWS), so trivially yes.

The trade-off is that GPT-5, Gemini, and MiniMax aren't on Bedrock. We're not using them in the recommended Phase-1 lineup. If eval shows we're losing >5 % quality by skipping them, we add Vertex / OpenAI direct in Phase 3.

What's our risk if Bedrock or Alibaba has an outage?

Each lane has a fallback. Sonnet → cross-region inference to Tokyo (Bedrock CRI, parity pricing). Bedrock-wide outage → degrade to Alibaba (Qwen3 Max) with a banner. Alibaba outage → degrade to Bedrock-only with the cheap-path lanes routed to Haiku.

We've never had both providers down simultaneously. If we ever did, the system shows a status banner and falls back to "search-only" mode (retrieval works, generation is paused) until a provider returns.
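The degradation ladder described above reduces to an ordered provider list per lane; a minimal sketch, with illustrative provider ids:

```python
# Sketch of per-lane failover: try providers in order, and degrade to
# search-only mode (retrieval up, generation paused) if all are down.
FALLBACKS = {
    # Sonnet lane: Bedrock SG -> Tokyo cross-region inference -> Alibaba Qwen
    "sonnet": ["bedrock-sg", "bedrock-tokyo-cri", "alibaba-qwen3-max"],
}

def call(lane, is_up):
    for provider in FALLBACKS[lane]:
        if is_up(provider):
            return f"answered via {provider}"
    return "search-only mode"  # status banner shown until a provider returns

print(call("sonnet", lambda p: p == "alibaba-qwen3-max"))
# -> answered via alibaba-qwen3-max
print(call("sonnet", lambda p: False))
# -> search-only mode
```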

When do we revisit this decision?

Quarterly, but specifically in Month 9 (Q4 2026) we re-evaluate the GPU question with real usage data. Triggers for re-opening:

(1) Compliance changes the data-residency rules. (2) Volume crosses 6 M messages/month (10× our baseline). (3) A new frontier model launches that we can only run on-prem. (4) Total cloud spend crosses $400 k/yr.

What does the steering committee actually need to approve?

Four things, listed in the next section: the $171 k/yr Phase-1 budget, the Bedrock + Alibaba no-retention contracts (Compliance), the named owner of the golden eval (Research desk), and the Q4 2026 re-decision date for the GPU question.

Quality × cost · the model landscape

Each model wins a region. Routing is just respecting that.

[Chart · quality (↑) vs cost (cheaper ← → more expensive) · cloud points: Opus 4.7, GPT-5.5, Sonnet 4.6, GPT-5 mini, Qwen3 Max, Haiku 4.5, Nova Pro, DeepSeek V4, Llama 4 Scout, Gemma 4 27B · self-host points: Gemma 4 on L40S, DeepSeek V4 on H20]

Frontier (top-right): Opus 4.7 + Sonnet 4.6 (Bedrock SG) and GPT-5.5 (OpenAI direct, reference only) cluster at the highest quality. Used sparingly, cached aggressively.

Mid-tier specialists: Qwen3 Max for Mainland filings, Haiku + Nova Pro for fast factual lookups, GPT-5 mini as an OpenAI-direct alternative.

Cheap path (lower-left): Llama 4 Scout (10 M context), DeepSeek V4 (CN-EN translation), Gemma 4 27B (open weights). All cloud-priced.

Self-host points (muted) sit south-east of the cloud equivalent — same weights, more cost, similar quality. That's the trade we'd be buying.

For the steering meeting

Four decisions. Then we ship.

Phase 1 budget
$171 k / yr
1 · Approve cloud-first + budget
$171 k/yr on AWS Bedrock SG + Alibaba HK · 2× current MAU baseline. Authorize Phase 1 ship to 1 k users by month 3.
2 · Compliance sign-off
Bedrock + Alibaba no-retention contracts reviewed and accepted. Confirms research notes can traverse both providers.
3 · Name the eval owner
Research desk owns the 200-question golden set. Without a named owner, the router devolves to "vibes" inside a quarter.
4 · Set the re-decision date
Q4 2026 GPU re-evaluation locked on the calendar. Triggers above also re-open the question early if hit.

The numbers on this page are mid-2026 list prices. Western models priced at AWS Bedrock ap-southeast-1; Chinese models priced at Alibaba Cloud DashScope HK, with prompt-cache discounts applied where supported (Bedrock Claude at 90 %, Nova at 75 %, Llama at 50 %; Alibaba Qwen / DeepSeek at ~50 %). Self-host annualized cost prorates Laurent's 6× quote ($1.7 M capex / 5 yr + $296 k/yr opex) per machine.

For the full architecture, model lineup, MCP design, cache strategy, GPU deep-dive, and all working math, see the technical brief →.