Executive brief · CLSA Research Q&A · 3 k analysts · Hong Kong · Recommendation

Best model for the job.
Skip the GPUs.

We can give 3 k analysts a Perplexity-Finance-grade research assistant for ~0.2 % of one Bloomberg seat per analyst — by routing each question to the best model on AWS Bedrock or Alibaba Cloud, instead of buying $1.7 M of GPUs that would sit at 30 % utilization.

Bedrock (SG) · Alibaba (HK) · No GPUs bought
$57.17
per analyst · per year · all-in.  ~0.2 % of one Bloomberg seat.
Monthly bill
$14,292
Annual bill
$171 k
vs Perplexity Finance
3.1× cheaper
Corpus
300 k
research reports indexed
Active users
3 k
monthly · 2× today's 1.5 k
Messages
600 k
/ month · 200 per analyst
Tokens
6.90 B
/ month · ~70 % cache-eligible
Per analyst / yr
$57.17
~0.2 % of one Bloomberg seat
5-yr saving vs 6× H20
$2.34 M
cloud vs $640 k/yr quote
Assumptions: 3 k analysts × 200 msg / month = 600 k msg · $14,292 / mo · $171 k / yr

Why these defaults: 50 k registered users, 1.5 k MAU today. The 3 k baseline is a 2× target — headroom for the assistant becoming a daily habit, not a one-time sizing for today's MAU.
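The headline unit economics follow directly from these assumptions; a quick arithmetic check, using only figures quoted on this page:

```python
# Sanity-check of the headline figures, using only numbers quoted above.
analysts = 3_000            # baseline MAU (2x today's 1.5 k)
msgs_per_analyst = 200      # messages per analyst per month
monthly_bill = 14_292       # USD, all-in

messages = analysts * msgs_per_analyst    # messages per month
annual = monthly_bill * 12                # annual bill, USD
per_analyst_yr = annual / analysts        # all-in cost per analyst per year
per_query = monthly_bill / messages       # average cost per query

print(messages)                     # 600000
print(annual)                       # 171504 -> "$171 k"
print(round(per_analyst_yr, 2))     # 57.17
print(round(per_query, 3))          # 0.024
```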

The decision · in one picture

Three options. One answer.

Same product. Same RAG plumbing. Same UX. The only thing that changes is where the model weights live and who writes the cheque.

Recommended
$171k
/ year · cloud-first router

AWS Bedrock + Alibaba Cloud

  • Best model on every query class
  • No hardware to buy or maintain
  • Switch models in a config file as new ones ship
  • Zero capex · zero hardware risk
Plan B
$326k
/ year · 2× H20 hybrid

Buy 2 H20 servers

  • Only if Compliance forbids cloud egress
  • ~$570 k capex · 5-yr TCO $1.07 M
  • Runs open-weights on-prem · cloud still used for premium queries
  • Quality regresses 5–10 % on multi-doc work
$685k
/ year · 6× H20 quote

Buy 6 H20 servers

  • $1.7 M capex · 5-yr TCO $3.20 M
  • 3× over-sized for inference at our volume
  • Lower quality on premium queries vs frontier cloud
  • $2.34 M lost over 5 years vs cloud-first
5-year total cost · the only chart that matters

The cheap option is also the smartest.

Cloud + router
recommended
$857 k
$171 k
/ year avg
2× H20 hybrid
if compliance forces
$1.07 M
$213 k
/ year avg
6× H20 (Laurent's quote)
over-built · $1.7 M capex
$3.20 M
$640 k
/ year avg
Where every dollar goes

$14,292 a month. Sonnet absorbs ~60 %.

$14,292
/ month
Claude Sonnet 4.6 · default lane · 58.3 % · $8,330
Claude Opus 4.7 · premium escalation · 11.2 % · $1,596
Cohere Rerank 3 · retrieval reranker · 8.4 % · $1,200
Qwen3 Max · Mainland CN research · 6.0 % · $864
Claude Haiku 4.5 · cheap fast lookups · 5.6 % · $800
Nova Pro + Llama 4 Scout + DeepSeek V4 · 3.5 % · $506
AWS infra · OpenSearch + Aurora + ECS + obs · 6.9 % · $990
Per analyst / mo
$4.76
Per query (avg)
$0.024
Per analyst / yr
$57.17
Tokens · monthly
6.90 B
Anatomy of an answer · end to end

Four steps. From question to cited answer.

Step 1 · Find
Retrieve the source pages
OpenSearch hybrid retrieval (dense + BM25) over the 300 k-report archive, reranked by Cohere to the top 8 most-relevant pages.
→ 8 grounded chunks · ~145 ms
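The fusion step can be sketched with reciprocal rank fusion (RRF) — one common way to merge a dense and a BM25 ranking before reranking. The fusion method and the toy page ids here are assumptions; only "dense + BM25, reranked to top 8" is stated above:

```python
# Sketch of the hybrid-retrieval step: fuse dense and BM25 rankings with
# reciprocal rank fusion, then keep the top candidates for the reranker.
# Toy page ids stand in for pages from the 300 k-report archive.

def rrf_fuse(rankings, k=60):
    """RRF: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["p12", "p7", "p3", "p44"]   # from the vector index
bm25_hits  = ["p7", "p3", "p9", "p12"]    # from the keyword index

# Hand the fused top-8 to the Cohere reranker for final ordering.
candidates = rrf_fuse([dense_hits, bm25_hits])[:8]
print(candidates)
```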
Step 2 · Route
Pick the right model
A Gemini 3.1 Flash-Lite classifier tags the question by language, complexity, and tool-need — Sonnet for synthesis, Qwen for Mainland CN, Haiku for fast lookups.
→ 1 of 7 lanes · < 50 ms
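Mechanically, the routing step reduces to a tag → lane lookup. A minimal sketch — the tag names are illustrative, the seven lane models are the ones in the cost breakdown on this page:

```python
# Sketch of the lane router: the classifier emits a tag, a static table
# maps it to a model lane. Tag taxonomy is an assumption for illustration.
LANES = {
    "synthesis":    "claude-sonnet-4.6",   # default lane
    "premium":      "claude-opus-4.7",     # premium escalation
    "cn_research":  "qwen3-max",           # Mainland CN research
    "fast_lookup":  "claude-haiku-4.5",    # cheap fast lookups
    "bulk":         "nova-pro",
    "long_context": "llama-4-scout",
    "translation":  "deepseek-v4",
}

def route(tag: str) -> str:
    # Unknown tags fall back to the default (Sonnet) lane.
    return LANES.get(tag, LANES["synthesis"])

print(route("cn_research"))  # qwen3-max
print(route("mystery"))      # claude-sonnet-4.6
```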
Step 3 · Answer
Generate with live data
The model streams its response, calls MCP tools (Bloomberg, Refinitiv, CLSA models) for live numbers, and emits structured tables and charts as JSON artifacts.
→ streamed text + tool outputs · ~3 s
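On the wire, a structured artifact might look like the event below. The field names and the MCP tool name are assumptions for illustration, not the real schema:

```python
import json

# Illustrative shape of one streamed event carrying a tool result as a
# structured table artifact. Field names and tool name are hypothetical.
event = {
    "type": "artifact",
    "tool": "market-data/quote",   # hypothetical MCP tool
    "artifact": {
        "kind": "table",
        "columns": ["ticker", "last", "chg_pct"],
        "rows": [["0005.HK", 71.20, 0.8]],
    },
}

wire = json.dumps(event)           # what the client receives in the stream
decoded = json.loads(wire)
print(decoded["artifact"]["kind"])  # table
```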
Step 4 · Cite
Anchor every claim
Each fact links back to the source note + page anchor. No citation, no answer — that's policy, not a feature. A hallucinated CLSA target would be a regulatory event.
→ refusal default · auditable trail
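The "no citation, no answer" policy is a hard gate, not a prompt instruction; a minimal sketch, with the claim/anchor shapes assumed for illustration:

```python
# Sketch of the citation gate: every claim in a drafted answer must carry
# a source anchor, otherwise the system refuses. Shapes are illustrative.

def enforce_citations(claims):
    """Release the answer only if every claim is anchored; else refuse."""
    if all(c.get("source") and c.get("page") for c in claims):
        return "ANSWER"
    return "REFUSE"  # refusal is the default; the gap is logged for audit

draft = [
    {"text": "FY25 revenue grew 14%", "source": "CLSA-2026-0113", "page": 4},
    {"text": "Target price HK$92", "source": None, "page": None},
]
print(enforce_citations(draft))  # REFUSE
```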
The plan · 12 months · four phases

Pilot in 60 days. Ship to 3 k by month 6. Re-decide GPUs in month 9.

Phase 1 · Month 0–1
Foundation
Embed the 300 k-report corpus · stand up retrieval · wire Sonnet 4.6 · build the 200-question golden eval · pilot 50 senior analysts.
Phase 2 · Month 1–3 · current
Production
Multi-model router live (Sonnet + Haiku + Qwen + 4 more) · prompt-cache enabled · Research + Market-Data + News MCP servers · roll to 1 k users.
Phase 3 · Month 3–6
Scale to 3 k
All 4 caches stacked · episodic memory · Models + User MCP · CN-language polish · roll to all 3 k analysts.
Phase 4 · Month 6–12
Re-decide
Measured re-decision: hybrid 2× H20 only if compliance / scale demands · embedding fine-tune on real query log · evaluate next-gen models as they ship.
If you're wondering

Eight questions asked. Eight answered.

Cost, risk, alternatives, the GPU thing, the Gemma thing, what happens during an outage.

Why are we not buying GPUs?

At 3 k analysts averaging 200 messages each, we expect ~600 k messages a month. Cloud APIs (Bedrock + Alibaba) cost ~$171 k a year for that workload. Six H20 servers cost ~$640 k a year (capex amortized + opex) for the same load — and they sit at 30 % utilization most of the day.

We'd be paying nearly 4× more for measurably worse quality on premium queries (open-weights models still trail Sonnet 4.6 / Opus 4.7 on multi-doc finance reasoning) and giving up the ability to swap to better models as they ship.

What if Compliance says research notes can't leave Hong Kong?

We have a Plan B already drawn: 2 H20 servers (not 6) running open-weights models on-prem for sensitive queries, with cloud still used for premium queries that pass legal review. That's $213 k/yr — half the cost of the 6-machine quote, with ~20 % headroom for growth.

We need Compliance to give a clear ruling before Phase 1 ships. The default plan assumes they accept Bedrock SG + Alibaba HK with no-retention contracts.

Why not just use ChatGPT Enterprise?

ChatGPT Enterprise gives analysts a great chat UX, but it doesn't know about CLSA's 300 k-report archive, our valuation models, our target prices, or our entitlement rules. This system is the answer to "Q&A on our own data" — that's the thing ChatGPT can't do.

GPT-5 itself can be one of the models in our router (we considered it; it's not on AWS Bedrock so we'd add a vendor) — but it can't be the whole product.

Can we self-host Gemma 4 / open-weights to save money?

The weights are free; the deployment isn't. Gemma 4 27B on a single AWS L40S GPU runs ~$1,500/mo. The same workload via Bedrock Llama 4 Scout costs ~$86/mo. That's 17× more expensive — and quality is also lower than the cloud model.

Self-hosting only makes sense when the GPUs are already paid-for (the 2× H20 hybrid scenario). On rented cloud GPUs, it's strictly worse.

Is AWS Bedrock as good as calling Anthropic / OpenAI directly?

For Claude: yes, parity. Bedrock Singapore matches Anthropic-direct pricing for Sonnet / Opus / Haiku, including the full 90 %-off prompt-cache discount. For Amazon Nova and Llama 4: same provider (AWS), so trivially yes.

The trade-off is that GPT-5, Gemini, and MiniMax aren't on Bedrock. We're not using them in the recommended Phase-1 lineup. If eval shows we're losing >5 % quality by skipping them, we add Vertex / OpenAI direct in Phase 3.

What's our risk if Bedrock or Alibaba has an outage?

Each lane has a fallback. Sonnet → cross-region inference to Tokyo (Bedrock CRI, parity pricing). Bedrock-wide outage → degrade to Alibaba (Qwen3 Max) with a banner. Alibaba outage → degrade to Bedrock-only with the cheap-path lanes routed to Haiku.

We've never had both providers down simultaneously. If we ever did, the system shows a status banner and falls back to "search-only" mode (retrieval works, generation is paused) until a provider returns.
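The degradation ladder described above reduces to an ordered provider list per lane; a minimal sketch, with illustrative provider ids:

```python
# Sketch of per-lane failover: try providers in order, and degrade to
# search-only mode (retrieval up, generation paused) if all are down.
FALLBACKS = {
    # Sonnet lane: Bedrock SG -> Tokyo cross-region inference -> Alibaba Qwen
    "sonnet": ["bedrock-sg", "bedrock-tokyo-cri", "alibaba-qwen3-max"],
}

def call(lane, is_up):
    for provider in FALLBACKS[lane]:
        if is_up(provider):
            return f"answered via {provider}"
    return "search-only mode"  # status banner shown until a provider returns

print(call("sonnet", lambda p: p == "alibaba-qwen3-max"))
# -> answered via alibaba-qwen3-max
print(call("sonnet", lambda p: False))
# -> search-only mode
```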

When do we revisit this decision?

Quarterly, but specifically in Month 9 (Q4 2026) we re-evaluate the GPU question with real usage data. Triggers for re-opening:

(1) Compliance changes the data-residency rules. (2) Volume crosses 6 M messages/month (10× our baseline). (3) A new frontier model launches that we can only run on-prem. (4) Total cloud spend crosses $400 k/yr.

What does the steering committee actually need to approve?

Four things, listed in the next section: the $171 k/yr Phase-1 budget, the Bedrock + Alibaba no-retention contracts (Compliance), the named owner of the golden eval (Research desk), and the Q4 2026 re-decision date for the GPU question.

Quality × cost · the model landscape

Each model wins a region. Routing is just respecting that.

[Chart · quality (↑) vs cost (cheaper ← → more expensive) · cloud points: Opus 4.7, GPT-5.5, Sonnet 4.6, GPT-5 mini, Qwen3 Max, Haiku 4.5, Nova Pro, DeepSeek V4, Llama 4 Scout, Gemma 4 27B · self-host points: Gemma 4 on L40S, DeepSeek V4 on H20]

Frontier (top-right): Opus 4.7 + Sonnet 4.6 (Bedrock SG) and GPT-5.5 (OpenAI direct, reference only) cluster at the highest quality. Used sparingly, cached aggressively.

Mid-tier specialists: Qwen3 Max for Mainland filings, Haiku + Nova Pro for fast factual lookups, GPT-5 mini as an OpenAI-direct alternative.

Cheap path (lower-left): Llama 4 Scout (10 M context), DeepSeek V4 (CN-EN translation), Gemma 4 27B (open weights). All cloud-priced.

Self-host points (muted) sit south-east of the cloud equivalent — same weights, more cost, similar quality. That's the trade we'd be buying.

For the steering meeting

Four decisions. Then we ship.

Phase 1 budget
$171 k / yr
1 · Approve cloud-first + budget
$171 k/yr on AWS Bedrock SG + Alibaba HK · 2× current MAU baseline. Authorize Phase 1 ship to 1 k users by month 3.
2 · Compliance sign-off
Bedrock + Alibaba no-retention contracts reviewed and accepted. Confirms research notes can traverse both providers.
3 · Name the eval owner
Research desk owns the 200-question golden set. Without a named owner, the router devolves to "vibes" inside a quarter.
4 · Set the re-decision date
Q4 2026 GPU re-evaluation locked on the calendar. Triggers above also re-open the question early if hit.

The numbers on this page are mid-2026 list prices. Western models priced at AWS Bedrock ap-southeast-1; Chinese models priced at Alibaba Cloud DashScope HK, with prompt-cache discounts applied where supported (Bedrock Claude at 90 %, Nova at 75 %, Llama at 50 %; Alibaba Qwen / DeepSeek at ~50 %). Self-host annualized cost prorates Laurent's 6× quote ($1.7 M capex / 5 yr + $296 k/yr opex) per machine.

For the full architecture, model lineup, MCP design, cache strategy, GPU deep-dive, and all working math, see the technical brief →.