AWS Bedrock + Alibaba Cloud
- Best model on every query class
- No hardware to buy or maintain
- Switch models in a config file as new ones ship (sketch below)
- Zero capex · zero hardware risk
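What "switch models in a config file" means in practice: a minimal sketch. The lane names and model IDs below are illustrative placeholders, not the shipped schema.

```python
# Illustrative routing table. Lane names and model IDs are placeholders,
# not the production schema. Adopting a newly shipped model is a one-line
# edit here, not a hardware purchase.
ROUTES = {
    "premium":  {"provider": "bedrock-sg", "model": "claude-sonnet-4-6"},
    "mainland": {"provider": "alibaba-hk", "model": "qwen3-max"},
    "cheap":    {"provider": "bedrock-sg", "model": "llama-4-scout"},
}
```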
We can give 3 k analysts a Perplexity-Finance-grade research assistant for ~0.2 % of one Bloomberg seat per analyst — by routing each question to the best model on AWS Bedrock or Alibaba Cloud, instead of buying $1.7 M of GPUs that would sit at 30 % utilization.
Why these defaults: 50 k registered users, 1.5 k MAU today. The 3 k baseline is 2 × today's MAU — headroom for the assistant becoming a daily habit, not a one-time sizing for current usage.
Same product. Same RAG plumbing. Same UX. The only thing that changes is where the model weights live and who writes the cheque.
Cost, risk, alternatives, the GPU thing, the Gemma thing, what happens during an outage. Each is covered below.
At 3 k analysts averaging 200 messages a month each, we expect ~600 k messages a month. Cloud APIs (Bedrock + Alibaba) cost ~$171 k a year for that workload. Six H20 servers cost ~$640 k a year (capex amortized + opex) for the same load — and they sit at 30 % utilization most of the day.
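The arithmetic behind those figures, using only the numbers quoted on this page (a sanity check, not a pricing model):

```python
# Volume: 3 k analysts x 200 messages/month each.
analysts, msgs_each = 3_000, 200
msgs_per_month = analysts * msgs_each                 # 600,000

# On-prem: the 6x H20 quote, capex amortized over 5 years, plus opex.
onprem_per_year = 1_700_000 / 5 + 296_000             # ~$636 k -> "~$640 k"

cloud_per_year = 171_000                              # Phase-1 cloud budget
print(msgs_per_month, round(onprem_per_year / cloud_per_year, 1))  # 600000 3.7
```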
We'd be paying 4× more for measurably worse quality on premium queries (open-weights models still trail Sonnet 4.6 / Opus 4.7 on multi-doc finance reasoning) and giving up the ability to swap to better models as they ship.
We have a Plan B already drawn: 2 H20 servers (not 6) running open-weights models on-prem for sensitive queries, with cloud still used for premium queries that pass legal review. That's $213 k/yr — a third of the cost of the 6-machine quote, with ~20 % headroom for growth.
We need Compliance to give a clear ruling before Phase 1 ships. The default plan assumes they accept Bedrock SG + Alibaba HK with no-retention contracts.
ChatGPT Enterprise gives analysts a great chat UX, but it doesn't know about CLSA's 300 k-report archive, our valuation models, our target prices, or our entitlement rules. This system is the answer to "Q&A on our own data" — that's the thing ChatGPT can't do.
GPT-5 itself can be one of the models in our router (we considered it; it's not on AWS Bedrock so we'd add a vendor) — but it can't be the whole product.
The weights are free; the deployment isn't. Gemma 4 27B on a single AWS L40S GPU runs ~$1,500/mo. The same workload via Bedrock Llama 4 Scout costs ~$86/mo. That's 17× more expensive — and quality is also lower than the cloud model.
Self-hosting only makes sense when the GPUs are already paid for (the 2× H20 hybrid scenario). On rented cloud GPUs, it's strictly worse.
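The 17× figure is just the ratio of the two monthly numbers above:

```python
# Gemma 4 27B self-hosted on one rented L40S vs the same load on Bedrock.
self_host_per_month = 1_500   # $/mo, AWS L40S instance
bedrock_per_month   = 86      # $/mo, Bedrock Llama 4 Scout
print(round(self_host_per_month / bedrock_per_month, 1))   # -> 17.4
```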
For Claude: yes, parity. Bedrock Singapore matches Anthropic-direct pricing for Sonnet / Opus / Haiku, including the full 90 %-off prompt-cache discount. For Amazon Nova and Llama 4: same provider (AWS), so trivially yes.
The trade-off is that GPT-5, Gemini, and MiniMax aren't on Bedrock. We're not using them in the recommended Phase-1 lineup. If eval shows we're losing >5 % quality by skipping them, we add Vertex / OpenAI direct in Phase 3.
Each lane has a fallback. Sonnet → cross-region inference to Tokyo (Bedrock CRI, parity pricing). Bedrock-wide outage → degrade to Alibaba (Qwen3 Max) with a banner. Alibaba outage → degrade to Bedrock-only with the cheap-path lanes routed to Haiku.
We've never had both providers down simultaneously. If we ever did, the system shows a status banner and falls back to "search-only" mode (retrieval works, generation is paused) until a provider returns.
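A minimal sketch of that degradation chain, assuming a router service in front of both providers. The lane names, exceptions, and client call are placeholders, not the real implementation.

```python
class ProviderDown(Exception):
    """Raised by the client when a provider is unreachable."""

class SearchOnlyMode(Exception):
    """Both providers down: retrieval keeps working, generation pauses."""

def call_provider(target: str, prompt: str) -> str:
    raise NotImplementedError  # real Bedrock / DashScope clients plug in here

# Ordered fallbacks per lane, mirroring the chain described above.
FALLBACKS = {
    "bedrock-sg:sonnet": ["bedrock-tokyo:sonnet",    # cross-region inference
                          "alibaba-hk:qwen3-max"],   # Bedrock-wide outage
    "alibaba-hk:qwen3-max": ["bedrock-sg:haiku"],    # Alibaba outage
}

def complete(lane: str, prompt: str) -> str:
    for target in [lane, *FALLBACKS.get(lane, [])]:
        try:
            return call_provider(target, prompt)
        except ProviderDown:
            continue                                 # banner up, try next lane
    raise SearchOnlyMode("generation paused until a provider returns")
```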
Quarterly, but specifically in Month 9 (Q4 2026) we re-evaluate the GPU question with real usage data. Triggers for re-opening:
- Compliance changes the data-residency rules.
- Volume crosses 6 M messages/month (10× our baseline).
- A new frontier model launches that we can only run on-prem.
- Total cloud spend crosses $400 k/yr.
Four things, listed in the next section: the $171 k/yr Phase-1 budget, the Bedrock + Alibaba no-retention contracts (Compliance), the named owner of the golden eval (Research desk), and the Q4 2026 re-decision date for the GPU question.
Frontier (top-right): Opus 4.7 + Sonnet 4.6 (Bedrock SG) and GPT-5.5 (OpenAI direct, reference only) cluster at the highest quality. Used sparingly, cached aggressively.
Mid-tier specialists: Qwen3 Max for Mainland filings, Haiku + Nova Pro for fast factual lookups, GPT-5 mini as an OpenAI-direct alternative.
Cheap path (lower-left): Llama 4 Scout (10 M context), DeepSeek V4 (CN-EN translation), Gemma 4 27B (open weights). All cloud-priced.
Self-host points (muted) sit south-east of the cloud equivalent — same weights, more cost, similar quality. That's the trade we'd be buying.
The numbers on this page are mid-2026 list prices. Western models priced at AWS Bedrock ap-southeast-1; Chinese models priced at Alibaba Cloud DashScope HK, with prompt-cache discounts applied where supported (Bedrock Claude at 90 %, Nova at 75 %, Llama at 50 %; Alibaba Qwen / DeepSeek at ~50 %). Self-host annualized cost prorates Laurent's 6× quote ($1.7 M capex / 5 yr + $296 k/yr opex) per machine.
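The proration, spelled out with the same numbers as the quote:

```python
# The quote: $1.7 M capex over 5 years plus $296 k/yr opex for 6 machines.
six_machines = 1_700_000 / 5 + 296_000   # $636 k/yr -> the "~$640 k" above
per_machine  = six_machines / 6          # ~$106 k/yr per H20 server
two_machines = 2 * per_machine           # ~$212 k/yr -> the $213 k hybrid Plan B
print(round(six_machines), round(per_machine), round(two_machines))
# -> 636000 106000 212000
```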
For the full architecture, model lineup, MCP design, cache strategy, GPU deep-dive, and all working math, see the technical brief →.