06 · Questions

The questions you'll get. Crisp answers, no hedging.

Eight decisions that analysts and desk heads will raise. Each verdict is derived from the same Cost numbers: change the workload above and the dollar references move with it.

Should we just use Gemma (or any open-weight model) and skip the frontier models?

No. Use open-weight models as a router lane for low-stakes ops only.

Gemma 4 31B scores ~25 pp below GPT-5 / DeepSeek V4 Pro on Golden Eval financial reasoning.
Third-party hosting clears at ~$0.14 in / $0.40 out per 1 M tokens · saves under $200 / month at our scale.
Wrong question to optimize · token cost is < 0.1 % of one bad call to the desk.
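
The per-call arithmetic behind that last line is short. A minimal sketch, assuming the quoted Gemma prices are USD per 1 M tokens; the call size below is hypothetical, not our measured workload.

```python
def call_cost_usd(tokens_in: int, tokens_out: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call, with prices quoted per 1 M tokens."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000


# Gemma third-party prices from the text, assumed to be USD per 1 M tokens.
GEMMA_IN, GEMMA_OUT = 0.14, 0.40

# Hypothetical call size: 20 k tokens of context in, 1 k tokens of answer out.
cost = call_cost_usd(20_000, 1_000, GEMMA_IN, GEMMA_OUT)
print(f"${cost:.4f} per call")
```

At that size a call costs a fraction of a cent, which is why per-token savings are the wrong lever next to the cost of a bad answer.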

Should we standardize on DeepSeek V4 Pro as the workhorse?

Yes — for the workhorse lane. Not for every query.

About 1/5 the per-token cost of GPT-5 · best CN-equity reasoning in our Golden Eval.
Weaker on EN long-form composition — the router falls back to GPT-5.5 there.
Already the default in the Platinum lineup · 30 % of blended traffic.
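
The lane logic can be sketched as a simple rule-based router. This is an illustration, not the production router: the `Query` fields, the rules, and the model labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    language: str      # e.g. "en", "zh"
    long_form: bool    # True for long-form composition (reports, memos)
    low_stakes: bool   # True for ops tasks (tagging, formatting, triage)


# Hypothetical model labels; a real deployment would use provider model IDs.
WORKHORSE = "deepseek-v4-pro"
EN_LONG_FORM = "gpt-5.5"
OPEN_WEIGHT = "gemma-4-31b"


def route(q: Query) -> str:
    """Pick a model lane for one query."""
    if q.low_stakes:
        return OPEN_WEIGHT   # cheap lane, low-stakes ops only
    if q.language == "en" and q.long_form:
        return EN_LONG_FORM  # fallback where the workhorse is weaker
    return WORKHORSE         # default for everything else


print(route(Query("zh", long_form=False, low_stakes=False)))
```

Keeping the rules this explicit makes the 30 % workhorse share auditable: every query that leaves the default lane does so for a named reason.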

Should we rent GPUs and self-host the frontier models?

Not at current scale. Revisit at 4× usage or if Diamond-level usage becomes sustained.

2× H100 SG 1-yr commit = $736 k over 3 yr · fixed, regardless of usage.
Current cloud at Gold = $513 k over 3 yr · scales with the workload above.
Crossover only matters if utilization holds > 70 % · and we lose model agility the day we commit.
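
The fixed-versus-variable comparison is one division. A minimal sketch using the two 3-year totals above, assuming cloud spend scales linearly with workload and that the 2× H100 box has headroom for the extra load.

```python
SELF_HOST_3YR = 736_000      # 2× H100 SG, fixed regardless of usage (from the text)
CLOUD_3YR_AT_GOLD = 513_000  # cloud at Gold, current workload (from the text)


def cloud_3yr(workload_multiple: float) -> float:
    """3-year cloud spend, assuming it scales linearly with workload."""
    return CLOUD_3YR_AT_GOLD * workload_multiple


# Workload multiple at which linear cloud spend equals the fixed self-host bill.
breakeven = SELF_HOST_3YR / CLOUD_3YR_AT_GOLD
print(f"break-even at {breakeven:.2f}x current workload")
```

This is a usage-only break-even. It ignores the utilization and model-agility conditions above, so on its own it is not a case for committing.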

Should we buy GPUs (H20) in HK for data residency?

Only if compliance hard-requires on-prem inference.

2× H20 = $900 k over 3 yr · 1/3 the throughput of an H100 (~50 vs ~150 slots).
Locked to open-weight models that run on H20 (Qwen 7B–72B, Llama 3.3); no Claude or GPT.
Cloud APIs run on H200 / B200 in SG; H100/H200/B200 cannot ship to HK under BIS rules.
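
Normalizing by throughput makes the gap concrete. A minimal sketch, assuming the ~50 / ~150 slot counts are per GPU; the ratio is the same if they are per pair. Note the H100 figure is an SG rental commit and the H20 figure a purchase, both stated as 3-year totals.

```python
# 3-year totals and slot counts from the text; slots assumed to be per GPU.
H20_3YR, H20_GPUS, H20_SLOTS = 900_000, 2, 50
H100_3YR, H100_GPUS, H100_SLOTS = 736_000, 2, 150


def cost_per_slot(total_usd: float, gpus: int, slots_per_gpu: int) -> float:
    """3-year cost per concurrent serving slot."""
    return total_usd / (gpus * slots_per_gpu)


h20 = cost_per_slot(H20_3YR, H20_GPUS, H20_SLOTS)      # 9,000
h100 = cost_per_slot(H100_3YR, H100_GPUS, H100_SLOTS)  # ~2,453
print(f"H20 ${h20:,.0f}/slot vs H100 ${h100:,.0f}/slot ({h20 / h100:.1f}x)")
```

On these numbers each H20 serving slot costs roughly 3.7× an H100 slot, before counting the narrower model menu.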

Should we lock in to a single vendor — say, just Azure OpenAI?

No. Use Azure for OpenAI lanes only; route others to their best home.

OpenAI on Azure SG / HK: same list price · regional compliance covered.
Bedrock SG for Claude · Alibaba for Qwen · Moonshot for Kimi · 1-day swap when prices move.
If DeepSeek V5 ships at half the price, the router swaps it in for V4 with no contract change.
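
The multi-vendor setup amounts to a routing table keyed by lane. A minimal sketch, assuming a plain config dict; the lane keys, host labels (including `deepseek-api`), and model labels are hypothetical, not real provider IDs or the production config.

```python
# Hypothetical lane -> (host, model) table; labels are illustrative only.
ROUTES: dict[str, tuple[str, str]] = {
    "openai":   ("azure-sg", "gpt-5.5"),
    "claude":   ("bedrock-sg", "claude-opus-4.7"),
    "qwen":     ("alibaba", "qwen"),
    "kimi":     ("moonshot", "kimi"),
    "deepseek": ("deepseek-api", "deepseek-v4-pro"),
}


def endpoint(lane: str) -> str:
    """Resolve a lane to its current host/model pair."""
    host, model = ROUTES[lane]
    return f"{host}/{model}"


# A price move is a one-entry config edit, not a contract change.
ROUTES["deepseek"] = ("deepseek-api", "deepseek-v5")
print(endpoint("deepseek"))
```

Because callers only ever ask for a lane, nothing downstream changes when an entry is repointed.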

Should every desk run on the Diamond plan (Claude Opus 4.7-led, max quality)?

Only for Strategy + EM. Platinum is the right firm-wide default.

Diamond ≈ +12 % over Platinum on the Golden Eval · most queries don't need that headroom.
Router already escalates Opus-class queries on demand · no need to pay for it on every msg.
The per-analyst-year cost gap between Diamond and Platinum is small at current msg/mo; it widens at 2×.
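
The escalation trade-off is linear arithmetic: pay the Diamond premium only on the escalated share, and the per-analyst gap scales with message volume. A sketch with hypothetical per-message costs and escalation share; none of these numbers come from the Cost section.

```python
def blended_cost_per_msg(base: float, premium: float, escalation_share: float) -> float:
    """Average cost per message when only `escalation_share` of messages escalate."""
    return (1 - escalation_share) * base + escalation_share * premium


def annual_gap_per_analyst(msgs_per_month: float, gap_per_msg: float) -> float:
    """Per-analyst-year cost difference; linear in message volume."""
    return msgs_per_month * 12 * gap_per_msg


# Hypothetical inputs, not figures from the Cost section.
PLATINUM, DIAMOND = 0.02, 0.10  # USD per message
SHARE = 0.10                    # share of messages the router escalates

with_escalation = blended_cost_per_msg(PLATINUM, DIAMOND, SHARE)
gap = DIAMOND - with_escalation  # saved per message vs always-Diamond
for msgs in (1_000, 2_000):
    print(msgs, round(annual_gap_per_analyst(msgs, gap), 2))
```

Doubling message volume doubles the annual gap, which is the "widens at 2×" effect.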

Will the client's LLM use fewer tokens with MCP?

Yes — vs the realistic baseline of stuffing filings into context.

Without tools, the LLM reads raw filings + reports to answer questions · easily 50–100 k tokens of context per call.
With MCP, server-side compute returns compact results · a DCF ≈ 50 tokens, peer compare ≈ 200, not 5 k of LLM scratchpad.
Tool definitions add ~500–2 k tokens to the system prompt; prompt caching on Claude / GPT-5 discounts them after the first call, and the data-offload savings dwarf that overhead.
Vs. other tool-calling approaches (OpenAI native function calling, custom HTTP), MCP is token-neutral · its value there is one tool surface across every consumer.
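
The input-token comparison can be worked through with the figures above. A minimal sketch; the 100-token question is a hypothetical placeholder, tool definitions are taken at the 2 k upper end, and the extra round-trip a tool call adds (which re-sends the cached system prompt) is ignored.

```python
# Token figures from the text; QUESTION is a hypothetical placeholder.
RAW_CONTEXT = (50_000, 100_000)    # filings + reports stuffed into context
TOOL_DEFS_MAX = 2_000              # upper end of tool definitions in the system prompt
DCF_RESULT, PEER_RESULT = 50, 200  # compact MCP tool results
QUESTION = 100                     # hypothetical analyst question


def mcp_input_tokens(results: list[int]) -> int:
    """Input tokens for one call that uses MCP tool results instead of raw documents."""
    return TOOL_DEFS_MAX + QUESTION + sum(results)


with_mcp = mcp_input_tokens([DCF_RESULT, PEER_RESULT])
for raw in RAW_CONTEXT:
    baseline = raw + QUESTION
    print(f"{baseline:,} vs {with_mcp:,} tokens -> {baseline / with_mcp:.0f}x fewer")
```

Even at the top of the tool-definition range, the offloaded data dominates the comparison.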

Should we wait for the next model generation before committing?

No — the router upgrades per-lane. Whatever ships next is a config change.

When GPT-6 or Claude Opus 4.8 lands, the router promotes it on the lanes it wins · no contract change needed.
Delaying 6 months costs the firm ~$370 k of analyst value · workload doesn't pause for the model roadmap.
Multi-vendor by design · worst-case lock-in is 30 days of usage data on whichever lane held the spot.
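
Per-lane promotion reduces to a per-lane argmax over eval scores. A minimal sketch; the lane names, model names, and scores are hypothetical placeholders, not Golden Eval results.

```python
# Hypothetical per-lane eval scores; placeholders, not real results.
SCORES: dict[str, dict[str, float]] = {
    "cn_equity":    {"model-a": 0.81, "model-b": 0.78},
    "en_long_form": {"model-a": 0.70, "model-b": 0.79},
}


def promote(scores: dict[str, dict[str, float]], new_model: str,
            new_scores: dict[str, float]) -> dict[str, str]:
    """Merge a new model's scores and return the winning model per lane."""
    merged = {lane: dict(models) for lane, models in scores.items()}
    for lane, score in new_scores.items():
        merged.setdefault(lane, {})[new_model] = score
    # max() keeps the first of equal scores, so incumbents win ties.
    return {lane: max(models, key=models.get) for lane, models in merged.items()}


print(promote(SCORES, "next-gen", {"cn_equity": 0.80, "en_long_form": 0.83}))
```

The new model takes only the lanes it strictly wins; the input table is copied, not mutated, so a rollback is just the previous config.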