06 · Questions

The questions you'll get. Crisp answers, no hedging.

Eight decisions that analysts and desk heads will raise. Each verdict is derived from the same Cost numbers: change the workload above and the dollar figures move with it.

Should we just use Gemma (or any open-weight model) and skip the frontier models?
No. Use open-weights as a router lane for low-stakes ops only.
  • Gemma 4 31B accuracy drops ~25 pp on Golden Eval financial reasoning vs GPT-5 / DeepSeek V4 Pro.
  • Third-party hosting clears at ~$0.14 in / $0.40 out — saves under $200 / month at our scale.
  • Token cost is the wrong thing to optimize · it is < 0.1 % of the cost of one bad call to the desk.
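The savings bullet is straight token arithmetic. Below is a minimal Python sketch. It assumes the quoted Gemma hosting prices are USD per million tokens, which is the common convention but is not stated above. `monthly_cost_usd` and `monthly_savings_usd` are hypothetical helpers. The token volumes and the incumbent price pair in the usage note are hypothetical placeholders, not our Cost-section figures.

```python
def monthly_cost_usd(input_tokens: float, output_tokens: float,
                     price_in_per_m: float, price_out_per_m: float) -> float:
    """Monthly API spend, with prices assumed to be USD per million tokens."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m


def monthly_savings_usd(input_tokens: float, output_tokens: float,
                        incumbent: tuple[float, float],
                        candidate: tuple[float, float]) -> float:
    """Savings from moving the same token volume from the incumbent (in, out)
    price pair to the candidate (in, out) price pair."""
    return (monthly_cost_usd(input_tokens, output_tokens, *incumbent)
            - monthly_cost_usd(input_tokens, output_tokens, *candidate))
```

Take a hypothetical 100 M input and 20 M output tokens a month. Hosted Gemma at $0.14 / $0.40 comes to about $22. Against a hypothetical incumbent at $1.00 / $4.00, the switch saves about $158. Substitute the real workload to check the under-$200 figure.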
Should we standardize on DeepSeek V4 Pro as the workhorse?
Yes — for the workhorse lane. Not for every query.
  • About 1/5 of GPT-5 per token · best CN-equity reasoning in our Golden Eval.
  • Weaker on EN long-form composition — the router falls back to GPT-5.5 there.
  • Already the default in the Platinum lineup · 30 % of blended traffic.
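The workhorse-plus-fallback lane can be pictured as a small routing rule. This is a sketch only. It assumes a query can be tagged by language and by whether it needs long-form composition. The `Query` type, the `route` function and the model identifier strings are hypothetical. They are not real vendor model IDs or our router's actual code.

```python
from dataclasses import dataclass

# Hypothetical identifiers; the model names follow the bullets above.
WORKHORSE = "deepseek-v4-pro"
EN_LONGFORM_FALLBACK = "gpt-5.5"


@dataclass(frozen=True)
class Query:
    language: str    # e.g. "en" or "zh"
    long_form: bool  # True when the answer is a long composed write-up


def route(q: Query) -> str:
    """Default every query to the workhorse lane; send English
    long-form composition to the fallback model."""
    if q.language == "en" and q.long_form:
        return EN_LONGFORM_FALLBACK
    return WORKHORSE
```

Keeping the rule as data plus one function means swapping either model is a one-line config change.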
Should we rent GPUs and self-host the frontier models?
Not at current scale. Revisit at 4× usage or sustained Diamond.
  • 2× H100 SG 1-yr commit = $736 k over 3 yr · fixed, regardless of usage.
  • Current cloud at Gold = $513 k over 3 yr · scales with the workload above.
  • Crossover only matters if utilization holds > 70 % · model agility lost the day we commit.
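The self-host comparison reduces to a break-even on the dollar line. This sketch makes a hypothetical assumption: 3-year cloud spend scales linearly with workload from the $513 k Gold figure, while the $736 k H100 commit stays fixed. It ignores several factors that the verdict above also weighs:
  • whether 2× H100 has the capacity for that workload
  • ops staffing
  • lost model agility
The function names are hypothetical.

```python
SELF_HOST_3YR = 736_000   # 2x H100 SG, 1-yr commit, over 3 yr (from the text)
CLOUD_GOLD_3YR = 513_000  # current cloud at Gold, over 3 yr (from the text)


def three_year_cloud_cost(gold_baseline_usd: float, workload_multiple: float) -> float:
    """Cloud spend, assumed (for this sketch) to scale linearly with workload."""
    return gold_baseline_usd * workload_multiple


def break_even_multiple(self_host_fixed_usd: float, gold_baseline_usd: float) -> float:
    """Workload multiple of Gold at which linear cloud spend equals the fixed commit."""
    return self_host_fixed_usd / gold_baseline_usd


print(f"{break_even_multiple(SELF_HOST_3YR, CLOUD_GOLD_3YR):.2f}x Gold")  # prints 1.43x Gold
```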
Should we buy GPUs (H20) in HK for data residency?
Only if compliance hard-requires on-prem inference.
  • 2× H20 = $900 k over 3 yr · 1/3 the throughput of an H100 (~50 vs ~150 slots).
  • Locked to open-weight models that run on H20 (Qwen 7B–72B, Llama 3.3) · no Claude / GPT.
  • Cloud APIs run on H200 / B200 in SG; H100 / H200 / B200 cannot ship to HK under BIS rules.
Should we lock in to a single vendor — say, just Azure OpenAI?
No. Use Azure for OpenAI lanes only; route others to their best home.
  • OpenAI on Azure SG / HK: same list price · regional compliance covered.
  • Bedrock SG for Claude · Alibaba for Qwen · Moonshot for Kimi · 1-day swap when prices move.
  • When DeepSeek V5 ships at half-price, the router replaces V4 — no contract change needed.
Should every desk run on the Diamond plan (Claude Opus 4.7-led, max quality)?
Only for Strategy + EM. Platinum is the right firm-wide default.
  • Diamond ≈ +12 % over Platinum on the Golden Eval · most queries don't need that headroom.
  • Router already escalates Opus-class queries on demand · no need to pay for it on every msg.
  • Diamond's per-analyst-year premium over Platinum is small at current messages per month · it widens at 2× volume.
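The escalate-on-demand argument is an expected-value calculation. This minimal sketch assumes a per-message cost is known for each lane or plan. The function names and the unit costs in the usage note are hypothetical, not Cost-section numbers.

```python
def blended_cost_per_msg(escalation_rate: float, platinum_cost: float,
                         opus_cost: float) -> float:
    """Expected per-message cost when only a fraction of messages
    escalate from the Platinum lane to the Opus-class lane."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return (1.0 - escalation_rate) * platinum_cost + escalation_rate * opus_cost


def annual_gap_per_analyst(msgs_per_month: float, diamond_cost: float,
                           platinum_cost: float) -> float:
    """Per-analyst-year premium of Diamond over Platinum; linear in message volume."""
    return msgs_per_month * 12 * (diamond_cost - platinum_cost)
```

Assume hypothetical costs of 1.0 unit per Platinum message and 5.0 per Opus-class message, with 10 % of messages escalating. The blended cost is 1.4 units, against 5.0 if every message went to the Opus-class lane. Because the annual gap is linear in volume, doubling messages per month doubles the per-analyst premium.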
Will the client's LLM use fewer tokens with MCP?
Yes — vs the realistic baseline of stuffing filings into context.
  • Without tools, the LLM reads raw filings + reports to answer questions · easily 50–100 k tokens of context per call.
  • With MCP, server-side compute returns compact results · a DCF ≈ 50 tokens, peer compare ≈ 200, not 5 k of LLM scratchpad.
  • Tool definitions add ~500–2 k tokens to the system prompt, cached on Claude / GPT-5 after the first call — the data-offload savings dwarf that.
  • Vs. another tool-calling protocol (OpenAI native, custom HTTP), MCP is equivalent on tokens · the value there is one tool surface across every consumer.
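The token argument can be checked with a rough per-session count. This sketch takes conservative points in the stated ranges:
  • 50 k of stuffed context per call (the low end)
  • 2 k of tool definitions (the high end)
  • about 250 tokens of results per call, for a DCF plus a peer compare
It also assumes a hypothetical cached-read rate of 10 % of the full input price after the first call. Real cache discounts vary by provider. User prompts and model outputs are left out of both sides. The function names are hypothetical.

```python
def stuffed_context_tokens(calls: int, context_per_call: int = 50_000) -> int:
    """Baseline: raw filings and reports pasted into every call."""
    return calls * context_per_call


def mcp_billed_tokens(calls: int, tool_defs: int = 2_000, result_tokens: int = 250,
                      cached_rate: float = 0.1) -> float:
    """MCP path in billed-equivalent tokens: tool definitions are billed in full
    on the first call, then at an assumed cached-read rate on later calls;
    each call adds only the compact tool results."""
    if calls <= 0:
        return 0.0
    definitions = tool_defs + (calls - 1) * tool_defs * cached_rate
    return definitions + calls * result_tokens
```

Over a hypothetical 10-call session, the stuffed baseline is 500 k tokens. The MCP path comes to about 6.3 k billed-equivalent tokens.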
Should we wait for the next model generation before committing?
No — the router upgrades per-lane. Whatever ships next is a config change.
  • When GPT-6 or Claude Opus 4.8 lands, the router promotes it on the lanes it wins · no contract change needed.
  • Delaying 6 months costs the firm ~$370 k of analyst value · workload doesn't pause for the model roadmap.
  • Multi-vendor by design · worst-case lock-in is 30 days of usage data on whichever lane held the spot.
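Per-lane promotion compares eval scores lane by lane. This sketch assumes each lane keeps a Golden Eval score for each model. `promote` and the lane names, model identifiers and scores below are all hypothetical.

```python
def promote(routing: dict[str, str],
            scores: dict[str, dict[str, float]],
            candidate: str) -> dict[str, str]:
    """Return a new routing table in which `candidate` replaces the incumbent
    on every lane where its eval score is strictly higher.
    scores[lane][model] holds that model's eval score on that lane."""
    updated = dict(routing)
    for lane, incumbent in routing.items():
        lane_scores = scores.get(lane, {})
        if candidate not in lane_scores:
            continue  # candidate not yet evaluated on this lane: keep the incumbent
        if lane_scores[candidate] > lane_scores.get(incumbent, float("-inf")):
            updated[lane] = candidate
    return updated


# Hypothetical lanes, model IDs and scores, for illustration only.
routing = {"cn_equity": "deepseek-v4-pro", "en_longform": "gpt-5.5"}
scores = {
    "cn_equity": {"deepseek-v4-pro": 0.82, "gpt-6": 0.80},
    "en_longform": {"gpt-5.5": 0.78, "gpt-6": 0.85},
}
print(promote(routing, scores, "gpt-6"))
# {'cn_equity': 'deepseek-v4-pro', 'en_longform': 'gpt-6'}
```

Returning a new table instead of mutating the old one keeps the previous routing available for rollback.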