Private & confidential
Quanthack 2026 · Systematic trading

HELM

An AI-built trading desk, and the research behind it.
Prepared by: Kit (Keith) So, Principal Researcher & Lead Engineer
Date: 18 June 2026 (for rounds 21–24 June)
Built with: Claude Opus (multi-agent, via Claude Code)
HELM · Quanthack 2026
Executive summary

An AI research loop turned 20 GB of market tape into a governed, validated trading book.

01 · The book
98 strategies searched, 22 deployed
A diversified intraday book across FX, metals and crypto. Sharpe above three, maximum drawdown under two percent, all measured net of realistic costs.
02 · The honest odds
Built for the prizes we can win
Simulation shows the book reaches the final roughly half the time, but cannot win first place on return alone. So we target consistency, Sharpe and the technology award.
03 · The method
Run by Claude, multi-agent
66,256 lines of Python across 289 modules, with 341 passing tests. Claude orchestrated the research programme itself, proposing, testing and refuting its own findings.
Governing thought

The durable edge in a rank tournament is not a magic price signal. It is calibrated variance plus survival, discovered and validated by a research loop that runs hundreds of times.

Source: portfolio_v3.py; win_simulator.py; repository
Context · the competition (Exhibit 1)

We are ranked against 299 rivals each round, so the objective is return rank subject to survival.

Scoring component   Weight   Implication
Return rank         70%      Relative return drives the score
Drawdown rank       15%      Shallow drawdowns are rewarded
Sharpe rank         10%      Risk-adjusted quality counts
Risk discipline     5%       Penalties for over-leverage
300 → 100 · Three 24-hour rounds cut the field
48 h · Final ranks the survivors
30% · Margin stop-out; forced liquidation eliminates you
30x · Maximum leverage permitted
So what

Return rank dominates, but a single forced liquidation ends the campaign. The optimisation is return rank conditioned on never being eliminated.
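
A minimal sketch of how the four weights could combine into a single composite score. Only the 70/15/10/5 weights come from the rules summarised above; the linear rank-to-score mapping, the function names and the example ranks are assumptions made for illustration and may differ from the official scoring.

```python
# Weights from the scoring table; everything else here is illustrative.
WEIGHTS = {"return": 0.70, "drawdown": 0.15, "sharpe": 0.10, "risk": 0.05}


def rank_to_score(rank: int, field_size: int) -> float:
    """Map a rank (1 = best) to a score in [0, 1]. A linear mapping is assumed."""
    if not 1 <= rank <= field_size:
        raise ValueError("rank must lie between 1 and field_size")
    if field_size == 1:
        return 1.0
    return 1.0 - (rank - 1) / (field_size - 1)


def composite_score(ranks: dict[str, int], field_size: int = 300) -> float:
    """Weighted sum of the per-component rank scores."""
    return sum(WEIGHTS[k] * rank_to_score(ranks[k], field_size) for k in WEIGHTS)


# Hypothetical player: mid-pack on return, strong on the quality components.
example = composite_score({"return": 150, "drawdown": 10, "sharpe": 5, "risk": 1})
```

Under this assumed mapping, the quality components can lift a mid-pack return rank only so far, because 70% of the score still sits on return.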

Source: docs/PRD.md, official rules verified 16 June 2026
Complication (Exhibit 2)

Under realistic spreads and a one-bar delay, most price-prediction alpha proved fragile or absent.

Frictionless backtest
Flatters everything
Ignore the spread and add no execution delay, and almost any signal looks profitable.
Honest backtest
Ours did not
Real spread on every fill, one-bar entry delay, cost measured from the order book itself.
  • We tested signal families across FX majors, metals and crypto at five time frames.
  • Most edges were thin or vanished entirely once costs and delay were applied honestly.
  • That result was the most useful of the first week. It forced the real question: if the signal edge is thin, where does the win come from?
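
The gap between the two backtests can be sketched in a few lines. This is an illustrative model, not the harness in research/: the `backtest` function, its arguments and the choice to charge half the spread per unit filled are assumptions.

```python
import numpy as np


def backtest(close, signal, spread_frac, delay_bars=1):
    """Net per-bar returns for a position signal in {-1, 0, +1}.

    signal[t] is known at the close of bar t. With delay_bars=0 the fill is
    assumed at that same close; each extra bar of delay pushes the fill later.
    spread_frac[t] is the full bid-ask spread as a fraction of mid.
    """
    close = np.asarray(close, dtype=float)
    signal = np.asarray(signal, dtype=float)
    spread_frac = np.asarray(spread_frac, dtype=float)

    ret = np.zeros_like(close)
    ret[1:] = close[1:] / close[:-1] - 1.0       # return earned during bar t

    lag = 1 + delay_bars                          # bars between signal and P&L
    pos = np.zeros_like(signal)
    pos[lag:] = signal[:-lag]                     # position held during bar t

    trades = np.abs(np.diff(pos, prepend=0.0))    # units traded entering bar t
    # Each fill is assumed to cross half the spread, so a round trip pays it once.
    # The spread on the bar of the position change is used as an approximation.
    cost = trades * 0.5 * spread_frac
    return pos * ret - cost


# Toy comparison on a rising price with an always-long signal.
prices = [100.0, 101.0, 102.0, 103.0]
frictionless = backtest(prices, [1, 1, 1, 1], [0.0] * 4, delay_bars=0)
honest = backtest(prices, [1, 1, 1, 1], [0.002] * 4, delay_bars=1)
```

In the toy case the honest version misses the first bar of the move and pays for its entry, which is the effect described in the bullets above.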
Source: research/lf_hunt, research/hft_hunt backtests
Resolution · the insight (Exhibit 3)

The durable edge is calibrated variance plus survival, which is exactly what the scoring rewards.

Over-gamble
All-in at high leverage. A single 5% move against you triggers forced liquidation and elimination.
Calibrated band
Enough variance to climb the return rank, never enough to be liquidated. The book lives here.
Too quiet
Run flat to stay safe and the return rank leaves you mid-pack, cut at the next round.
The edge, stated plainly

Hold a high-quality, low-drawdown book, then dial variance up or down against live rank and time to the next cut. Most of the field sits in one of the two losing zones.
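
One way such a dial could look, as a sketch under stated assumptions. The rule below is not taken from engine/tournament_policy.py: the function name, the base and floor leverage, the linear blending and the example cut rank are all hypothetical. Only the 10x ceiling echoes the leverage used through the cuts.

```python
def variance_dial(rank, field_size, cut_rank, hours_to_cut,
                  base=6.0, floor=2.0, ceiling=10.0, round_hours=24.0):
    """Return a leverage target from live rank and time to the next cut.

    Illustrative rule: inside the cut, protect the position by easing towards
    `floor` as the cut approaches; outside it, add variance towards `ceiling`.
    """
    # Urgency rises from 0 at the start of the round to 1 at the cut.
    urgency = min(max(1.0 - hours_to_cut / round_hours, 0.0), 1.0)
    if rank <= cut_rank:
        cushion = (cut_rank - rank) / cut_rank             # 0 at the line, near 1 at the top
        target = base - (base - floor) * cushion * urgency
    else:
        shortfall = (rank - cut_rank) / (field_size - cut_rank)
        target = base + (ceiling - base) * shortfall * urgency
    return min(max(target, floor), ceiling)


# Hypothetical cut at rank 200 of 300, one hour before the cut.
ahead = variance_dial(rank=20, field_size=300, cut_rank=200, hours_to_cut=1.0)
behind = variance_dial(rank=280, field_size=300, cut_rank=200, hours_to_cut=1.0)
```

The output is always clamped between the floor and the ceiling, so the dial cannot push the book into the over-gamble zone on its own.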

Source: engine/tournament_policy.py (variance dial)
Approach (Exhibit 4)

A single repeatable loop, run hundreds of times, does the work.

01 · Search
Every plausible signal, instrument and time frame.
02 · Validate
Eight adversarial tests before anything trades.
03 · Construct
Equal-weight survivors, de-bias, cap concentration.
04 · Simulate
Score each leverage stance on a 300-player field.
05 · Govern
Hard limits outside the model, never raised by it.
06 · Iterate
Each result feeds the next question.
Why it matters

The same disciplined loop applies to every idea, so results are comparable and the audit trail is complete. The loop, not any single signal, is what we are presenting.

Source: research harness; scripts/portfolio_v3.py; engine/
Breadth (Exhibit 5)

We searched 98 candidate strategies across families, instruments and four time frames.

Family group             Examples
Trend                    CTA trend, momentum ignition
Mean reversion           z-score, band, spike reversal
Breakout & volatility    volatility breakout, channel
Flow & microstructure    flow momentum, order-book skew
Session & calendar       US-closed hours, London fix, macro event
Regularised ML           ridge, lasso, logistic, selection

Time frame    Candidates
5 minute      9
15 minute     11
1 hour        43
4 hour        35
Total         98
No hindsight cell-picking; walk-forward selection per fold; real spread on every fill.
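
Walk-forward selection per fold can be sketched as follows. This is a simplified illustration, not the project's harness: the `walk_forward_oos` name, the expanding training window and selection by mean return are assumptions.

```python
import numpy as np


def walk_forward_oos(returns, n_folds=5):
    """Pick a candidate on past folds only, then score it on the next fold.

    returns: array of shape (n_bars, n_candidates) of per-bar returns.
    Yields the stitched out-of-sample returns and the candidate picked per fold.
    """
    returns = np.asarray(returns, dtype=float)
    folds = np.array_split(np.arange(returns.shape[0]), n_folds)
    oos, picks = [], []
    for k in range(1, n_folds):
        train = np.concatenate(folds[:k])          # expanding window of past data
        test = folds[k]                            # the next, unseen fold
        best = int(np.argmax(returns[train].mean(axis=0)))
        picks.append(best)
        oos.append(returns[test, best])            # performance the picker never saw
    return np.concatenate(oos), picks


# Toy data: candidate 0 always earns, candidate 1 always loses.
toy = np.column_stack([np.full(10, 0.01), np.full(10, -0.01)])
oos_returns, chosen = walk_forward_oos(toy, n_folds=5)
```

Because each choice uses only earlier folds, the stitched series contains no return that influenced its own selection.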
Source: data/tournament/combined_meta.json (98 strategies, per-trade re-evaluation)
The filter (Exhibit 6)

Eight adversarial tests reduced 98 candidates to 22 genuinely diversified survivors.

98 candidates
Pass at least 6 of 8 tests
CPCV, overfit probability (PBO), regime, annual, stability, ablation, decay, trade cost
Correlation cull
Drop near-duplicates so the book is not one bet in many guises
22 deployed
  • CPCV (combinatorial purged cross-validation) and PBO (probability of backtest overfitting): is the edge real or a lucky path?
  • Regime and per-year checks: does it hold when conditions change?
  • Stability and ablation: do neighbours agree, does each piece earn its place?
  • Decay and cost: survive the entry lag and clear cost at its own frequency.
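
The two-stage filter above can be sketched as a pass-count gate followed by a greedy correlation cull. This is an assumption-laden illustration rather than the logic of scripts/validate_strats.py: the function name, the 0.8 correlation threshold and the visiting order are hypothetical.

```python
import numpy as np


def select_survivors(pass_matrix, returns, min_passes=6, max_corr=0.8):
    """Keep strategies that pass enough tests and are not near-duplicates.

    pass_matrix: (n_strategies, 8) booleans, one column per adversarial test.
    returns:     (n_bars, n_strategies) per-bar returns used for correlation.
    max_corr is a hypothetical threshold for "near-duplicate".
    """
    pass_matrix = np.asarray(pass_matrix, dtype=bool)
    returns = np.asarray(returns, dtype=float)

    passed = np.flatnonzero(pass_matrix.sum(axis=1) >= min_passes)
    corr = np.corrcoef(returns, rowvar=False)

    # Visit the strongest validators first so they win any duplicate contest.
    order = sorted(passed, key=lambda i: -pass_matrix[i].sum())
    kept = []
    for i in order:
        # A strategy with zero variance has NaN correlations and is dropped here.
        if all(abs(corr[i, j]) < max_corr for j in kept):
            kept.append(int(i))
    return kept


rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
rets = np.column_stack([a, 2.0 * a, b, c])      # column 1 duplicates column 0
passes = [[1] * 8, [1] * 7 + [0], [1] * 6 + [0] * 2, [1] * 5 + [0] * 3]
survivors = select_survivors(passes, rets)       # -> [0, 2]
```

Column 1 is culled as a duplicate of column 0, and column 3 fails the pass-count gate with only five of eight tests.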
Source: scripts/validate_strats.py, verify_survivors.py; data/tournament/final_universe.json
Result · the book (Exhibit 7)

The resulting book compounds steadily, at a Sharpe above three and a drawdown under two percent.

[Chart: equal-frequency book equity curve and underwater drawdown]
Equal-frequency book, full backtest at 1x. The flat start before August 2024 is data absence, not a flat strategy: not all sleeves have venue data before then.
3.1 · Sharpe, full sample (3.5 on the active window since Aug 2024)
2.5% · Annualised volatility
+31% · Total return, 1x, net of cost
1.6% · Maximum drawdown
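
For reference, the four headline statistics are conventionally computed from a per-period return series as sketched below. The function name is hypothetical, a zero risk-free rate is assumed, and the annualisation factor depends on the bar frequency, so this is not necessarily how scripts/portfolio_v3.py derives the figures above.

```python
import numpy as np


def book_metrics(returns, periods_per_year):
    """Sharpe, annualised volatility, total return and maximum drawdown.

    Assumes simple per-period returns net of cost and a zero risk-free rate.
    """
    r = np.asarray(returns, dtype=float)
    ann_vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    ann_ret = r.mean() * periods_per_year
    sharpe = ann_ret / ann_vol if ann_vol > 0 else float("nan")

    equity = np.cumprod(1.0 + r)                             # equity curve from 1.0
    peak = np.maximum.accumulate(np.maximum(equity, 1.0))    # running high-water mark
    max_drawdown = float(np.max(1.0 - equity / peak))        # deepest underwater point

    return {
        "sharpe": float(sharpe),
        "ann_vol": float(ann_vol),
        "total_return": float(equity[-1] - 1.0),
        "max_drawdown": max_drawdown,
    }


# Toy series: up 10%, down 50%, up 20%.
metrics = book_metrics([0.10, -0.50, 0.20], periods_per_year=252)
```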
Source: scripts/portfolio_v3.py, eq_freq book, post-cost at 1x leverage
The hard truth (Exhibit 8)

Even fully levered, a high-Sharpe book cannot win first place on return alone.

[Chart: win probabilities by posture under real composite scoring]
Monte-Carlo over a 300-player field using the real 70/15/10/5 composite scoring and a 30% stop-out.
What the simulation says

The quiet book reaches the final about 47% of the time, yet its probability of finishing first on return is effectively zero. An aggressive posture lifts P(first) only to roughly 1% while raising elimination risk. The return crown is a lottery; we decline to buy the ticket.
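
The shape of such a simulation can be sketched as below. This is a deliberately reduced stand-in for scripts/win_simulator.py: it ranks on one round's return only rather than the full composite, ignores the stop-out, and models rivals as zero-mean normal draws. The rival volatility, the cut rank, the 252-period annualisation and all names are assumptions; only the Sharpe of 3.1 and the 2.5% volatility are taken from the book statistics.

```python
import numpy as np


def simulate_posture(leverage, n_sims=2000, field_size=300, cut_rank=200,
                     daily_sharpe=3.1 / np.sqrt(252),
                     daily_vol_1x=0.025 / np.sqrt(252),
                     rival_vol=0.05, seed=0):
    """Estimate P(first on return) and P(inside the cut) for one round.

    Simplification: return rank only. rival_vol and cut_rank are hypothetical.
    """
    rng = np.random.default_rng(seed)
    sigma = daily_vol_1x * leverage               # our one-round volatility
    mu = daily_sharpe * sigma                     # our one-round expected return
    ours = rng.normal(mu, sigma, size=n_sims)
    rivals = rng.normal(0.0, rival_vol, size=(n_sims, field_size - 1))

    # Rank 1 means no rival beat us in that simulated round.
    rank = 1 + (rivals > ours[:, None]).sum(axis=1)
    return {
        "p_first": float((rank == 1).mean()),
        "p_inside_cut": float((rank <= cut_rank).mean()),
    }


quiet = simulate_posture(leverage=10.0)
aggressive = simulate_posture(leverage=25.0)
```

The numbers this toy produces depend entirely on the assumed rival distribution, so they illustrate the mechanism and should not be read against the figures quoted above.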

Source: scripts/win_simulator.py, real composite scoring, base field
Strategy (Exhibit 9)

We therefore optimise for the prizes our edge can actually win.

Prize                    Basis                      Our fit   Posture
First place, $30k        Return rank                Low       Decline the lottery; keep a live option only
Top-25 pool, $100k       Whole-event consistency    Medium    Survive every cut, place steadily
Best Sharpe              Whole-event Sharpe         High      A Sharpe above three is our natural edge
Best Technology, $10k    Judged                     High      An AI-run research programme, on the record
Allocation of effort

Survival-first execution protects the consistency and Sharpe prizes through every round; the technology entry is a separate, high-probability target judged on the work itself.

Source: win_simulator.py posture analysis; prize schedule
Risk doctrine (Exhibit 10)

Leverage is calibrated per round: ten times through the cuts, twenty-five reserved for the final.

[Chart: elimination risk and worst intra-round drawdown versus leverage]
Elimination probability and worst 24-hour drawdown versus leverage, 22-strategy book.
Round                 Leverage   Rationale
R1 to R3              10x        Survival ceiling; elimination risk near zero
Final, days 4 to 5    25x        Push for rank once survival no longer binds
Binding constraint

At 10x the book's forced-liquidation probability is effectively zero. The binding limit is our own 10% kill latch, which sits outside the model and can only de-risk.
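
A latch that can only de-risk might be structured as in this sketch. It is not the code in engine/risk.py: the class name, the halve and flatten thresholds and the use of account drawdown as the trigger are assumptions. Only the 10% kill level and the 10x cap come from this page.

```python
from dataclasses import dataclass


@dataclass
class RiskLatch:
    halve_at: float = 0.04     # hypothetical threshold
    flatten_at: float = 0.07   # hypothetical threshold
    kill_at: float = 0.10      # the 10% kill latch named in the text
    scale: float = 1.0         # multiplier applied to the model's leverage
    killed: bool = False

    def update(self, drawdown: float) -> float:
        """Feed the current drawdown (as a fraction) and return the scale."""
        if drawdown >= self.kill_at:
            self.killed = True
        if self.killed or drawdown >= self.flatten_at:
            target = 0.0
        elif drawdown >= self.halve_at:
            target = 0.5
        else:
            target = 1.0
        # Latched: the scale only ever steps down, never back up.
        self.scale = min(self.scale, target)
        return self.scale

    def apply(self, model_leverage: float, cap: float = 10.0) -> float:
        """Clamp the model's request to the hard cap, then apply the latch."""
        return min(max(model_leverage, 0.0), cap) * self.scale


latch = RiskLatch()
steps = [latch.update(dd) for dd in (0.01, 0.05, 0.01, 0.11)]   # 1.0, 0.5, 0.5, 0.0
```

The model's leverage request enters only through `apply`, where it can be reduced but never raised above the cap, which is the property the doctrine relies on.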

Source: scripts/leverage_per_round.py; engine/tournament_mc.py; risk caps in engine/risk.py
Partner technology · Anthropic (Exhibit 11)

The entire research programme was run by Claude, orchestrating specialised agents in parallel.

Claude Opus · Orchestrator
Research
Parallel sub-agents sweep families, instruments and frequencies
Adversary
Agents that try to refute each finding before it ships
Simulation
A 300-player field that plays the real tournament
Synthesis
Reconcile, re-derive and write the report
Converge → a verified, self-checked book
Capability on show

Claude proposed, tested, refuted and iterated. It caught its own data bias, conceded errors when challenged, and re-derived the answer. Many agents ran this loop at once, which is what made the search this thorough in the time available.

Source: Claude Code session history; commit log
Technologies utilised (Exhibit 12)

Four technologies, each with a defined job.

Anthropic · Claude
Research engine
Claude Opus via Claude Code ran the multi-agent research, validation and simulation, and wrote the full codebase.
MetaTrader 5
Execution
The live order path. A protocol-based adapter runs in paper mode today and swaps to the live account on login.
Python stack
Quant core
NumPy, pandas, SciPy and scikit-learn power the signals, the walk-forward harness and the Monte-Carlo.
Cloudflare Pages
This deck
The presentation deploys as a single static file, generated by the same system that built the desk.
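
The protocol-based adapter described under MetaTrader 5 could follow a pattern like the one below. This is a generic sketch of the idea, not engine/mt5_adapter.py: the `Broker` protocol, `PaperBroker` and `rebalance` are hypothetical names, and no MetaTrader 5 calls are shown.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Broker(Protocol):
    """The interface the desk would depend on; a live adapter would satisfy it too."""

    def send_order(self, symbol: str, units: float) -> None: ...

    def position(self, symbol: str) -> float: ...


@dataclass
class PaperBroker:
    """In-memory stand-in that fills every order instantly and in full."""

    positions: dict[str, float] = field(default_factory=dict)

    def send_order(self, symbol: str, units: float) -> None:
        self.positions[symbol] = self.positions.get(symbol, 0.0) + units

    def position(self, symbol: str) -> float:
        return self.positions.get(symbol, 0.0)


def rebalance(broker: Broker, targets: dict[str, float]) -> None:
    """Trade only the difference between target and current position."""
    for symbol, target in targets.items():
        delta = target - broker.position(symbol)
        if delta != 0.0:
            broker.send_order(symbol, delta)


paper = PaperBroker()
rebalance(paper, {"EURUSD": 1.5})
rebalance(paper, {"EURUSD": 0.5})
```

Because `rebalance` depends only on the protocol, swapping the paper object for a live one would not require changes to the calling code.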
Source: repository; requirements.txt; engine/mt5_adapter.py
Data usage (Exhibit 13)

Ground truth was rebuilt from 20.8 GB of five-level order-book tape.

  • 20.8 GB of five-level L2 tick data for FX and metals became bars, microstructure features and an executable cost model.
  • A 730-day crypto history covered BTC, ETH, SOL and XRP, the assets the venue tape excludes.
  • Cost was re-measured from the book itself. The only real cost is crossing the spread once per round trip, and it is small, so signal quality and survival are the true constraints rather than turnover.
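
Measuring cost from the book can be illustrated with top-of-book quotes alone. This sketch assumes the round-trip cost equals one full quoted spread and summarises it by the median; the function names are hypothetical, and the deeper four levels of the tape are ignored here.

```python
import numpy as np


def round_trip_cost_bps(best_bid, best_ask):
    """Median quoted spread in basis points of mid, taken as the round-trip cost."""
    bid = np.asarray(best_bid, dtype=float)
    ask = np.asarray(best_ask, dtype=float)
    mid = 0.5 * (bid + ask)
    spread_bps = (ask - bid) / mid * 1e4
    return float(np.median(spread_bps))


def net_edge_bps(gross_edge_bps, cost_bps):
    """Per-trade edge left after paying the spread once per round trip."""
    return gross_edge_bps - cost_bps


# Illustrative quotes only, not values from the venue tape.
cost = round_trip_cost_bps([1.0999, 1.0998, 1.1000], [1.1001, 1.1000, 1.1002])
edge_after_cost = net_edge_bps(gross_edge_bps=3.0, cost_bps=cost)
```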
Source: research/venue (bar and microstructure build); rerun_measured_costs.py
A working demonstration (Exhibit 14)

A complete desk runs today in paper mode, with hard risk limits outside the model.

Book
22 strategies
Equal-weight by frequency, diversified across FX, metals and crypto.
Risk
Hard caps
Halve, flatten and kill latches sit outside the model. They can de-risk but never raise a limit.
Policy
Variance dial
Leverage reacts to live rank and time to the next cut.
Execution
Live loop
Bar-final signals, restart reconciliation, feed-health checks, 70 tests green.
Simulator
Win model
Real composite scoring on a 300-player field, so leverage is a calculation.
Demo
It runs now
The whole loop drives a paper account today, with no live login required.
Source: engine/live_loop.py, desk.py, execution.py; tests/ (paper mode)
Traction (Exhibit 15)

The loop produced a substantial, fully tested codebase.

66,256 · Lines of Python
289 · Modules
341 · Tests passing
98→22 · Strategies searched, deployed
300 · Player tournament model
67 · Commits on the record
All figures are exact counts from the repository and test suite as of 18 June 2026. Nothing on this page is estimated.
Source: repository, pytest, git rev-list
Recommendation

Run the validated book survival-first, adapt leverage to live rank, and enter for Best Technology.

Rounds 1 to 3     10x, survive   Protect Sharpe and drawdown rank
Final 48 hours    Up to 25x      Push rank once survival is secure
By 24 June        Submit         Private repo, this deck, live demo
Kit (Keith) So · Principal Researcher & Lead Engineer