Private & confidential

Quanthack 2026 · Systematic trading

HELM

An AI-built trading desk, and the research behind it.

Prepared by: Kit (Keith) So, Principal Researcher & Lead Engineer

Date: 18 June 2026, for rounds 21–24 June

Built with: Claude Opus, multi-agent, via Claude Code


Executive summary

An AI research loop turned 20 GB of market tape into a governed, validated trading book.

01 · The book

98 strategies searched, 22 deployed

A diversified intraday book across FX, metals and crypto. Sharpe above three, maximum drawdown under two percent, all measured net of realistic costs.

02 · The honest odds

Built for the prizes we can win

Simulation shows the book reaches the final roughly half the time, but cannot win first place on return alone. So we target consistency, Sharpe and the technology award.

03 · The method

Run by Claude, multi-agent

66,256 lines of Python across 289 modules, with 341 passing tests. Claude orchestrated the research programme itself, proposing, testing and refuting its own findings.

Governing thought

The durable edge in a rank tournament is not a magic price signal. It is calibrated variance plus survival, discovered and validated by a research loop that runs hundreds of times.

Source: portfolio_v3.py; win_simulator.py; repository

Context · the competition (Exhibit 1)

We are ranked against 299 rivals each round, so the objective is return rank subject to survival.

Scoring component	Weight	Implication
Return rank	70%	Relative return drives the score
Drawdown rank	15%	Shallow drawdowns are rewarded
Sharpe rank	10%	Risk-adjusted quality counts
Risk discipline	5%	Penalties for over-leverage

300 → 100

Three 24-hour rounds cut the field

48 h

Final ranks the survivors

30%

Margin stop-out; forced liquidation eliminates you

30x

Maximum leverage permitted

So what

Return rank dominates, but a single forced liquidation ends the campaign. The optimisation is return rank conditioned on never being eliminated.
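A minimal sketch of how a composite with these weights might be computed. It assumes each component is converted to a percentile rank within the field (1.0 is best) and that an eliminated player scores zero; both are assumptions, as are the function names, and the official rules may map ranks to points differently.

```python
import numpy as np

# Weights taken from the scoring table above.
WEIGHTS = {"ret": 0.70, "dd": 0.15, "sharpe": 0.10, "risk": 0.05}


def pct_rank(values, higher_is_better=True):
    """Map values to percentile ranks in [0, 1], where 1.0 is best."""
    values = np.asarray(values, dtype=float)
    order = values.argsort()
    if not higher_is_better:
        order = order[::-1]
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks / max(len(values) - 1, 1)


def composite_score(ret, max_dd, sharpe, risk_score, eliminated):
    """Weighted rank composite; eliminated players score zero (assumption)."""
    score = (
        WEIGHTS["ret"] * pct_rank(ret)
        + WEIGHTS["dd"] * pct_rank(max_dd, higher_is_better=False)
        + WEIGHTS["sharpe"] * pct_rank(sharpe)
        + WEIGHTS["risk"] * pct_rank(risk_score)
    )
    return np.where(np.asarray(eliminated, dtype=bool), 0.0, score)
```

In a three-player toy field, the player with the best return but middling drawdown and Sharpe still outscores the player who leads on every other component, which is the sense in which return rank dominates.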

Source: docs/PRD.md, official rules verified 16 June 2026

Complication (Exhibit 2)

Under realistic spreads and a one-bar delay, most price-prediction alpha proved fragile or absent.

Frictionless backtest

Flatters everything

Ignore the spread and add no execution delay, and almost any signal looks profitable.

Honest backtest

Ours did not

Real spread on every fill, one-bar entry delay, cost measured from the order book itself.

We tested signal families across FX majors, metals and crypto at five time frames.
Most edges were thin or vanished entirely once costs and delay were applied honestly.
That result was the most useful of the first week. It forced the real question: if the signal edge is thin, where does the win come from?
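The difference between the two backtests can be sketched in a few lines. This is an illustration under stated assumptions, not the code in research/lf_hunt: the function name, the fixed half-spread and the position convention are all hypothetical.

```python
import numpy as np


def backtest(signal, mid, half_spread, delay_bars=1):
    """Per-bar P&L of a position signal in {-1, 0, +1}.

    The position held over bar t -> t+1 is the signal from `delay_bars`
    bars earlier, and every change in position pays the half-spread on
    the units traded, so a full round trip crosses the spread once.
    """
    signal = np.asarray(signal, dtype=float)
    mid = np.asarray(mid, dtype=float)
    # Delay the signal: nothing is held until the delayed signal arrives.
    pos = np.concatenate([np.zeros(delay_bars), signal])[: len(signal)]
    # Mid-to-mid P&L of the held position over each bar.
    gross = pos[:-1] * np.diff(mid)
    # Units traded at each bar, charged at half the quoted spread.
    traded = np.abs(np.diff(np.concatenate([[0.0], pos])))
    cost = traded[:-1] * half_spread
    return gross - cost
```

Calling it with `half_spread=0.0` and `delay_bars=0` gives the frictionless version; the honest version uses a measured half-spread and `delay_bars=1`.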

Source: research/lf_hunt, research/hft_hunt backtests

Resolution · the insight (Exhibit 3)

The durable edge is calibrated variance plus survival, which is exactly what the scoring rewards.

Over-gamble

All-in at high leverage. A single 5% move against you triggers forced liquidation and elimination.

Calibrated band

Enough variance to climb the return rank, never enough to be liquidated. The book lives here.

Too quiet

Run flat to stay safe and the return rank leaves you mid-pack, cut at the next round.

The edge, stated plainly

Hold a high-quality, low-drawdown book, then dial variance up or down against live rank and time to the next cut. Most of the field sits in one of the two losing zones.
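One way such a dial could look, as a hedged sketch only. The thresholds, the floor and the linear ramp are illustrative assumptions and are not taken from engine/tournament_policy.py; the 10x default ceiling matches the leverage the deck states for the cut rounds.

```python
def variance_dial(rank, cut_size, hours_to_cut, round_hours=24.0,
                  floor=2.0, ceiling=10.0):
    """Pick a leverage stance from live rank and time to the next cut.

    Comfortably inside the cut: de-risk to the floor.
    Inside the cut but not safe: hold the midpoint.
    Outside the cut: ramp towards the ceiling as the cut approaches.
    """
    midpoint = 0.5 * (floor + ceiling)
    if rank <= 0.5 * cut_size:
        return floor
    if rank <= cut_size:
        return midpoint
    # Fraction of the round still to run, clipped to [0, 1].
    remaining = min(max(hours_to_cut / round_hours, 0.0), 1.0)
    return midpoint + (1.0 - remaining) * (ceiling - midpoint)
```

The output is a requested leverage only; in this design any hard risk limit would still sit outside the dial and cap whatever it asks for.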

Source: engine/tournament_policy.py (variance dial)

Approach (Exhibit 4)

A single repeatable loop, run hundreds of times, does the work.

01

Search

Every plausible signal, instrument and time frame.

02

Validate

Eight adversarial tests before anything trades.

03

Construct

Equal-weight survivors, de-bias, cap concentration.

04

Simulate

Score each leverage stance on a 300-player field.

05

Govern

Hard limits outside the model, never raised by it.

06

Iterate

Each result feeds the next question.

Why it matters

The same disciplined loop applies to every idea, so results are comparable and the audit trail is complete. The loop, not any single signal, is what we are presenting.

Source: research harness; scripts/portfolio_v3.py; engine/

Breadth (Exhibit 5)

We searched 98 candidate strategies across families, instruments and five time frames.

Family group	Examples
Trend	CTA trend, momentum ignition
Mean reversion	z-score, band, spike reversal
Breakout & volatility	volatility breakout, channel
Flow & microstructure	flow momentum, order-book skew
Session & calendar	US-closed hours, London fix, macro event
Regularised ML	ridge, lasso, logistic, selection

Time frame	Candidates
5 minute	9
15 minute	11
1 hour	43
4 hour	35
Total	98

No hindsight cell-picking; walk-forward selection per fold; real spread on every fill.
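Walk-forward selection per fold can be illustrated as follows. This is a simplified sketch: the function name, the rolling window lengths and the use of mean return as the selection metric are assumptions, not the harness's actual rule.

```python
import numpy as np


def walk_forward_select(returns, train_len, test_len):
    """Pick the best candidate on each training window, score it out of sample.

    `returns` is a (bars x candidates) array of per-bar net returns.
    Selection uses only the window that precedes each test fold, so no
    test-period information leaks into the choice.
    """
    returns = np.asarray(returns, dtype=float)
    picks, oos = [], []
    start = 0
    while start + train_len + test_len <= len(returns):
        train = returns[start : start + train_len]
        test = returns[start + train_len : start + train_len + test_len]
        best = int(np.argmax(train.mean(axis=0)))  # in-sample winner
        picks.append(best)
        oos.append(test[:, best])  # its out-of-sample returns
        start += test_len  # roll forward one fold
    stitched = np.concatenate(oos) if oos else np.array([])
    return picks, stitched
```

The stitched out-of-sample series is the honest track record: when the best candidate changes, the selection lags by one fold and the series pays for it, which hindsight cell-picking would hide.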

Source: data/tournament/combined_meta.json (98 strategies, per-trade re-evaluation)

The filter (Exhibit 6)

Eight adversarial tests reduced 98 candidates to 22 genuinely diversified survivors.

98 candidates

Pass at least 6 of 8 tests

CPCV, overfit probability (PBO), regime, annual, stability, ablation, decay, trade cost

Correlation cull

Drop near-duplicates so the book is not one bet in many guises

22 deployed

CPCV and PBO to test whether the edge is real or a lucky path.
Regime and per-year checks: does it hold when conditions change?
Stability and ablation: do neighbours agree, does each piece earn its place?
Decay and cost: survive the entry lag and clear cost at its own frequency.
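The two-stage filter above (a pass-count gate, then a correlation cull) can be sketched as below. The greedy ordering and the 0.7 correlation threshold are illustrative assumptions; the source does not state how near-duplicates are defined.

```python
import numpy as np


def select_survivors(passed, returns, min_passes=6, max_corr=0.7):
    """Apply the pass-count gate, then a greedy correlation cull.

    `passed` is a (candidates x tests) boolean array and `returns` a
    (bars x candidates) array. `max_corr` is an illustrative threshold.
    """
    passed = np.asarray(passed, dtype=bool)
    returns = np.asarray(returns, dtype=float)
    gate = np.flatnonzero(passed.sum(axis=1) >= min_passes)
    # Visit gated candidates from most to fewest tests passed.
    order = gate[np.argsort(-passed[gate].sum(axis=1), kind="stable")]
    corr = np.corrcoef(returns, rowvar=False)
    kept = []
    for i in order:
        # Keep a candidate only if it is not a near-duplicate of one already kept.
        if all(abs(corr[i, j]) < max_corr for j in kept):
            kept.append(int(i))
    return kept
```

Visiting the strongest candidates first means that when two strategies are the same bet in different guises, the one with the better validation record is the one that stays.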

Source: scripts/validate_strats.py, verify_survivors.py; data/tournament/final_universe.json

Result · the book (Exhibit 7)

The resulting book compounds steadily, at a Sharpe above three and a drawdown under two percent.

Equal-frequency book equity curve and underwater drawdown

Equal-frequency book, full backtest at 1x. The flat start before August 2024 is data absence, not a flat strategy: not all sleeves have venue data before then.

3.1

Sharpe, full sample (3.5 on the active window since Aug 2024)

2.5%

Annualised volatility

+31%

Total return, 1x, net of cost

1.6%

Maximum drawdown
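For reference, the four headline statistics are conventionally computed as below. This is a generic sketch with a zero risk-free rate and simple per-period returns, both assumptions; it is not the code in scripts/portfolio_v3.py.

```python
import numpy as np


def book_stats(returns, periods_per_year):
    """Annualised Sharpe, annualised volatility, total return and max drawdown.

    `returns` are simple per-period net returns; the risk-free rate is
    taken as zero.
    """
    r = np.asarray(returns, dtype=float)
    vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    sharpe = r.mean() * periods_per_year / vol
    equity = np.cumprod(1.0 + r)
    # Running peak, including the starting equity of 1.0.
    peak = np.maximum.accumulate(np.concatenate([[1.0], equity]))[1:]
    max_dd = float(np.max(1.0 - equity / peak))
    return {
        "sharpe": float(sharpe),
        "vol": float(vol),
        "total_return": float(equity[-1] - 1.0),
        "max_drawdown": max_dd,
    }
```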

Source: scripts/portfolio_v3.py, eq_freq book, post-cost at 1x leverage

The hard truth (Exhibit 8)

Even fully levered, a high-Sharpe book cannot win first place on return alone.

Win probabilities by posture under real composite scoring

Monte-Carlo over a 300-player field using the real 70/15/10/5 composite scoring and a 30% stop-out.

What the simulation says

The quiet book reaches the final about 47% of the time, yet its probability of finishing first on return is effectively zero. Pushing to an aggressive posture only lifts P(first) to roughly 1%, while raising elimination risk. The return crown is a lottery; we decline to buy the ticket.
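A heavily simplified sketch of this kind of Monte-Carlo follows. It is not scripts/win_simulator.py: it simulates one round, ranks on return only rather than the full composite, and draws rivals as zero-mean Gaussian noise. The default daily mean and volatility are rough conversions of the 3.1 Sharpe and 2.5% annualised volatility assuming 252 periods a year; the rival volatility and the loss level treated as elimination are purely illustrative.

```python
import numpy as np


def simulate_posture(leverage, n_sims=2000, field=300, finalists=100,
                     daily_mean=0.0003, daily_vol=0.0016,
                     rival_vol=0.05, stop_loss=0.70, seed=0):
    """Estimate P(reach final) and P(first on return) for one leverage stance.

    Simplifications (all assumptions): one round, Gaussian returns,
    ranking on return only, rivals drawn as zero-mean noise, and
    elimination when the round loss exceeds `stop_loss` of equity.
    """
    rng = np.random.default_rng(seed)
    ours = leverage * rng.normal(daily_mean, daily_vol, n_sims)
    rivals = rng.normal(0.0, rival_vol, (n_sims, field - 1))
    alive = ours > -stop_loss
    # Our rank is 1 + the number of rivals who beat us.
    rank = 1 + (rivals > ours[:, None]).sum(axis=1)
    return {
        "p_final": float(np.mean(alive & (rank <= finalists))),
        "p_first": float(np.mean(alive & (rank == 1))),
    }
```

Sweeping `leverage` over a grid and comparing the two probabilities is what turns the choice of posture into a calculation; the numbers this toy version produces depend entirely on the assumed rival distribution and should not be read as the deck's results.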

Source: scripts/win_simulator.py, real composite scoring, base field

Strategy (Exhibit 9)

We therefore optimise for the prizes our edge can actually win.

Prize	Basis	Our fit	Posture
First place, $30k	Return rank	Low	Decline the lottery; keep a live option only
Top-25 pool, $100k	Whole-event consistency	Medium	Survive every cut, place steadily
Best Sharpe	Whole-event Sharpe	High	A Sharpe above three is our natural edge
Best Technology, $10k	Judged	High	An AI-run research programme, on the record

Allocation of effort

Survival-first execution protects the consistency and Sharpe prizes through every round; the technology entry is a separate, high-probability target judged on the work itself.

Source: win_simulator.py posture analysis; prize schedule

Risk doctrine (Exhibit 10)

Leverage is calibrated per round: ten times through the cuts, twenty-five reserved for the final.

Elimination risk and worst intra-round drawdown versus leverage

Elimination probability and worst 24-hour drawdown versus leverage, 22-strategy book.

Round	Leverage	Rationale
R1 to R3	10x	Survival ceiling; elimination risk near zero
Final, days 4 to 5	25x	Push for rank once survival no longer binds

Binding constraint

At 10x the book's forced-liquidation probability is effectively zero. The binding limit is our own 10% kill latch, which sits outside the model and can only de-risk.
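A latch that can only de-risk might be structured like this. It is a sketch, not engine/risk.py: it assumes the 10% kill level is measured as drawdown from peak equity, and the halve and flatten levels are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class RiskLatch:
    """One-way de-risking ladder driven by drawdown from peak equity.

    The 10% kill level comes from the text; the halve and flatten
    levels are illustrative assumptions.
    """
    halve_at: float = 0.04
    flatten_at: float = 0.07
    kill_at: float = 0.10
    peak: float = 0.0
    scale: float = 1.0  # multiplier applied to the model's requested leverage
    killed: bool = False

    def update(self, equity: float) -> float:
        self.peak = max(self.peak, equity)
        drawdown = 1.0 - equity / self.peak if self.peak > 0 else 0.0
        if drawdown >= self.kill_at:
            self.killed = True
        if self.killed or drawdown >= self.flatten_at:
            target = 0.0
        elif drawdown >= self.halve_at:
            target = 0.5
        else:
            target = 1.0
        # The latch can only lower the scale, never raise it.
        self.scale = min(self.scale, target)
        return self.scale

    def allowed_leverage(self, requested: float, cap: float) -> float:
        """Leverage actually permitted: capped, then scaled by the latch."""
        return min(requested, cap) * self.scale
```

Because `allowed_leverage` takes the minimum of the request and the cap before scaling, the model can ask for less risk than the cap but has no path to more.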

Source: scripts/leverage_per_round.py; engine/tournament_mc.py; risk caps in engine/risk.py

Partner technology · Anthropic (Exhibit 11)

The entire research programme was run by Claude, orchestrating specialised agents in parallel.

Claude Opus · Orchestrator

Research

Parallel sub-agents sweep families, instruments and frequencies

Adversary

Agents that try to refute each finding before it ships

Simulation

A 300-player field that plays the real tournament

Synthesis

Reconcile, re-derive and write the report

Converge → a verified, self-checked book

Capability on show

Claude proposed, tested, refuted and iterated. It caught its own data bias, conceded errors when challenged, and re-derived the answer. Many agents ran this loop at once, which is what made the search this thorough in the time available.

Source: Claude Code session history; commit log

Technologies utilised (Exhibit 12)

Four technologies, each with a defined job.

Anthropic · Claude

Research engine

Claude Opus via Claude Code ran the multi-agent research, validation and simulation, and wrote the full codebase.

MetaTrader 5

Execution

The live order path. A protocol-based adapter runs in paper mode today and swaps to the live account on login.

Python stack

Quant core

NumPy, pandas, SciPy and scikit-learn power the signals, the walk-forward harness and the Monte-Carlo.

Cloudflare Pages

This deck

The presentation deploys as a single static file, generated by the same system that built the desk.
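The protocol-based adapter mentioned under MetaTrader 5 can be illustrated with Python's `typing.Protocol`. The interface below is hypothetical (the method names are not taken from engine/mt5_adapter.py) and no MetaTrader calls are shown; a live class would implement the same two methods.

```python
from typing import Protocol


class Broker(Protocol):
    """The order path the desk depends on; any conforming class can be swapped in."""

    def submit(self, symbol: str, units: float, price: float) -> float: ...
    def position(self, symbol: str) -> float: ...


class PaperBroker:
    """In-memory fills at the quoted price; no network, no login."""

    def __init__(self) -> None:
        self._positions: dict[str, float] = {}

    def submit(self, symbol: str, units: float, price: float) -> float:
        self._positions[symbol] = self._positions.get(symbol, 0.0) + units
        return price  # paper mode assumes the fill happens at the quote

    def position(self, symbol: str) -> float:
        return self._positions.get(symbol, 0.0)


def rebalance(broker: Broker, symbol: str, target: float, price: float) -> float:
    """Trade the difference between the target and the current position."""
    delta = target - broker.position(symbol)
    if delta != 0.0:
        broker.submit(symbol, delta, price)
    return delta
```

Since `rebalance` depends only on the protocol, switching from paper to live is a matter of passing a different object, with no change to the trading logic.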

Source: repository; requirements.txt; engine/mt5_adapter.py

Data usage (Exhibit 13)

Ground truth was rebuilt from 20.8 GB of five-level order-book tape.

20.8 GB of five-level L2 tick data for FX and metals became bars, microstructure features and an executable cost model.
A 730-day crypto history covered BTC, ETH, SOL and XRP, the assets the venue tape excludes.
Cost was re-measured from the book itself. The only real cost is crossing the spread once per round trip, and it is small, so signal quality and survival are the true constraints rather than turnover.
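Turning quotes into bars with a measured cost can be sketched with pandas. The column names `bid` and `ask`, the use of top-of-book only, and the median as the per-bar spread measure are assumptions about the tape and the method, not details from research/venue.

```python
import pandas as pd


def build_bars(ticks: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Aggregate top-of-book quotes into bars with a measured spread cost.

    `ticks` needs a DatetimeIndex and `bid` / `ask` columns (column names
    are assumptions about the tape layout).
    """
    mid = (ticks["bid"] + ticks["ask"]) / 2.0
    spread = ticks["ask"] - ticks["bid"]
    bars = mid.resample(freq).ohlc()
    # Median quoted spread per bar, expressed in basis points of the mid.
    bars["spread_bps"] = (spread / mid).resample(freq).median() * 1e4
    return bars.dropna()
```

The `spread_bps` column is what a backtest would charge per round trip in place of an assumed flat cost.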

Source: research/venue (bar and microstructure build); rerun_measured_costs.py

A working demonstration (Exhibit 14)

A complete desk runs today in paper mode, with hard risk limits outside the model.

Book

22 strategies

Equal-weight by frequency, diversified across FX, metals and crypto.

Risk

Hard caps

Halve, flatten and kill latches sit outside the model. They can only de-risk; the model can never raise a limit.

Policy

Variance dial

Leverage reacts to live rank and time to the next cut.

Execution

Live loop

Bar-final signals, restart reconciliation, feed-health checks, 70 tests green.

Simulator

Win model

Real composite scoring on a 300-player field, so leverage is a calculation.

Demo

It runs now

The whole loop drives a paper account today, with no live login required.

Source: engine/live_loop.py, desk.py, execution.py; tests/ (paper mode)

Traction (Exhibit 15)

The loop produced a substantial, fully tested codebase.

66,256

Lines of Python

289

Modules

341

Tests passing

98→22

Strategies searched, deployed

300

Player tournament model

67

Commits on the record

All figures are exact counts from the repository and test suite as of 18 June 2026. Nothing on this page is estimated.

Source: repository, pytest, git rev-list


Recommendation

Run the validated book survival-first, adapt leverage to live rank, and enter for Best Technology.

Rounds 1 to 3	10x, survive	Protect Sharpe and drawdown rank
Final 48 hours	Up to 25x	Push rank once survival is secure
By 24 June	Submit	Private repo, this deck, live demo

Kit (Keith) So · Principal Researcher & Lead Engineer · HELM · Quanthack 2026