Zeno AI Agent — SMC Benchmark Results
A public benchmark comparing Zeno AI Agent against the leading general-purpose large language models on 50 carefully constructed Smart Money Concepts (SMC) trading scenarios. The methodology, chart data, and prompts are provided so the results can be reproduced. Updated quarterly.
Zeno AI Agent scored 45/50 (90%) on the SMC benchmark, versus Claude 4.7 at 32/50 (64%), GPT-5 at 27/50 (54%), and Gemini 3 at 27/50 (54%). The gap reflects domain specialization: Zeno is purpose-trained on SMC pattern recognition, whereas general-purpose LLMs are not. The full methodology is published below, and results are updated quarterly.
Category-by-category results
Methodology
How the benchmark works
Each test presents a real chart screenshot from a major asset (BTCUSDT, ETHUSDT, XAUUSD, EURUSD, NAS100) at a specific timestamp, alongside a question testing one Smart Money Concepts skill. The model under test must answer in a structured format we can score deterministically (multiple-choice or coordinate output for boundaries).
Each scenario is graded against ground truth established by three independent SMC traders (Zeno tier subscribers who volunteered as graders), each of whom marked the chart before seeing any model's answer. A scenario counts as "correct" only when the model's answer matches the graders' consensus ground truth.
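The grading rule above can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code. It assumes multiple-choice answers encoded as strings, and it assumes "consensus" means a strict majority of the three graders, since the text does not say whether consensus must be unanimous. The names `consensus_label` and `is_correct` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(grader_labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of graders, else None.

    Assumption: consensus = strict majority (e.g. 2 of 3 graders agree).
    """
    label, votes = Counter(grader_labels).most_common(1)[0]
    return label if votes * 2 > len(grader_labels) else None


def is_correct(model_answer: str, grader_labels: Sequence[str]) -> bool:
    """A scenario is correct only if the model matches the consensus label."""
    truth = consensus_label(grader_labels)
    # With no consensus there is no ground truth, so nothing can match it.
    return truth is not None and model_answer == truth
```

Under this assumption, a scenario on which all three graders disagree has no usable ground truth and would need to be reviewed or replaced rather than scored.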
Five test categories
- Order Block Identification (12 scenarios): identify the precise candle and price boundaries of a tradeable order block, given an impulsive structure break. Tests strict-formation rules vs permissive zones.
- FVG Detection & Mitigation Status (10 scenarios): identify Fair Value Gaps that have NOT yet been mitigated. Tests three-candle pattern recognition and historical price-tracking.
- Liquidity Sweep Recognition (8 scenarios): identify whether a wick that exceeded a prior swing high/low qualifies as a stop-hunt sweep with rejection. Tests reading institutional intent from candle anatomy.
- BOS / CHoCH Classification (10 scenarios): determine whether a specific structure break is a Break of Structure (continuation) or Change of Character (reversal). Tests prevailing-trend context.
- Multi-Timeframe Confluence (10 scenarios): given a chart array showing three timeframes simultaneously, identify which entry zones have alignment across all three. Tests synthesis across multiple timeframes.
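To make one of these categories concrete, here is a hedged sketch of the three-candle check behind FVG Detection & Mitigation Status. It uses a common definition (a bullish gap exists when the third candle's low is above the first candle's high, and the mirror image for bearish) and treats any later candle that trades into the gap as mitigation. Both choices are assumptions: SMC practitioners differ on whether a touch, a 50% fill, or a full fill counts, and this is not Zeno's actual detection logic. The `Candle` and `Gap` types and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candle:
    high: float
    low: float


@dataclass
class Gap:
    index: int      # index of the middle candle of the three-candle pattern
    bottom: float   # lower edge of the gap
    top: float      # upper edge of the gap
    bullish: bool


def find_fvgs(candles: List[Candle]) -> List[Gap]:
    """Scan every three-candle window for a gap between candle 1 and candle 3."""
    gaps = []
    for i in range(1, len(candles) - 1):
        first, third = candles[i - 1], candles[i + 1]
        if third.low > first.high:        # bullish: untraded space above candle 1
            gaps.append(Gap(i, first.high, third.low, True))
        elif third.high < first.low:      # bearish: untraded space below candle 1
            gaps.append(Gap(i, third.high, first.low, False))
    return gaps


def is_mitigated(gap: Gap, candles: List[Candle]) -> bool:
    """Assumption: any later candle whose range overlaps the gap mitigates it."""
    for candle in candles[gap.index + 2:]:
        if candle.low <= gap.top and candle.high >= gap.bottom:
            return True
    return False


def unmitigated_fvgs(candles: List[Candle]) -> List[Gap]:
    """Gaps that no subsequent candle has traded back into."""
    return [g for g in find_fvgs(candles) if not is_mitigated(g, candles)]
```

A rule like this is easy to state on OHLC data. The benchmark scenarios instead present chart screenshots, so a model has to recover the same information visually.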
Why generic LLMs underperform
General-purpose LLMs are trained on broad web data where SMC content is a small minority of trading material — most text discusses traditional indicators (RSI, MACD, moving averages). They learn SMC concepts approximately, with significant variance between definitions. They also struggle with the precise spatial reasoning required to identify a specific candle's boundaries on a chart screenshot.
Zeno is a specialist. Pattern detection runs through Pine-Script-derived SMC logic (deterministic), while natural-language explanations are produced by an LLM fine-tuned on a curated SMC corpus. This architecture means Zeno's pattern calls come from explicit rules, not from a model's memorized SMC knowledge.
Reproducibility
- The 50-scenario test set (charts + ground-truth labels, CC BY-NC 4.0) is available on request — contact us for access to the full dataset
- Prompts used for each model are documented per-scenario, including the exact temperature, system prompt, and image preprocessing
- Each model is run 3 times per scenario; the consensus answer across the three runs is the one that is scored
- Tested model versions: Zeno (Q2 2026 release), Claude 4.7 (default), GPT-5 (default), Gemini 3 (default)
- Anyone can reproduce the benchmark within a few hours using their own API access
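For anyone reproducing the numbers, the run-consensus and tallying steps might look like the sketch below. It assumes "consensus answer" means the answer given by a majority of the three runs and that a scenario with no majority is scored as incorrect. Neither detail is stated above, so treat both as assumptions. The function names and the scenario IDs in the example are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_answer(run_answers: List[str]) -> Optional[str]:
    """Answer given by more than half of the runs, or None if there is no majority."""
    answer, votes = Counter(run_answers).most_common(1)[0]
    return answer if votes * 2 > len(run_answers) else None


def tally(runs: Dict[str, List[str]], truth: Dict[str, str]) -> Tuple[int, int]:
    """Return (correct, total) for one model.

    runs maps scenario ID -> the model's answers across its runs;
    truth maps scenario ID -> the graders' ground-truth answer.
    Assumption: a scenario with no majority answer counts as incorrect.
    """
    correct = sum(
        1 for scenario_id, answers in runs.items()
        if majority_answer(answers) == truth[scenario_id]
    )
    return correct, len(runs)


if __name__ == "__main__":
    # Made-up example data, not benchmark results.
    example_runs = {"s1": ["A", "A", "B"], "s2": ["C", "D", "E"], "s3": ["B", "B", "B"]}
    example_truth = {"s1": "A", "s2": "C", "s3": "A"}
    correct, total = tally(example_runs, example_truth)
    print(f"{correct}/{total} ({100 * correct / total:.0f}%)")
```

The same tally applied to all 50 scenarios gives headline figures in the form reported above, for example 45 correct out of 50 is 90%.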
Frequently asked questions
Why are the test sizes uneven (8, 10, 10, 10, 12)?
Categories have different intrinsic complexity. Order Block Identification has the most scenarios (12) because there are more sub-cases (bullish vs bearish OBs, mitigated vs unmitigated, breaker blocks). Liquidity Sweeps have fewer (8) because the pattern is more uniform. The category counts reflect the natural distribution of SMC decisions a trader actually makes.
Can a generic LLM ever match Zeno on SMC?
With sufficient prompt engineering and few-shot examples, generic LLMs can close some of the gap on the simpler categories. Multi-Timeframe Confluence and BOS/CHoCH classification remain particularly difficult for them because they require synthesizing context across multiple chart images. As frontier models improve their multi-image reasoning, the gap will narrow — that's why we update quarterly.
Is this benchmark biased toward Zeno?
Test scenarios are drawn from real charts and graded by independent SMC traders before any model's response is seen. The grading panel does not know which model produced which answer. The methodology is published — anyone can construct alternative test sets and report their own results. The test set includes scenarios where Zeno also fails (5 of 50), which we publicly document.
How does this compare to LuxAlgo's Quant feature?
LuxAlgo's Quant is a marketing-driven AI tool with no published benchmarks, no public test set, and no methodology documentation. We publish ours specifically because that is the differentiator — verifiable AI capability, not marketed AI capability. We invite LuxAlgo or any competitor to run the same benchmark on their tool and publish the results.
Try Zeno on your own charts
Benchmarks are useful, but the only test that matters is whether Zeno helps you trade better on your own setups. Zeno is included in the Zeno tier ($79/month). Cancel anytime.