
How to Read a Backtest: Metrics That Matter (and Ones That Lie)

VolatiCloud Team

Every backtest report leads with total profit, and total profit is the number least worth trusting. Two strategies can post the same +60% over two years — one on a steady climb, the other spending fourteen months underwater before a single lucky quarter. Reading a backtest well means knowing which metrics carry real information, which ones routinely mislead, and in what order to check them.

This guide walks through that reading order: win rate versus expectancy, the three risk-adjusted ratios and their blind spots, drawdown and recovery, profit factor composition, and the trap where every metric lies at once. All example figures below are illustrative, constructed to show how the arithmetic behaves — they are not results from any real strategy.

The Headline Number Is One Path Through History

A backtest replays one sequence of trades against one slice of history. The total return at the bottom is real arithmetic, but it compresses everything you actually need to know — how bumpy the ride was, how concentrated the gains were, how bad the worst stretch got — into a single figure that a different year, pair, or entry offset could easily rearrange.

So treat total profit as an admission ticket, nothing more. If it's negative, you're done. If it's positive, the reading starts.

Win Rate: The Metric That Lies Most Confidently

Win rate — the percentage of trades that closed profitably — is the most intuitive metric on the report and the one most likely to deceive. A 78% win rate feels like a strong strategy. It says nothing about whether the strategy makes money, because it says nothing about the size of the wins relative to the losses.

Consider two illustrative strategies:

| | Strategy A | Strategy B |
| --- | --- | --- |
| Win rate | 78% | 42% |
| Average win | +0.6% | +2.8% |
| Average loss | −2.4% | −1.0% |
| Expectancy per trade | −0.06% | +0.60% |

Strategy A wins almost four trades out of five and still loses money, because each loss erases four wins. Strategy B loses more often than it wins and is solidly profitable. This pattern isn't exotic — it's the natural shape of mean-reversion systems (many small wins, occasional large losses) versus trend-following systems (many small losses, occasional large wins).

The number that resolves the ambiguity is expectancy — the average profit you can expect per trade once wins and losses are netted against their frequencies:

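Expectancy = (Win Rate × Average Win) − (Loss Rate × Average Loss)

Here Loss Rate is 1 − Win Rate, and Average Loss is entered as a positive magnitude (2.4%, not −2.4%).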

If expectancy is negative, no other metric can save the strategy. The glossary entries on win rate and expectancy cover the formulas and typical ranges in more depth.
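To make the arithmetic concrete, here is a minimal Python sketch that applies the expectancy formula to the two illustrative strategies from the table. The function name `expectancy` and its parameters are hypothetical, chosen for this example only, and the average loss is passed as a positive magnitude.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Average profit per trade, in the same units as avg_win and avg_loss.

    avg_loss is a magnitude: pass 2.4 for an average loss of -2.4%.
    """
    loss_rate = 1.0 - win_rate
    return win_rate * avg_win - loss_rate * abs(avg_loss)


# Illustrative figures from the table above (percent per trade)
strategy_a = expectancy(win_rate=0.78, avg_win=0.6, avg_loss=2.4)
strategy_b = expectancy(win_rate=0.42, avg_win=2.8, avg_loss=1.0)

print(f"A: {strategy_a:+.2f}%  B: {strategy_b:+.2f}%")  # A: -0.06%  B: +0.60%
```

Strategy A's 78% win rate cannot offset losses four times the size of its wins, which is exactly what the sign of the result shows.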

Read win rate as a style descriptor — it tells you whether the strategy grinds or lunges. Read expectancy to learn whether it earns.

Sharpe, Sortino, Calmar: Three Ratios, Three Blind Spots

Risk-adjusted ratios all answer the same question — "was the return worth the risk?" — but each defines risk differently, and each definition has a failure mode.

| Ratio | Return divided by | Where it misleads |
| --- | --- | --- |
| Sharpe | Total volatility (up and down) | Penalizes big winning moves; understates asymmetric strategies |
| Sortino | Downside volatility only | With few losing periods, the denominator is noisy; hides rare catastrophes |
| Calmar | Maximum drawdown | Hostage to a single worst event; a crash-free test window inflates it |

The Sharpe ratio treats all volatility as bad. A trend-following strategy that occasionally triples its monthly return gets punished for those months, because they raise the standard deviation. Sharpe also behaves as if returns were roughly normally distributed — crypto returns are not, so strategies with fat-tailed behavior can score respectably right up until the tail arrives.

The Sortino ratio fixes the upside problem by counting only downside deviation. Its blind spot is the opposite one: a strategy that rarely loses — but loses catastrophically when it does — computes its downside deviation from a handful of observations. A short backtest with six losing days produces a Sortino figure with almost no statistical weight behind it.

The Calmar ratio divides annualized return by maximum drawdown, which makes it the most honest of the three about worst-case pain — and the most fragile, because the denominator is a single event. Test over a window that happened to dodge a crash and Calmar flatters the strategy; include one bad week and it collapses, even if the other 103 weeks were identical.

The practical move is to read all three together and treat disagreement as information:

  • Sharpe well below Sortino — volatility is mostly upside. Common for trend-followers; usually fine.
  • Calmar lagging both — the average ride is smooth, but the single worst episode was severe. Ask whether you'd have kept the bot running through it.
  • All three high on a short window — suspend judgment until you've seen more data (see the overfitting section below).
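The sketch below computes the three ratios side by side. It is a simplified illustration, not the formulas any particular platform uses: conventions vary (sample versus population deviation, annualization method, risk-free rate), and here the risk-free rate and Sortino target default to zero. The monthly returns are invented to mimic a trend-follower with two outsized winning months.

```python
import math
import statistics


def sharpe(returns, periods_per_year, risk_free=0.0):
    """Mean excess return over total volatility, annualized."""
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        return math.nan
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def sortino(returns, periods_per_year, target=0.0):
    """Mean excess return over downside deviation, annualized."""
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(0.0, x) ** 2 for x in excess) / len(excess))
    if downside == 0:
        return math.nan
    return statistics.mean(excess) / downside * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Deepest peak-to-trough decline of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def calmar(returns, periods_per_year):
    """Annualized compound return over maximum drawdown."""
    growth = math.prod(1.0 + r for r in returns)
    years = len(returns) / periods_per_year
    annualized = growth ** (1.0 / years) - 1.0
    worst = max_drawdown(returns)
    return math.nan if worst == 0 else annualized / worst


# Hypothetical monthly returns: small losses, two large winning months
monthly_returns = [0.01, -0.02, 0.15, -0.01, 0.00, -0.02,
                   0.20, -0.01, 0.01, -0.02, 0.02, -0.01]

sharpe_value = sharpe(monthly_returns, periods_per_year=12)
sortino_value = sortino(monthly_returns, periods_per_year=12)
calmar_value = calmar(monthly_returns, periods_per_year=12)
print(f"Sharpe {sharpe_value:.2f}  Sortino {sortino_value:.2f}  Calmar {calmar_value:.2f}")
```

On this series Sortino comes out well above Sharpe, the first pattern in the list: most of the volatility sits in the two big winning months, which Sharpe counts as risk and Sortino ignores.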

Max Drawdown Is a Minimum Estimate, Not a Maximum

Maximum drawdown — the deepest peak-to-trough decline in the equity curve — is the metric most likely to determine whether you survive the strategy, because drawdowns are when humans intervene and turn a temporary loss into a permanent one.

Two things the single number hides:

Recovery is asymmetric. Losses require disproportionate gains to undo:

| Drawdown | Gain needed to recover |
| --- | --- |
| 10% | 11.1% |
| 20% | 25% |
| 33% | 49.3% |
| 50% | 100% |
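The recovery figures follow from one line of arithmetic: after losing a fraction d, the remaining capital is 1 − d, so the gain needed to get back to the old peak is d / (1 − d). A minimal Python sketch, with a hypothetical function name:

```python
def gain_to_recover(drawdown: float) -> float:
    """Fractional gain needed to return to the previous peak.

    drawdown is a fraction of peak equity, e.g. 0.20 for a 20% drawdown.
    """
    if not 0.0 <= drawdown < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return drawdown / (1.0 - drawdown)


for dd in (0.10, 0.20, 0.33, 0.50):
    print(f"{dd:.0%} drawdown -> {gain_to_recover(dd):.1%} gain to recover")
```

The required gain grows faster than the loss itself, which is why deep drawdowns are so much harder to climb out of than shallow ones.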

It's one draw from a distribution. Your backtest's max drawdown is the worst stretch in that particular trade sequence. Reshuffle the same trades and the clustering changes — sometimes the losses bunch together and dig a much deeper hole. This is why a Monte Carlo simulation, which re-runs your trade history across thousands of shuffled sequences, typically reports a 95th-percentile drawdown meaningfully worse than the single backtest showed. On VolatiCloud, Monte Carlo analysis is available on Pro and Enterprise plans — the Monte Carlo guide explains how to read the p5/p95 bands.
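The reshuffling idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not VolatiCloud's Monte Carlo implementation: trades are treated as independent per-trade returns that compound in sequence, the trade list is invented, and the percentile helper uses a simple index lookup.

```python
import random


def max_drawdown(trade_returns):
    """Deepest peak-to-trough decline when trades compound in the given order."""
    equity = peak = 1.0
    worst = 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def shuffled_drawdowns(trade_returns, n_runs=2000, seed=7):
    """Max drawdown of n_runs random reorderings of the same trades, sorted."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    trades = list(trade_returns)
    results = []
    for _ in range(n_runs):
        rng.shuffle(trades)
        results.append(max_drawdown(trades))
    return sorted(results)


def percentile(sorted_values, pct):
    """Simple index-based percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


# Hypothetical backtest: 60 wins of +1.5% and 40 losses of -1.5%, evenly spread
backtest_trades = [0.015, 0.015, -0.015, 0.015, -0.015] * 20

observed = max_drawdown(backtest_trades)
simulated = shuffled_drawdowns(backtest_trades)
p50 = percentile(simulated, 50)
p95 = percentile(simulated, 95)
print(f"backtest {observed:.1%}  shuffled median {p50:.1%}  shuffled p95 {p95:.1%}")
```

The invented backtest order never has two losses in a row, so its drawdown is tiny. Shuffling the same trades lets losses cluster, and the 95th-percentile drawdown lands well above what the single run showed.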

Check the duration alongside the depth, too: a 15% drawdown recovered in two weeks and a 15% drawdown that lasted seven months are very different experiences. For stoploss configuration, position-sizing effects, and the recovery factor, see the dedicated post on managing drawdown in crypto bots.
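Depth and duration can be read off the same equity curve in one pass. The sketch below is a minimal illustration with a hypothetical function name and made-up equity values; duration is counted in periods spent below the previous peak.

```python
def drawdown_stats(equity):
    """Return (max_depth, longest_underwater) for an equity curve.

    max_depth is a fraction of the running peak; longest_underwater is the
    largest number of consecutive periods spent below a previous peak.
    """
    peak = equity[0]
    max_depth = 0.0
    longest = current = 0
    for value in equity:
        if value >= peak:
            peak = value      # new high (or full recovery): reset the clock
            current = 0
        else:
            current += 1
            longest = max(longest, current)
            max_depth = max(max_depth, 1.0 - value / peak)
    return max_depth, longest


# Two hypothetical curves with the same 15% depth but different recovery times
fast_recovery = [100, 85, 100, 105]
slow_recovery = [100, 85, 88, 90, 92, 95, 97, 99, 101]

for name, curve in (("fast", fast_recovery), ("slow", slow_recovery)):
    depth, periods = drawdown_stats(curve)
    print(f"{name}: depth {depth:.0%}, underwater for {periods} periods")
```

Both curves report the same maximum drawdown, yet the second spends seven periods underwater against one for the first, which is the difference the single depth number hides.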

Profit Factor: Reliable, Until One Trade Does All the Work

Profit factor — gross profits divided by gross losses — is one of the sturdier metrics on the report. Above 1.0 the strategy is net profitable; 1.5+ is generally considered solid. It's harder to game than win rate because it accounts for magnitude.

Its failure mode is concentration. Because it sums gross profit, one enormous winner can carry an otherwise mediocre system. The check is simple: mentally (or actually) remove the single best trade and re-estimate. As an illustration, a strategy showing a 1.8 profit factor across 120 trades that drops to 1.1 without its best trade doesn't have an edge — it has a lottery ticket it already cashed. A strategy that drops from 1.8 to 1.7 has a distributed, repeatable edge.

The same logic applies in reverse to the worst trade: if a single outlier loss is masking an otherwise strong system, that's a risk-control problem (usually a stoploss problem), not an edge problem.
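The best-trade check described above is easy to automate. The following Python sketch uses hypothetical function names and two invented trade lists of about 120 trades each, built so that both start at a 1.8 profit factor.

```python
import math


def profit_factor(pnls):
    """Gross profit divided by gross loss (infinite if there are no losses)."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


def profit_factor_without_best(pnls):
    """Profit factor after dropping the single most profitable trade."""
    return profit_factor(sorted(pnls)[:-1])


# Hypothetical P&L lists (account currency), both with a 1.8 profit factor
concentrated = [700.0] + [20.0] * 55 + [-15.625] * 64   # one trade carries it
distributed = [100.0] + [25.0] * 68 + [-20.0] * 50      # gains are spread out

for name, pnls in (("concentrated", concentrated), ("distributed", distributed)):
    print(f"{name}: {profit_factor(pnls):.2f} -> "
          f"{profit_factor_without_best(pnls):.2f} without best trade")
```

The concentrated list falls from 1.8 to 1.1 once its best trade is removed, while the distributed list only slips to 1.7. Applying the same function to `sorted(pnls)[1:]` would run the mirror-image check on the worst trade.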

Trade Count: Decides Whether Anything Else Means Something

Every metric above is a statistic, and statistics computed on 18 trades are noise. Below roughly 30 trades, win rate, expectancy, and all three ratios are dominated by luck; meaningful confidence starts around 100 trades and strengthens past 300. If the count is low, extend the date range or add pairs before interpreting anything else. The sample-size guide covers confidence intervals and the multiple-testing problem in detail.
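One way to see how little 18 trades can say is to put a confidence interval around the win rate. The sketch below uses the Wilson score interval, a common choice for proportions; it is an assumption of this example and not necessarily the method the sample-size guide uses. The trade counts are invented.

```python
import math


def wilson_interval(wins: int, trades: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a win rate (z = 1.96)."""
    if trades <= 0:
        raise ValueError("trades must be positive")
    p = wins / trades
    denom = 1.0 + z * z / trades
    centre = (p + z * z / (2 * trades)) / denom
    half = z * math.sqrt(p * (1 - p) / trades + z * z / (4 * trades ** 2)) / denom
    return centre - half, centre + half


# Roughly the same observed win rate, very different sample sizes
for wins, trades in ((14, 18), (233, 300)):
    low, high = wilson_interval(wins, trades)
    print(f"{wins}/{trades} wins: win rate plausibly between {low:.0%} and {high:.0%}")
```

With 18 trades the interval spans more than thirty percentage points, so a reported 78% win rate is compatible with anything from a coin flip with a slight tilt to a near-perfect record. At 300 trades the same observed rate is pinned down far more tightly.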

When Every Metric Lies at Once: The Overfitting Trap

There is one scenario where every number on the report is simultaneously excellent and simultaneously wrong: a strategy tuned until it memorized the test data. Overfit strategies produce beautiful reports — that's precisely what the tuning optimized for.

Warning: Treat a backtest that looks too clean as a symptom, not a success: Sharpe above 3, win rate above 85%, near-zero drawdown, or an equity curve without a single flat stretch. Real edges are lumpy.

The defense is validation the strategy never saw during development — out-of-sample periods, walk-forward testing, and parameter-sensitivity checks. That's a discipline of its own, covered end-to-end in avoiding overfitting in crypto backtests. For this post's purposes, the rule is: metrics from data the strategy was tuned on are candidates; metrics from data it wasn't are evidence.
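As a rough illustration of what walk-forward testing means mechanically, the sketch below slices a history into rolling tune-then-test windows. The function name, window sizes, and the choice to step forward by one test window are all assumptions for this example; real walk-forward setups differ in how they anchor and overlap windows.

```python
def walk_forward_windows(n_periods: int, train_size: int, test_size: int):
    """Yield (train_range, test_range) index pairs in chronological order.

    Each test range starts right after its train range, so parameters tuned
    on the train slice are always scored on data they never saw.
    """
    windows = []
    start = 0
    while start + train_size + test_size <= n_periods:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # roll forward by one test window
    return windows


# Hypothetical setup: 24 months of data, tune on 12, validate on the next 3
windows = walk_forward_windows(n_periods=24, train_size=12, test_size=3)
for train, test in windows:
    print(f"tune on months {train.start}-{train.stop - 1}, "
          f"test on months {test.start}-{test.stop - 1}")
```

Only the metrics collected on the test slices count as evidence in the sense used above; the tuning slices produce candidates.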

A Reading Order That Works

Putting it together, read a backtest in this order — each step is a gate for the next:

  1. Trade count — is there enough sample to read anything at all?
  2. Expectancy — is there a per-trade edge once win size and loss size are netted?
  3. Profit factor, then its composition — is the edge distributed across trades, or is one outlier carrying the report?
  4. Sharpe / Sortino / Calmar together — was the return worth the volatility, and what does their disagreement tell you?
  5. Max drawdown, depth and duration — could you have lived through the worst stretch without pulling the plug?
  6. Out-of-sample confirmation — do the numbers survive data the strategy never saw?

Every metric in this list — Sharpe, Sortino, Calmar, profit factor, expectancy, win rate, max drawdown, plus CAGR on runs of 30 days or longer — is computed automatically in VolatiCloud's backtesting engine and shown in the results panel of every run, with per-pair breakdowns and the full trade list underneath. The backtesting overview documents each field and its target range.

When a strategy passes all six gates, it has earned a dry run — paper trading against live market data — before it earns real capital. A backtest tells you the strategy had an edge; the dry run tells you the edge survives live execution.


Open a strategy in the VolatiCloud console, run a backtest over at least twelve months, and walk the results through the six gates above. The metrics are all on one screen — the reading order is what turns them into a decision.