The Bug That Cost 94%

MTN.JO returned +94% during its 2024–2026 bull run. MARA—the algorithmic trading system I built to detect market regimes on the Johannesburg Stock Exchange—sat in cash the entire time. Not because the market was unpredictable. Not because volatility was too high. Because of a single line of code.

The Hidden Markov Model found a statistically stable state with 94% returns. The labeling function—the code that decides whether a state is "bull" or "bear"—looked at that state's secondary indicators and labeled it "crisis" with 98% confidence. MARA exited. The market kept climbing.

Two hours of debugging. One formula fix. Walk-forward Sharpe went from 0.18 to 1.20.

This is not a paper about Hidden Markov Models. This is a paper about what happens when you try to build one that works on real JSE data, with real execution delays, real commission drag, and real bugs that look correct until you run the backtest and realize you just threw away a 94% return.

Abstract

MARA (Markov Adaptive Regime Algorithm) is a regime-detection trading system for the Johannesburg Stock Exchange. It uses Hidden Markov Models with quality filters to detect when JSE stocks are in bull, bear, or sideways regimes—and more importantly, when the market structure itself is too chaotic to trade at all. Unlike traditional technical analysis that applies the same logic regardless of market conditions, MARA uses Bayesian model selection to determine optimal state count per stock, walk-forward validation to prevent overfitting, and entropy-based filters as structural predictability gates. Walk-forward out-of-sample results show Sharpe ratios above 1.0 on three validated stocks, but sample sizes remain below statistical significance thresholds. This paper documents the engineering methodology, calibration failures, and design principles that emerged during development—not theoretical edge, but empirical lessons from building a production system that had to survive contact with JSE market structure.

1. Why Regimes Matter

Financial markets don't behave the same way all the time. They have distinct states—periods where price dynamics, volatility structure, and momentum patterns differ qualitatively from other periods. A bull regime is characterized by persistent upward drift, moderate volatility, and positive momentum. A bear regime exhibits negative drift, elevated volatility, and momentum reversals. A sideways regime shows bounded oscillation with no directional bias.

Most retail trading strategies apply the same logic regardless of regime. They try to trend-follow in a sideways market, or hold through a bear market hoping for recovery. The central hypothesis: a system that correctly identifies the active regime and applies regime-appropriate logic should outperform regime-agnostic strategies.

The Johannesburg Stock Exchange makes this harder. JSE stocks exhibit higher baseline entropy than US equities due to lower liquidity, wider bid-ask spreads, and thinner order books. Calibration parameters derived from US market research fail when applied to JSE data. An entropy threshold calibrated on S&P 500 data produced zero tradeable days on MTN.JO. Every signal was blocked. The JSE doesn't behave like the S&P 500, and pretending it does is expensive.

2. System Architecture

MARA is a three-layer system with two independent quality gates. Layer 1 removes microstructure noise from raw price data. Layer 2 extracts regime-relevant signals from cleaned prices. Layer 3 uses Hidden Markov Models to cluster market states into distinct regimes. The HMM uses Bayesian model selection to determine optimal state count per stock—MTN.JO may have 3 regimes, NPN.JO may have 4. Forcing all stocks into the same count is misspecification.
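The per-stock state-count selection can be sketched with the Bayesian Information Criterion. This is a minimal illustration, not the production selection logic; the fitting step that produces each log-likelihood (e.g. a Gaussian HMM fit at each candidate state count) is assumed to happen elsewhere:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # Bayesian Information Criterion: lower is better.
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def n_hmm_params(n_states, n_features):
    # Free parameters of a Gaussian HMM with diagonal covariance:
    # transition matrix rows + initial distribution + means + variances.
    return (n_states * (n_states - 1)) + (n_states - 1) \
        + 2 * n_states * n_features

def select_state_count(candidates, n_obs, n_features):
    """candidates: list of (n_states, log_likelihood) pairs,
    one per fitted model. Returns the BIC-optimal state count."""
    scored = [(bic(ll, n_hmm_params(k, n_features), n_obs), k)
              for k, ll in candidates]
    return min(scored)[1]
```

Because the parameter count grows quadratically in the state count, BIC penalizes extra states hard: a 4-state model only wins if its likelihood gain outweighs the penalty, which is what lets MTN.JO and NPN.JO settle on different counts.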

Two quality gates must clear before any signal is generated. Gate 1 is model confidence: the HMM outputs a probability distribution over states, and the maximum probability must exceed 70% before acting. Below that threshold, the regime is labeled "uncertain" and no new entries are allowed. Gate 2 is entropy filtering, which measures how chaotic or structured the price dynamics are. Low entropy indicates repeating patterns (exploitable structure); high entropy indicates chaos. The HMM can emit a confident bull signal during structurally chaotic price action, and the entropy gate exists to catch exactly that case.
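A minimal sketch of the two gates, using Bandt-Pompe permutation entropy for the structural gate. The thresholds and the 60-day lookback are illustrative stand-ins, not MARA's calibrated values:

```python
import math
from collections import Counter

def permutation_entropy(series, order=3):
    """Normalized Bandt-Pompe permutation entropy in [0, 1].
    0 = perfectly ordered dynamics, 1 = maximally chaotic."""
    patterns = Counter(
        tuple(sorted(range(order), key=lambda i: series[t + i]))
        for t in range(len(series) - order + 1)
    )
    total = sum(patterns.values())
    h = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return h / math.log(math.factorial(order))

def signal_allowed(state_probs, prices, conf_threshold=0.70,
                   entropy_threshold=0.95):
    # Gate 1: the HMM must be confident in a single state.
    if max(state_probs) < conf_threshold:
        return False          # regime "uncertain": no new entries
    # Gate 2: price dynamics must show exploitable structure.
    if permutation_entropy(prices[-60:]) > entropy_threshold:
        return False          # structurally chaotic: block entry
    return True
```

The two gates are independent by design: a confident HMM output on a high-entropy window is still blocked, which is the failure mode described above.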

JSE-specific calibration is critical. Entropy thresholds calibrated on US market data fail on JSE stocks: the default US thresholds produced zero tradeable days on MTN.JO and only 2 tradeable days out of 382 on GFI.JO. After recalibrating on the same multi-year window used for HMM training, both stocks returned to normal signal frequency.

Risk management uses fixed-risk position sizing (2% max loss per trade) and quarter-Kelly allocation. Portfolio exposure limits: maximum 5 simultaneous positions, maximum 80% capital deployed, circuit breaker at -15% drawdown.
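The sizing rules above can be sketched as follows. The interaction between the caps, and the Kelly inputs (win rate, average win and loss), are illustrative assumptions rather than MARA's production logic:

```python
def fixed_risk_size(capital, entry, stop, max_risk=0.02):
    """Shares such that hitting the stop loses at most max_risk of capital."""
    risk_per_share = entry - stop
    if risk_per_share <= 0:
        return 0
    return int((capital * max_risk) / risk_per_share)

def quarter_kelly_fraction(win_rate, avg_win, avg_loss):
    """Kelly fraction f* = p - q / (avg_win / avg_loss), scaled by 1/4."""
    b = avg_win / avg_loss
    f_star = win_rate - (1.0 - win_rate) / b
    return max(0.0, f_star / 4.0)

def position_size(capital, deployed, n_open, entry, stop,
                  win_rate, avg_win, avg_loss,
                  max_positions=5, max_deployed=0.80):
    # Portfolio-level exposure limits are checked first.
    if n_open >= max_positions or deployed >= capital * max_deployed:
        return 0
    shares = fixed_risk_size(capital, entry, stop)
    kelly_cap = capital * quarter_kelly_fraction(win_rate, avg_win, avg_loss)
    return min(shares, int(kelly_cap / entry))
```

Taking the minimum of the fixed-risk size and the quarter-Kelly cap means the more conservative rule always wins; the -15% circuit breaker would sit outside this function, flattening all positions.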

3. Validation Methodology

In-sample metrics are for debugging. The model has seen the training data—it memorizes noise patterns specific to that period. Walk-forward out-of-sample Sharpe is the only deployment criterion. The walk-forward protocol: train on years 1-3, test on year 4 (data the model has never seen), retrain on years 1-4, test on year 5, aggregate performance across all out-of-sample windows. If WF Sharpe > 1.0, the system has demonstrated a genuine edge on JSE data.
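The expanding-window protocol is simple to express; a sketch, assuming yearly splits:

```python
def walk_forward_windows(years, min_train=3):
    """Expanding-window walk-forward: train on years[:k], test on years[k].
    The test year is always unseen at training time."""
    for k in range(min_train, len(years)):
        yield years[:k], years[k]
```

For a 2019–2023 history this yields (train 2019–2021, test 2022) then (train 2019–2022, test 2023); out-of-sample performance is aggregated across all test years.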

Realistic execution assumptions matter. The original backtest assumed trading at the signal-day close price. In practice, the signal is only visible after the close, and execution happens the next morning at best. Testing with execution_delay=1 (execute at the T+1 close) on the validated stocks changed walk-forward Sharpe by -10.1% for MTN, +3.3% for NPN, and -22.4% for GFI. All three remained above Sharpe 1.0, but GFI's edge is materially thinner under realistic execution. execution_delay=1 is now the default; delay=0 is an optimistic fiction.
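In a backtest, the execution delay amounts to shifting every fill forward by `delay` bars. A sketch, assuming daily close-price fills:

```python
def delayed_fills(signals, closes, delay=1):
    """Map each (day, action) signal to the close `delay` bars later.
    delay=0 is the optimistic same-day fill; delay=1 is the default."""
    fills = []
    for t, action in signals:
        fill_t = t + delay
        if fill_t < len(closes):      # signal at the end of data: no fill
            fills.append((fill_t, action, closes[fill_t]))
    return fills
```

Note the boundary condition: a signal on the final bar produces no fill at all under delay=1, which a delay=0 backtest silently converts into a phantom trade.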

Commission drag is significant. At EasyEquities rates (0.5% per trade), a strategy with 8 trades per 6-month window incurs 5.3% annual drag. At Interactive Brokers rates (0.1%), the same strategy incurs 1.07% annual drag. For any strategy with >5 trades per 6-month window, EasyEquities rates make the strategy unprofitable. The 0.5% commission in backtest configuration is a conservative stress test—if the strategy survives at 0.5%, it survives at 0.1%. But live execution must go through IBKR.
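Compounded commission drag can be modeled as 1 - (1 - rate)^n for n commission events per year. This is one simple model, assuming one commission per trade event; the exact annual-drag figures above depend on the trade-count convention used:

```python
def commission_drag(rate, events_per_year):
    """Fraction of annual return given up to commissions,
    compounding one commission per trade event."""
    return 1.0 - (1.0 - rate) ** events_per_year
```

The nonlinearity matters at high trade frequency: drag grows slightly slower than rate × events, but a 5x difference in commission rate still translates into roughly a 5x difference in drag.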

4. Empirical Results

Three stocks passed all validation gates. MTN.JO (Mobile Telephone Networks): walk-forward Sharpe 2.195, win rate 63.64%, 4 out-of-sample trades across multiple OOS windows, max drawdown -8.2%, fully validated. NPN.JO (Naspers): walk-forward Sharpe 1.246, win rate 58.7%, 8 out-of-sample trades, max drawdown -12.1%, validated. GFI.JO (Gold Fields): walk-forward Sharpe 1.316, win rate 61.5%, 4 out-of-sample trades in single OOS window, max drawdown -9.8%, borderline (needs more data). All three beat buy-and-hold benchmarks on out-of-sample data with realistic execution delay and commission assumptions.

Failed stocks provide diagnostic lessons. FSR.JO (FirstRand Bank): initial Sharpe 0.89 in-sample, walk-forward Sharpe -0.41 (failed). Problem: labeling formula optimized for MTN (telecom) was structurally wrong for a rate-sensitive bank. Lesson: a single formula cannot be optimal for stocks with different economic drivers. Commodity stocks (GFI.JO, AGL.JO, SOL.JO): "bear" states returned +3.6% annualized, "bull" states returned -0.1% annualized in OOS windows. The HMM found statistically stable clusters in feature space, but the economic meaning inverted between training and OOS. Root cause: commodity stocks are driven by exogenous factors (gold price, rand/dollar, global risk appetite) that shift faster than a multi-year training window captures.

Statistical significance: under Lo (2002) Sharpe-ratio inference, MTN (4 trades across multiple OOS windows) and NPN (8 trades) are the strongest results, but both remain below comfortable sample sizes. GFI (4 trades in a single OOS window) is borderline: the confidence interval is wide. A Sharpe >1.0 with 4 trades in a single OOS window is a promising signal, not a validated edge. GFI needs 2-3 more OOS windows before its Sharpe can be considered reliable. MTN's 2.195 across multiple OOS windows is the only fully trusted result.
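The width of these intervals can be illustrated with the iid standard error from Lo (2002), SE ≈ sqrt((1 + SR²/2) / n). Using trade count as n is a rough proxy (Lo's result is stated per return observation), but it shows why 4 trades cannot validate an edge:

```python
import math

def sharpe_confidence_interval(sharpe, n_obs, z=1.96):
    """Approximate 95% CI for a Sharpe ratio, using the iid
    standard error from Lo (2002): SE = sqrt((1 + SR^2 / 2) / n)."""
    se = math.sqrt((1.0 + 0.5 * sharpe ** 2) / n_obs)
    return sharpe - z * se, sharpe + z * se
```

Under this proxy, a Sharpe of 1.316 on 4 observations has an interval spanning zero, i.e. the point estimate alone says little until more OOS windows accumulate.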

5. What Actually Broke

The failures documented here share a common structure: plausible-sounding modifications that degrade performance in ways not immediately obvious from the output. The system runs, produces numbers, and the numbers look reasonable—but the underlying behavior is wrong.

The most consequential bug was a formula weighting error. The labeling function used secondary indicators (RSI, trend strength) to score regime states. At daily return scale, these secondary signals were larger than the actual return signal itself. A state returning +94% out-of-sample was labeled "crisis" with 98% confidence because secondary indicators dominated the score. The fix: ensure the return term is always the largest contributor by construction. No secondary signal can override actual return performance. Invariant: any state with negative mean return must score lower than any state with positive mean return.
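One way to enforce the invariant by construction, rather than by weight tuning, is a lexicographic score in which the sign of the state's mean return always dominates and secondary indicators only break ties. This is an illustrative alternative, not the proprietary formula:

```python
def label_score(mean_return, secondary_score):
    """Lexicographic score: sign of mean return dominates, then the
    return magnitude, then secondary indicators as a tie-breaker.
    No secondary signal can ever flip a state across the zero line."""
    sign = (mean_return > 0) - (mean_return < 0)
    return (sign, mean_return, secondary_score)

def label_states(states):
    """states: dict of state_id -> (mean_return, secondary_score).
    Lowest-scoring state is labeled bear, highest bull, rest sideways."""
    ranked = sorted(states, key=lambda s: label_score(*states[s]))
    labels = {ranked[0]: "bear", ranked[-1]: "bull"}
    for s in ranked[1:-1]:
        labels[s] = "sideways"
    return labels
```

Because Python tuples compare left to right, any negative-mean-return state scores below any positive one regardless of RSI or trend strength, which is exactly the stated invariant.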

After fixing the labeling formula, backtests showed no improvement. Two hours of debugging revealed: cached models were loading old state mappings instead of recomputing them. This is a class of error that is easy to miss because the code runs without errors and produces plausible-looking output. The fix is procedural: delete all cached models before every backtest after any change to regime labeling logic.
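A structural complement to the procedural delete-the-cache rule is to make staleness impossible: derive the cache key from a hash of the labeling configuration, so any formula change automatically misses the old cache. A sketch (the key fields are assumptions):

```python
import hashlib
import json

def model_cache_key(ticker, train_window, labeling_config):
    """Cache key that changes whenever the labeling config changes,
    so stale state mappings can never be loaded after a formula edit."""
    payload = json.dumps(
        {"ticker": ticker, "window": train_window,
         "labeling": labeling_config},
        sort_keys=True,                     # deterministic serialization
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

The same idea extends to any parameter that affects model output: include it in the key, and cache invalidation stops depending on human discipline.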

A subtle but severe bug: the actionability check was derived from raw per-bar HMM confidence before the persistence filter ran. A single day where confidence dipped below threshold triggered an exit. FSR.JO suffered 5 premature exits across its 8 OOS trades, the primary cause of its negative walk-forward Sharpe. The persistence filter exists precisely to absorb single-day confidence dips; bypassing it with a per-bar check defeats its purpose entirely.
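The correct ordering is to derive actionability from the persistence-filtered stream. A sketch of such a filter, with illustrative threshold and run length:

```python
def persistent_exit(confidences, threshold=0.70, min_bars=3):
    """Exit only after `min_bars` consecutive bars below threshold.
    A single-day confidence dip is absorbed, not acted on."""
    run = 0
    for c in confidences:
        run = run + 1 if c < threshold else 0
        if run >= min_bars:
            return True
    return False
```

The per-bar bug is equivalent to min_bars=1: every transient dip becomes an exit, which is how FSR accumulated premature exits.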

Fixing the labeling formula for MTN broke FSR (Sharpe 0.89 → -0.41). A telecom stock and a rate-sensitive bank have structurally different return dynamics; a single formula cannot be optimal for both. The solution is a cluster architecture: stocks are grouped into behavioral clusters (telecom, rate-sensitive, commodity), each with separate labeling weights. New stocks inherit their cluster's configuration. There is deliberately no per-stock configuration, because per-stock tuning is a form of overfitting.
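The cluster architecture reduces to a two-level lookup. All weights and cluster assignments below are illustrative placeholders, not the calibrated configuration:

```python
# behavioral cluster -> labeling weights (values illustrative)
CLUSTERS = {
    "telecom":        {"w_return": 1.0, "w_secondary": 0.10},
    "rate_sensitive": {"w_return": 1.0, "w_secondary": 0.25},
    "commodity":      {"w_return": 1.0, "w_secondary": 0.05},
}

# stock -> cluster; new stocks get added here, never to CLUSTERS
STOCK_CLUSTER = {
    "MTN.JO": "telecom",
    "FSR.JO": "rate_sensitive",
    "GFI.JO": "commodity",
}

def labeling_config(ticker):
    """New stocks inherit their cluster's configuration;
    there is deliberately no per-stock override."""
    return CLUSTERS[STOCK_CLUSTER[ticker]]
```

The point of the indirection is that the tunable surface stays at three cluster configs, not one config per stock, which caps the overfitting risk as the universe grows.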

When the entropy gate was weakened—changed from a complete block to a reduced-size entry on high-entropy days—win rate dropped from 63.64% to 46.67% and Sharpe from 1.20 to 0.82 on MTN. The entropy gate is not merely filtering noise. High-entropy days on MTN are genuinely lower-quality signal days. The gate has predictive content. Weakening it does not recover missed trades; it adds losing trades.

6. Related Work

Hidden Markov Models for regime detection in financial markets date to Hamilton (1989), who applied Markov-switching models to US GNP data to detect business cycle regimes. The application to equity markets was formalized by Ang and Bekaert (2002), who demonstrated that regime-switching models outperform single-regime models in explaining equity return dynamics. Entropy-based measures of time series complexity were introduced by Bandt and Pompe (2002). Their application to financial markets as a predictability filter was demonstrated by Zunino et al. (2009), who showed that permutation entropy distinguishes between efficient and inefficient market periods. The Kelly criterion for optimal position sizing was derived by Kelly (1956) in the context of information theory. Walk-forward validation as the gold standard for time series model evaluation is documented by Pardo (2008) and is standard practice for preventing overfitting in algorithmic trading systems.

7. What Happens Next

MARA demonstrates walk-forward out-of-sample Sharpe ratios above 1.0 on three validated JSE stocks under realistic execution and commission assumptions. MTN.JO: 2.195 Sharpe, 63.64% win rate. NPN.JO: 1.246 Sharpe, 58.7% win rate. GFI.JO: 1.316 Sharpe, 61.5% win rate. The system is in paper trading. Go-live requires 3 months of consistent signal quality and maximum drawdown below 15%. The sample sizes (4-8 trades per stock) are below statistical significance thresholds. MTN's 2.195 Sharpe across multiple OOS windows is the only fully trusted result.

This paper documents engineering methodology and empirical lessons, not theoretical edge. The failures documented here—labeling formula errors, cache invalidation bugs, JSE-specific calibration requirements, persistence filter bypasses—share a common structure: plausible-sounding modifications that degrade performance in ways not immediately obvious from output. The defenses are procedural: verify formulas before backtests, delete cached models after formula changes, calibrate thresholds per stock on JSE data, derive actionability from persistence-filtered regimes, run fold-by-fold decomposition before tuning, use execution_delay=1 as default, and trust walk-forward OOS Sharpe above all else.

Specific feature engineering, quality gate thresholds, HMM hyperparameters, and cluster configurations constitute proprietary calibration and are not disclosed. The mathematical structure and validation methodology are public; the calibrated parameters are not. Future work: accumulating additional OOS windows to reach statistical significance, expanding the validated stock universe beyond 3 stocks, implementing the cluster architecture for rate-sensitive and commodity stocks, regime-conditioned exit logic to reduce premature exits, and live paper trading validation before capital deployment.

The JSE speaks. MARA is learning to listen.

Acknowledgments

This research was conducted by a software engineer applying quantitative methods to JSE equity markets. AI-assisted tools (Claude) accelerated understanding of Hidden Markov Models and statistical validation methods. The system architecture, feature engineering, calibration methodology, and walk-forward validation pipeline were designed and implemented by the author. The research direction, hypothesis formation, and proprietary calibration parameters remain entirely the author's work.

