How Gold Diff turns build choices into MWPA scores.
The site is designed to answer one question cleanly: did this build choice actually help, or did it just show up more often in games that were already won? MWPA answers it by comparing each option to an appropriate contextual baseline rather than to its raw global win rate.
xPetu's MWPA thesis
The Mean Win Probability Added framework used throughout Gold Diff is directly adapted from research by xPetu. The core thesis: raw win rates on items and runes are misleading because they conflate correlation with causation. MWPA isolates the contribution of a single build decision by comparing outcomes against an appropriate contextual baseline, averaged across many games.
Without xPetu's work, most build sites would still be reporting that Mejai's Soulstealer has a 75% win rate and calling it a recommendation. MWPA cuts through that noise.
“An item with 60% win rate might only appear in games that are already won. MWPA asks: did buying this item actually change the probability of winning, or was the game decided before the purchase?”
— Adapted from xPetu's build analysis framework
Item genuinely increased win probability across the sample
Item was associated with lower win probability after accounting for game state
Champion baseline
Every score starts from the champion's own win rate inside the current sample. That keeps naturally strong or weak champions from distorting the read.
Eligible baselines
Later build checkpoints are only compared against games that actually reached them. A third-item option is not measured against one-item stomps.
Composite signal
Items, boots, rune pages, and summoner spells stack into a composite read. The total is directional guidance, not an exact promise of match outcome.
Two lenses on build quality
Gold Diff uses two complementary metrics. MWPA measures real in-game impact on win probability. Gold Efficiency measures raw stat value per gold spent. Together they answer: is this item actually winning games, and is it a good deal from the shop?
When dense timeline states are missing, Gold Diff falls back to item-count-adjusted excess win rate against eligible baselines. That fallback is still debiased, but it is not the same thing as decision-time WP-delta.
Convert each stat on the item into gold using reference items (Long Sword = 35g per point of AD, Amp Tome = 21.75g per point of AP, Ruby Crystal = 2.67g per point of HP, and so on). Sum the stat values, divide by the item's cost, and multiply by 100.
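The arithmetic is simple enough to sketch directly. A minimal Python version using the reference costs quoted above; the 3000g item stats are invented purely for illustration:

```python
# Gold value of one point of each stat, taken from the cheapest basic
# component that grants it (values quoted above).
GOLD_PER_STAT = {"ad": 35.0, "ap": 21.75, "hp": 2.67}

def gold_efficiency(stats, item_cost):
    """GE% = (gold value of countable stats / item cost) * 100."""
    stat_value = sum(GOLD_PER_STAT[stat] * amount for stat, amount in stats.items())
    return 100.0 * stat_value / item_cost

# A hypothetical 3000g item granting 60 AD and 300 HP:
ge = gold_efficiency({"ad": 60, "hp": 300}, 3000)  # ~96.7%
```

Note that, as described above, only the stats present in the lookup table contribute; a passive worth real gold simply never enters the sum.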
Gold efficiency only counts base stats available in Data Dragon (AD, AP, HP, Armor, MR, AS, Crit, MS, Mana). Passive effects, ability haste, lethality, and unique actives are not factored into the GE percentage. An item with 80% GE might still be excellent if its passive is worth the gap. Gold Diff now pairs GE with passive profile notes and purchase-context reads so low-GE items can still be understood without pretending every passive has one exact gold price.
Measures change in probability of winning. Accounts for game state context. Tells you if the item actually helped win games, not just if winners bought it.
Measures stat value per gold spent. Pure economy metric. Tells you if Boris is ripping you off on raw stats, independent of whether those stats win games.
The option outperformed the relevant baseline in the observed sample.
The option underperformed the relevant baseline in the observed sample.
Larger sample sizes deserve more trust. Rare niche paths can still swing hard even after debiasing.
Where raw MWPA breaks down — and how we fix it
The xPetu thesis provides the theoretical foundation, but applying it to real ranked data surfaces practical failure modes. Gold Diff implements corrections for each one. The scoring pipeline is:
Each score starts as excess win rate over the eligible baseline. When timeline data is available, in-game choices (items, boots, build order) are enriched with a calibrated WP-delta at decision time. Pregame choices (runes, full pages, summoner spells, skill order, starting items) use model-predicted eWPA when the early-state baseline exists, and otherwise stay flagged as contextual fallbacks. All scores are then shrunk toward a champion-role prior anchored to a fixed zero global baseline and paired with confidence intervals.
MWPA inherits errors from the win-probability model. If predicted probabilities don't match observed frequencies, every downstream score is biased.
The WP model is wrapped with isotonic calibration (CalibratedClassifierCV) before it is used for WP-delta enrichment on items, boots, and build-order slots, plus early-window eWPA on runes, full rune pages, summoner spells, and skill orders. We track ECE (Expected Calibration Error) before and after, plus time-sliced ECE to catch late-game calibration drift.
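In production this wrapping is done with scikit-learn's CalibratedClassifierCV, but the isotonic step underneath is easy to sketch on its own. A minimal pool-adjacent-violators (PAVA) fit over (raw model score, observed win) pairs, written from scratch here for illustration:

```python
def isotonic_fit(scores, outcomes):
    """Pool adjacent violators: returns (max_score, calibrated_prob) steps
    whose calibrated probabilities are non-decreasing in the raw score."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [max_score_in_block, mean_outcome, weight]
    for score, outcome in pairs:
        blocks.append([score, float(outcome), 1.0])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
            s, m2, w2 = blocks.pop()
            _, m1, w1 = blocks.pop()
            w = w1 + w2
            blocks.append([s, (m1 * w1 + m2 * w2) / w, w])
    return [(s, m) for s, m, _ in blocks]

def calibrate(steps, score):
    """Map a raw score onto the fitted step function."""
    for max_score, prob in steps:
        if score <= max_score:
            return prob
    return steps[-1][1]

# Toy fit: low scores lost, high scores won.
steps = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
```

ECE is then just the gap between these calibrated probabilities and observed win frequencies, averaged over score bins.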
MWPA is a difference of two proportions (build WR minus baseline WR). Naive CIs treat the baseline as fixed, understating true uncertainty.
Gold Diff uses Newcombe-Wilson intervals (Method 10) which account for uncertainty in both the build and baseline win rates. The UI flags wide intervals and marks low-sample results with an amber indicator.
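For reference, method 10 combines the individual Wilson limits of the two proportions rather than treating either as fixed. A self-contained sketch, with z fixed at 1.96 for roughly 95% intervals:

```python
import math

def wilson(k, n, z=1.96):
    """Wilson score interval for a single proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def newcombe_diff(k1, n1, k2, n2, z=1.96):
    """CI for p1 - p2 (build WR minus baseline WR), Newcombe method 10:
    combine the one-sided Wilson gaps of each proportion in quadrature."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, z)
    l2, u2 = wilson(k2, n2, z)
    d = p1 - p2
    lo = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    hi = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lo, hi

# 60/100 build wins vs a 50/100 baseline: point estimate +0.10.
lo, hi = newcombe_diff(60, 100, 50, 100)
```

A wide (lo, hi) band at this sample size is exactly what triggers the amber low-sample indicator in the UI.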
Narrowing the state space (patch + champion + role + matchup + item slot) makes estimates more meaningful, but sample sizes shrink fast.
Two-level shrinkage: each analysis batch estimates a champion-role prior from the games-weighted average raw effect, then shrinks individual rows toward that prior while keeping a fixed zero global anchor. This keeps rare builds honest without collapsing every low-sample estimate straight to zero.
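The shrinkage itself is a standard pseudo-count blend. A sketch of the two levels, assuming a pseudo-count of k=50 (the site's actual weighting is not published here):

```python
def shrink(raw_effect, n_games, prior, k=50.0):
    """Weight the raw effect by its sample size against a prior.
    k is a hypothetical pseudo-count controlling shrinkage strength."""
    w = n_games / (n_games + k)
    return w * raw_effect + (1 - w) * prior

def champion_role_prior(rows, k=50.0):
    """Level 1: games-weighted average raw effect across the batch,
    itself shrunk toward the fixed zero global anchor."""
    total = sum(n for _, n in rows)
    avg = sum(effect * n for effect, n in rows) / total
    return shrink(avg, total, 0.0, k)

# rows are (raw_effect, n_games) pairs for one champion-role batch.
prior = champion_role_prior([(0.10, 100), (0.00, 100)])
row_score = shrink(0.20, 10, prior)  # level 2: a rare 10-game build
```

The rare build lands between the champion-role prior and its raw effect instead of collapsing straight to zero, which is the behavior described above.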
The thesis notes that determining when an item's effect ends is difficult. Using win/loss as the outcome absorbs unrelated later events.
Item-count-adjusted baselines compare each build against games completing the same number of items, isolating the build decision from game length survivorship bias.
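A toy version of that adjustment, with games represented as (items_completed, won) pairs; the data layout is illustrative, not the site's actual schema:

```python
from collections import defaultdict

def item_count_baselines(games):
    """Baseline win rate per completed-item count.
    games: iterable of (items_completed, won) with won in {0, 1}."""
    agg = defaultdict(lambda: [0, 0])  # count -> [wins, games]
    for items, won in games:
        agg[items][0] += won
        agg[items][1] += 1
    return {items: wins / n for items, (wins, n) in agg.items()}

def excess_win_rate(build_games, all_games):
    """Compare a build only against games reaching the same item count,
    so three-item builds are never graded against one-item stomps."""
    base = item_count_baselines(all_games)
    diffs = [won - base[items] for items, won in build_games]
    return sum(diffs) / len(diffs)
```

Usage: if the three-item baseline sits at 75% and a particular build wins both of its three-item games, its excess is +0.25 before shrinkage.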
The thesis warns that simplified state representations risk omitting important variables. The Riot API data is inherently limited.
Gold Diff enriches context with gold-at-purchase, timing buckets (ahead/even/behind), matchup damage type, and duration bucketing to reduce omitted-variable bias. Boots now use exact final-purchase timestamps when timeline data has been backfilled; older legacy rows fall back to a fixed minute-12 proxy until they are refreshed.
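The ahead/even/behind read reduces to a threshold on the purchaser's gold lead at decision time. A sketch with an assumed 1000g cutoff (the real cutoff is not stated here):

```python
def gold_state_bucket(gold_diff, threshold=1000):
    """Bucket game state at purchase time by the player's gold lead.
    threshold is a hypothetical cutoff chosen for illustration."""
    if gold_diff > threshold:
        return "ahead"
    if gold_diff < -threshold:
        return "behind"
    return "even"
```

Bucketing instead of using the raw number keeps each context cell large enough to estimate a baseline from.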
Item values shift with balance changes and meta evolution. A model trained on Patch 14.15 data is stale by 14.18.
The scheduler exports current-patch analysis every 6 hours and retrains the WP model weekly. Model evaluation prefers a latest-patch holdout when possible and persists the train/test patch window in metadata, so patch drift is measured explicitly instead of assumed away.
The WP model predicts from a single game-state snapshot (a Markov assumption). Momentum effects such as comeback streaks and snowball runs are not captured.
Gold Diff includes per-minute rate features (gold_diff_per_min, cs_diff_per_min) which implicitly encode momentum. Full sequence-aware models (RNN/Transformer) are on the roadmap.
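Concretely, the rate features divide cumulative diffs by game minute, so a 2000-gold lead reads differently at minute 10 than at minute 30. Field names below are illustrative, not the site's actual schema:

```python
def rate_features(state):
    """Derive per-minute rates from a single game-state snapshot.
    A steep gold_diff_per_min implicitly encodes momentum even though
    the model only ever sees one state at a time."""
    minute = state["minute"]
    return {
        "gold_diff_per_min": state["gold_diff"] / minute,
        "cs_diff_per_min": state["cs_diff"] / minute,
    }
```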
Without a clear threshold, low-sample build paths can appear as confident recommendations.
Results below the minimum sample threshold are excluded from the export. All published scores carry a sample quality label (high / moderate / low) and visible game count.
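The gating logic amounts to a threshold ladder. The cutoffs below are hypothetical, chosen only to show the shape of the rule, not the site's published thresholds:

```python
def sample_label(n_games, min_n=50, moderate_n=200, high_n=1000):
    """Return a sample quality label, or None for excluded results.
    All thresholds here are assumed values for illustration."""
    if n_games < min_n:
        return None  # below minimum sample: dropped from the export
    if n_games >= high_n:
        return "high"
    return "moderate" if n_games >= moderate_n else "low"
```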
Before any score is published
What Boris charges per stat point
These per-unit gold values come from the cheapest basic component that provides each stat in isolation. They are the foundation of every gold efficiency calculation on the site.