Methodology

How Gold Diff turns build choices into MWPA scores.

The site is designed to answer one question cleanly: did this build choice actually help, or did it just show up more often in games that were already won? MWPA estimates the difference by comparing each option to an appropriate baseline instead of raw global win rate.

Academic Foundation
Attribution

xPetu's MWPA thesis

The Mean Win Probability Added framework used throughout Gold Diff is directly adapted from research by xPetu. The core thesis: raw win rates on items and runes are misleading because they conflate correlation with causation. MWPA isolates the contribution of a single build decision by comparing outcomes against an appropriate contextual baseline, averaged across many games.

Without xPetu's work, most build sites would still be reporting that Mejai's Soulstealer has a 75% win rate and calling it a recommendation. MWPA cuts through that noise.

Key insight

“An item with 60% win rate might only appear in games that are already won. MWPA asks: did buying this item actually change the probability of winning, or was the game decided before the purchase?”

— Adapted from xPetu's build analysis framework

+MWPA

Item genuinely increased win probability across the sample

-MWPA

Item was associated with lower win probability after accounting for game state

Core ideas
Core idea

Champion baseline

Every score starts from the champion's own win rate inside the current sample. That keeps naturally strong or weak champions from distorting the read.

Core idea

Eligible baselines

Later build checkpoints are only compared against games that actually reached them. A third-item option is not measured against one-item stomps.

Core idea

Composite signal

Items, boots, rune pages, and summoner spells stack into a composite read. The total is directional guidance, not an exact promise of match outcome.

The Mathematics

Two lenses on build quality

Gold Diff uses two complementary metrics. MWPA measures real in-game impact on win probability. Gold Efficiency measures raw stat value per gold spent. Together they answer: is this item actually winning games, and is it a good deal from the shop?

In-Game MWPA (Items, Boots, Build Order)
MWPA(a) ≈ avg(outcome − WP(model @ decision state))
a: the build decision (specific item or item path)
WP(model): calibrated pre-decision win probability from the dense timeline model
avg(...): average across all observed decisions of that type

When dense timeline states are missing, Gold Diff falls back to item-count-adjusted excess win rate against eligible baselines. That fallback is still debiased, but it is not the same thing as decision-time WP-delta.

How to read

Positive: item increases win chance
Zero: neutral impact
Negative: item hurts win chance
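
The decision-time average is simple enough to sketch directly. This is a minimal Python illustration, assuming each record pairs a 1/0 game outcome with the calibrated win probability the model assigned at the moment of purchase; the layout and numbers are illustrative, not Gold Diff's pipeline code:

```python
from statistics import mean

def mwpa(decisions):
    """Average win probability added across observed decisions.

    Each record is (outcome, model_wp): outcome is 1 for a win and 0
    for a loss; model_wp is the calibrated pre-decision win probability.
    """
    return mean(outcome - model_wp for outcome, model_wp in decisions)

# Three purchases of the same hypothetical item: two wins from roughly
# even states, one loss from a slightly ahead state.
sample = [(1, 0.48), (1, 0.55), (0, 0.61)]
print(round(mwpa(sample), 3))  # positive: the item beat the model's expectation
```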
Gold Efficiency (Traditional Item Value)
GE = (Σ(Statᵢ × GoldValueᵢ) / Item Cost) × 100

Convert each stat on the item into gold using reference items (Long Sword = 35g per AD, Amplifying Tome = 21.75g per AP, Ruby Crystal = 2.67g per HP, etc.). Sum the stat values, divide by item cost, multiply by 100.

Statᵢ: quantity of each stat on the item
GoldValueᵢ: gold worth of one unit of that stat (from reference items)
How to read

>100%: very efficient buy
100%: exactly worth the gold
<100%: inefficient unless passive is strong
Note

Gold efficiency only counts base stats available in Data Dragon (AD, AP, HP, Armor, MR, AS, Crit, MS, Mana). Passive effects, ability haste, lethality, and unique actives are not factored into the GE percentage. An item with 80% GE might still be excellent if its passive is worth the gap. Gold Diff now pairs GE with passive profile notes and purchase-context reads so low-GE items can still be understood without pretending every passive has one exact gold price.
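
The GE arithmetic itself fits in a few lines. A sketch using the same per-stat reference values listed in the Gold Value Reference section; the 3000g item and its stat line are hypothetical:

```python
# Per-unit gold values from the cheapest basic component for each stat.
GOLD_VALUE = {
    "ad": 35.0,     # Long Sword
    "ap": 21.75,    # Amplifying Tome
    "hp": 2.67,     # Ruby Crystal
    "armor": 20.0,  # Cloth Armor
}

def gold_efficiency(stats, cost):
    """GE = (sum of stat_i * gold_value_i / item cost) * 100."""
    stat_worth = sum(GOLD_VALUE[stat] * amount for stat, amount in stats.items())
    return stat_worth / cost * 100

# Hypothetical 3000g item with 55 AD and 300 HP.
ge = gold_efficiency({"ad": 55, "hp": 300}, cost=3000)
print(f"{ge:.1f}%")  # 90.9%: the base stats alone don't cover the price
```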

Quick comparison
MWPA

Measures change in probability of winning. Accounts for game state context. Tells you if the item actually helped win games, not just if winners bought it.

Gold Efficiency

Measures stat value per gold spent. Pure economy metric. Tells you if Boris is ripping you off on raw stats, independent of whether those stats win games.

Item path scoring

Gold Diff tracks the first three completed items from timeline data. Each exact one-item, two-item, and three-item prefix is scored against the win rate of games that reached the same prefix length.

fallback path score = path win rate − eligible baseline

Example: a two-item opening is compared to other two-item openings for that same champion, not to the champion's overall win rate across all game states. When dense timeline frames exist, the exported ranking is upgraded again with decision-time WP-delta.

Boots remain separate because they often slot in between major completions and behave more like a supporting choice than a core ordered path decision.

The site also stores average completion timing and purchase gold for each slot item. That lets the UI flag whether an option usually lands ahead, behind, early, or on-curve relative to the champion's normal purchase window.
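
The fallback path scoring can be sketched as follows, assuming each game contributes its ordered prefixes of completed items; item names and the record layout are illustrative:

```python
from collections import defaultdict

def fallback_path_scores(games):
    """Score each exact item prefix against its eligible baseline.

    games: list of (completed_items, won), where completed_items is the
    ordered tuple of up to three completed items. The baseline for an
    n-item prefix is the win rate of every game that reached n completed
    items, so three-item paths are never judged against one-item stomps.
    """
    reached = defaultdict(lambda: [0, 0])  # prefix length -> [wins, games]
    paths = defaultdict(lambda: [0, 0])    # exact prefix -> [wins, games]
    for items, won in games:
        for n in range(1, len(items) + 1):
            reached[n][0] += won
            reached[n][1] += 1
            paths[items[:n]][0] += won
            paths[items[:n]][1] += 1
    return {
        prefix: wins / total - reached[len(prefix)][0] / reached[len(prefix)][1]
        for prefix, (wins, total) in paths.items()
    }

# Three illustrative games for one champion.
scores = fallback_path_scores([
    (("Eclipse", "Serylda's Grudge"), 1),
    (("Eclipse", "Black Cleaver"), 0),
    (("Eclipse", "Serylda's Grudge"), 1),
])
```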

Pregame choice scoring (eWPA)

Runes, full rune pages, summoner spells, skill orders, and starting items are pregame choices — locked in before the match starts. Because there is no in-game purchase moment, these cannot use a true decision-time WP-delta. When early timeline data is available, Gold Diff instead uses eWPA (Expected Win Probability Added): the option's observed win rate minus the model's predicted win probability across early evaluation windows (5, 10, and 15 minutes). If that model baseline is unavailable, the site falls back to contextual excess win rate and marks the row as a pregame fallback instead of pretending it was model-derived.

eWPA(option) = WR(option) − avg WP(model @ {5,10,15}m | option)

The model's early-game expectation becomes the baseline instead of pretending pregame choices have an in-match purchase timestamp.

Full rune pages are evaluated as a unit, and skill orders are grouped by the champion's Q/W/E max pattern. When you select a tree, keystone, and individual rows, the site filters to sampled pages that match those choices and uses their eWPA. Partial pages use a weighted blend of matching full pages.

This keeps rune scoring grounded in real combinations instead of pretending each rune exists in isolation from the rest of the page.
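
The eWPA formula above can be sketched in Python, under the assumption that each sampled game carries calibrated model win probabilities at the 5/10/15-minute checkpoints; the sample values are illustrative:

```python
from statistics import mean

def ewpa(option_games):
    """Expected Win Probability Added for a pregame option.

    option_games: list of (won, window_wps), where window_wps holds the
    calibrated model win probabilities at the 5/10/15-minute evaluation
    windows for that game.
    """
    win_rate = mean(won for won, _ in option_games)
    expected = mean(mean(wps) for _, wps in option_games)
    return win_rate - expected

# Three sampled games for one hypothetical rune page.
sample = [
    (1, [0.50, 0.52, 0.55]),
    (0, [0.48, 0.47, 0.50]),
    (1, [0.51, 0.53, 0.54]),
]
print(round(ewpa(sample), 3))  # positive: the page beat the model's early expectation
```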

How to read the numbers
Positive MWPA

The option outperformed the relevant baseline in the observed sample.

Negative MWPA

The option underperformed the relevant baseline in the observed sample.

Confidence

Larger sample sizes deserve more trust. Rare niche paths can still swing hard even after debiasing.

Beyond the thesis

Where raw MWPA breaks down — and how we fix it

The xPetu thesis provides the theoretical foundation, but applying it to real ranked data surfaces practical failure modes. Gold Diff implements corrections for each one. The scoring pipeline is:

raw excess WR → WP-delta enrichment → hierarchical shrinkage → Newcombe-Wilson CI

Each score starts as excess win rate over the eligible baseline. When timeline data is available, in-game choices (items, boots, build order) are enriched with a calibrated WP-delta at decision time. Pregame choices (runes, full pages, summoner spells, skill order, starting items) use model-predicted eWPA when the early-state baseline exists, and otherwise stay flagged as contextual fallbacks. All scores are then shrunk toward a champion-role prior anchored to a fixed zero global baseline and paired with confidence intervals.

1. Model calibration
Issue

MWPA inherits errors from the win-probability model. If predicted probabilities don't match observed frequencies, every downstream score is biased.

Fix

The WP model is wrapped with isotonic calibration (CalibratedClassifierCV) before it is used for WP-delta enrichment on items, boots, and build-order slots, plus early-window eWPA on runes, full rune pages, summoner spells, and skill orders. We track ECE (Expected Calibration Error) before and after, plus time-sliced ECE to catch late-game calibration drift.
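
The calibration check is easy to sketch. This is a plain-Python version of the ECE metric, assuming equal-width probability bins; the isotonic fit itself is handled by scikit-learn's CalibratedClassifierCV and is not reproduced here:

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: sample-weighted gap between predicted probability and
    observed win frequency, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        avg_y = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_p - avg_y)
    return ece

# A perfectly calibrated toy model: 25% predicted, 1 win in 4 observed.
print(expected_calibration_error([0.25, 0.25, 0.25, 0.25], [0, 0, 0, 1]))  # 0.0
```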

2. Confidence intervals
Issue

MWPA is a difference of two proportions (build WR minus baseline WR). Naive CIs treat the baseline as fixed, understating true uncertainty.

Fix

Gold Diff uses Newcombe-Wilson intervals (Method 10) which account for uncertainty in both the build and baseline win rates. The UI flags wide intervals and marks low-sample results with an amber indicator.
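
Newcombe's hybrid interval combines the Wilson bounds of both proportions. A minimal sketch with illustrative sample counts (not Gold Diff's production code):

```python
from math import sqrt

def wilson(wins, n, z=1.96):
    """Wilson score interval for a single binomial proportion."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def newcombe_diff_ci(wins1, n1, wins2, n2, z=1.96):
    """Newcombe hybrid CI for p1 - p2 (build WR minus baseline WR),
    carrying Wilson uncertainty from both proportions."""
    p1, p2 = wins1 / n1, wins2 / n2
    l1, u1 = wilson(wins1, n1, z)
    l2, u2 = wilson(wins2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# 60 wins in 100 build games against a 52% baseline from 1000 games.
lo, hi = newcombe_diff_ci(60, 100, 520, 1000)
print(f"excess WR 95% CI: {lo:+.3f} to {hi:+.3f}")
```

The interval straddles zero here even though the raw excess is +8 points, which is exactly the kind of wide-interval row the UI flags.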

3. Hierarchical shrinkage
Issue

Narrowing the state space (patch + champion + role + matchup + item slot) makes estimates more meaningful, but sample sizes shrink fast.

Fix

Two-level shrinkage: each analysis batch estimates a champion-role prior from the games-weighted average raw effect, then shrinks individual rows toward that prior while keeping a fixed zero global anchor. This keeps rare builds honest without collapsing every low-sample estimate straight to zero.
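
One way to sketch the two levels, using pseudo-count strength constants that are purely illustrative (the production prior strengths are not documented here):

```python
def shrink_scores(rows, prior_strength=200, global_strength=500):
    """Two-level shrinkage for one champion-role batch (illustrative).

    rows: list of (raw_effect, n_games). The champion-role prior is the
    games-weighted mean raw effect, itself shrunk toward the fixed zero
    global anchor; each row is then shrunk toward that prior in
    proportion to its own sample size.
    """
    total = sum(n for _, n in rows)
    weighted_mean = sum(e * n for e, n in rows) / total
    prior = weighted_mean * total / (total + global_strength)  # zero anchor
    return [(e * n + prior * prior_strength) / (n + prior_strength) for e, n in rows]

# A 20-game niche path and a 2000-game staple: the niche row collapses
# most of the way toward the prior, the staple barely moves.
shrunk = shrink_scores([(0.10, 20), (0.02, 2000)])
```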

4. Attribution window
Issue

The thesis notes that determining when an item's effect ends is difficult. Using win/loss as the outcome absorbs unrelated later events.

Fix

Item-count-adjusted baselines compare each build against games completing the same number of items, isolating the build decision from game length survivorship bias.

5. State representation
Issue

The thesis warns simplified state representations risk omitting important variables. The Riot API data is inherently limited.

Fix

Gold Diff enriches context with gold-at-purchase, timing buckets (ahead/even/behind), matchup damage type, and duration bucketing to reduce omitted-variable bias. Boots now use exact final-purchase timestamps when timeline data has been backfilled; older legacy rows fall back to a fixed minute-12 proxy until they are refreshed.

6. Patch drift
Issue

Item values shift with balance changes and meta evolution. A model trained on Patch 14.15 data is stale by 14.18.

Fix

The scheduler exports current-patch analysis every 6 hours and retrains the WP model weekly. Model evaluation prefers a latest-patch holdout when possible and persists the train/test patch window in metadata, so patch drift is measured explicitly instead of assumed away.

7. Temporal history
Issue

The WP model uses a single game state (Markov assumption). Momentum effects — comeback streaks, snowball runs — are not captured.

Fix

Gold Diff includes per-minute rate features (gold_diff_per_min, cs_diff_per_min) which implicitly encode momentum. Full sequence-aware models (RNN/Transformer) are on the roadmap.
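
One simple reading of such rate features, sketched with an illustrative frame layout (the real feature pipeline is not shown here):

```python
def rate_features(frames):
    """Per-minute rate features from cumulative timeline diffs.

    frames: chronological (minute, gold_diff, cs_diff) snapshots.
    Dividing the cumulative diff by elapsed minutes is one simple rate
    encoding: a team that is even in gold but gaining fast looks
    different from one that is even and bleeding.
    """
    minute, gold_diff, cs_diff = frames[-1]
    return {
        "gold_diff": gold_diff,
        "cs_diff": cs_diff,
        "gold_diff_per_min": gold_diff / minute,
        "cs_diff_per_min": cs_diff / minute,
    }

feats = rate_features([(5, -300, -8), (10, 150, 2), (15, 1200, 14)])
print(feats["gold_diff_per_min"])  # 80.0
```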

8. Publishing threshold
Issue

Without a clear threshold, low-sample build paths can appear as confident recommendations.

Fix

Results below the minimum sample threshold are excluded from the export. All published scores carry a sample quality label (high / moderate / low) and visible game count.

Accuracy checklist

Before any score is published

Calibrated win-probability model (isotonic + ECE tracking)
Latest-patch holdout validation + saved training-window metadata
Tight state filters (champion + tier + role)
Minimum sample threshold enforced
Newcombe-Wilson 95% CIs (accounts for baseline uncertainty)
Hierarchical shrinkage (champion-role prior anchored to zero)
Item-count-adjusted baselines
WP-delta enrichment for items, boots, and build order slots
Model-based eWPA for runes, full rune pages, summoner spells, and skill order when timeline data is available
Sample quality label attached (high / moderate / low)
Metric type labeled honestly (MWPA vs eWPA)
Gold Value Reference

What Boris charges per stat point

These per-unit gold values come from the cheapest basic component that provides each stat in isolation. They are the foundation of every gold efficiency calculation on the site.

Attack Damage: 35g (from Long Sword)
Ability Power: 21.75g (from Amplifying Tome)
Health: 2.67g (from Ruby Crystal)
Mana: 1.4g (from Sapphire Crystal)
Armor: 20g (from Cloth Armor)
Magic Resist: 18g (from Null-Magic Mantle)
Attack Speed: 25g/% (from Dagger)
Critical Strike: 40g/% (from Cloak of Agility)