How Gold Diff turns build choices into MWPA scores.
The site is designed to answer one question cleanly: did this build choice actually help, or did it just show up more often in games that were already won? MWPA answers it by comparing each option to an appropriate contextual baseline rather than to its raw global win rate.
xPetu's MWPA thesis
The Mean Win Probability Added framework used throughout Gold Diff is directly adapted from research by xPetu. The core thesis: raw win rates on items and runes are misleading because they conflate correlation with causation. MWPA isolates the contribution of a single build decision by comparing outcomes against an appropriate contextual baseline, averaged across many games.
Without xPetu's work, most build sites would still be reporting that Mejai's Soulstealer has a 75% win rate and calling it a recommendation. MWPA cuts through that noise.
“An item with 60% win rate might only appear in games that are already won. MWPA asks: did buying this item actually change the probability of winning, or was the game decided before the purchase?”
— Adapted from xPetu's build analysis framework
Item genuinely increased win probability across the sample
Item was associated with lower win probability after accounting for game state
Champion baseline
Every score starts from the champion's own win rate inside the current sample. That keeps naturally strong or weak champions from distorting the read.
Eligible baselines
Later build checkpoints are only compared against games that actually reached them. A third-item option is not measured against one-item stomps.
Composite signal
Items, boots, rune pages, and summoner spells stack into a composite read. The total is directional guidance, not an exact promise of match outcome.
Two lenses on build quality
Gold Diff uses two complementary metrics. MWPA measures real in-game impact on win probability. Gold Efficiency measures raw stat value per gold spent. Together they answer: is this item actually winning games, and is it a good deal from the shop?
When dense timeline states are missing, Gold Diff falls back to item-count-adjusted excess win rate against eligible baselines. That fallback is still debiased, but it is not the same thing as decision-time WP-delta.
Convert each stat on the item into gold using reference items (Long Sword = 35g per point of AD, Amp Tome = 21.75g per point of AP, Ruby Crystal = 2.67g per point of HP, and so on). Sum the stat values, divide by the item's cost, and multiply by 100.
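The arithmetic is simple enough to sketch directly. A minimal Python version using the reference costs quoted above; the 3000g item stats are invented purely for illustration:

```python
# Gold value of one point of each stat, taken from the cheapest basic
# component that grants it (values quoted above).
GOLD_PER_STAT = {"ad": 35.0, "ap": 21.75, "hp": 2.67}

def gold_efficiency(stats, item_cost):
    """GE% = (gold value of countable stats / item cost) * 100."""
    stat_value = sum(GOLD_PER_STAT[stat] * amount for stat, amount in stats.items())
    return 100.0 * stat_value / item_cost

# A hypothetical 3000g item granting 60 AD and 300 HP:
ge = gold_efficiency({"ad": 60, "hp": 300}, 3000)  # ~96.7%
```

Note that, as described above, only the stats present in the lookup table contribute; a passive worth real gold simply never enters the sum.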
Gold efficiency only counts base stats available in Data Dragon (AD, AP, HP, Armor, MR, AS, Crit, MS, Mana). Passive effects, ability haste, lethality, and unique actives are not factored into the GE percentage. An item with 80% GE might still be excellent if its passive is worth the gap. Gold Diff now pairs GE with passive profile notes and purchase-context reads so low-GE items can still be understood without pretending every passive has one exact gold price.
Measures change in probability of winning. Accounts for game state context. Tells you if the item actually helped win games, not just if winners bought it.
Measures stat value per gold spent. Pure economy metric. Tells you if Boris is ripping you off on raw stats, independent of whether those stats win games.
The option outperformed the relevant baseline in the observed sample.
The option underperformed the relevant baseline in the observed sample.
Larger sample sizes deserve more trust. Rare niche paths can still swing hard even after debiasing.
Where raw MWPA breaks down — and how we fix it
The xPetu thesis provides the theoretical foundation, but applying it to real ranked data surfaces practical failure modes. Gold Diff implements corrections for each one. The scoring pipeline is:
Each score starts as excess win rate over the eligible baseline. When timeline data is available, in-game choices (items, boots, build order) are enriched with a calibrated WP-delta at decision time. Pregame choices (runes, full pages, summoner spells, skill order, starting items) use model-predicted eWPA when the early-state baseline exists, and otherwise stay flagged as contextual fallbacks. All scores are then shrunk toward a champion-role prior anchored to a fixed zero global baseline and paired with confidence intervals.
MWPA inherits errors from the win-probability model. If predicted probabilities don't match observed frequencies, every downstream score is biased.
The WP model is wrapped with isotonic calibration (CalibratedClassifierCV) before it is used for WP-delta enrichment on items, boots, and build-order slots, plus early-window eWPA on runes, full rune pages, summoner spells, and skill orders. We track ECE (Expected Calibration Error) before and after, plus time-sliced ECE to catch late-game calibration drift.
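In production this wrapping is done with scikit-learn's CalibratedClassifierCV, but the isotonic step underneath is easy to sketch on its own. A minimal pool-adjacent-violators (PAVA) fit over (raw model score, observed win) pairs, written from scratch here for illustration:

```python
def isotonic_fit(scores, outcomes):
    """Pool adjacent violators: returns (max_score, calibrated_prob) steps
    whose calibrated probabilities are non-decreasing in the raw score."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [max_score_in_block, mean_outcome, weight]
    for score, outcome in pairs:
        blocks.append([score, float(outcome), 1.0])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
            s, m2, w2 = blocks.pop()
            _, m1, w1 = blocks.pop()
            w = w1 + w2
            blocks.append([s, (m1 * w1 + m2 * w2) / w, w])
    return [(s, m) for s, m, _ in blocks]

def calibrate(steps, score):
    """Map a raw score onto the fitted step function."""
    for max_score, prob in steps:
        if score <= max_score:
            return prob
    return steps[-1][1]

# Toy fit: low scores lost, high scores won.
steps = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
```

ECE is then just the gap between these calibrated probabilities and observed win frequencies, averaged over score bins.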
MWPA is a difference of two proportions (build WR minus baseline WR). Naive CIs treat the baseline as fixed, understating true uncertainty.
Gold Diff uses Newcombe-Wilson intervals (Method 10) which account for uncertainty in both the build and baseline win rates. The UI flags wide intervals and marks low-sample results with an amber indicator.
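For reference, method 10 combines the individual Wilson limits of the two proportions rather than treating either as fixed. A self-contained sketch, with z fixed at 1.96 for roughly 95% intervals:

```python
import math

def wilson(k, n, z=1.96):
    """Wilson score interval for a single proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def newcombe_diff(k1, n1, k2, n2, z=1.96):
    """CI for p1 - p2 (build WR minus baseline WR), Newcombe method 10:
    combine the one-sided Wilson gaps of each proportion in quadrature."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, z)
    l2, u2 = wilson(k2, n2, z)
    d = p1 - p2
    lo = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    hi = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lo, hi

# 60/100 build wins vs a 50/100 baseline: point estimate +0.10.
lo, hi = newcombe_diff(60, 100, 50, 100)
```

A wide (lo, hi) band at this sample size is exactly what triggers the amber low-sample indicator in the UI.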
Narrowing the state space (patch + champion + role + matchup + item slot) makes estimates more meaningful, but sample sizes shrink fast.
Two-level shrinkage: each analysis batch estimates a champion-role prior from the games-weighted average raw effect, then shrinks individual rows toward that prior while keeping a fixed zero global anchor. This keeps rare builds honest without collapsing every low-sample estimate straight to zero.
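The shrinkage itself is a standard pseudo-count blend. A sketch of the two levels, assuming a pseudo-count of k=50 (the site's actual weighting is not published here):

```python
def shrink(raw_effect, n_games, prior, k=50.0):
    """Weight the raw effect by its sample size against a prior.
    k is a hypothetical pseudo-count controlling shrinkage strength."""
    w = n_games / (n_games + k)
    return w * raw_effect + (1 - w) * prior

def champion_role_prior(rows, k=50.0):
    """Level 1: games-weighted average raw effect across the batch,
    itself shrunk toward the fixed zero global anchor."""
    total = sum(n for _, n in rows)
    avg = sum(effect * n for effect, n in rows) / total
    return shrink(avg, total, 0.0, k)

# rows are (raw_effect, n_games) pairs for one champion-role batch.
prior = champion_role_prior([(0.10, 100), (0.00, 100)])
row_score = shrink(0.20, 10, prior)  # level 2: a rare 10-game build
```

The rare build lands between the champion-role prior and its raw effect instead of collapsing straight to zero, which is the behavior described above.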
The thesis notes that determining when an item's effect ends is difficult. Using win/loss as the outcome absorbs unrelated later events.
Item-count-adjusted baselines compare each build against games completing the same number of items, isolating the build decision from game length survivorship bias.
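A toy version of that adjustment, with games represented as (items_completed, won) pairs; the data layout is illustrative, not the site's actual schema:

```python
from collections import defaultdict

def item_count_baselines(games):
    """Baseline win rate per completed-item count.
    games: iterable of (items_completed, won) with won in {0, 1}."""
    agg = defaultdict(lambda: [0, 0])  # count -> [wins, games]
    for items, won in games:
        agg[items][0] += won
        agg[items][1] += 1
    return {items: wins / n for items, (wins, n) in agg.items()}

def excess_win_rate(build_games, all_games):
    """Compare a build only against games reaching the same item count,
    so three-item builds are never graded against one-item stomps."""
    base = item_count_baselines(all_games)
    diffs = [won - base[items] for items, won in build_games]
    return sum(diffs) / len(diffs)
```

Usage: if the three-item baseline sits at 75% and a particular build wins both of its three-item games, its excess is +0.25 before shrinkage.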
The thesis warns that simplified state representations risk omitting important variables. The Riot API data is inherently limited.
Gold Diff enriches context with gold-at-purchase, timing buckets (ahead/even/behind), matchup damage type, and duration bucketing to reduce omitted-variable bias. Boots now use exact final-purchase timestamps when timeline data has been backfilled; older legacy rows fall back to a fixed minute-12 proxy until they are refreshed.
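The ahead/even/behind read reduces to a threshold on the purchaser's gold lead at decision time. A sketch with an assumed 1000g cutoff (the real cutoff is not stated here):

```python
def gold_state_bucket(gold_diff, threshold=1000):
    """Bucket game state at purchase time by the player's gold lead.
    threshold is a hypothetical cutoff chosen for illustration."""
    if gold_diff > threshold:
        return "ahead"
    if gold_diff < -threshold:
        return "behind"
    return "even"
```

Bucketing instead of using the raw number keeps each context cell large enough to estimate a baseline from.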
Item values shift with balance changes and meta evolution. A model trained on Patch 14.15 data is stale by 14.18.
The scheduler exports current-patch analysis every 6 hours and retrains the WP model weekly. Model evaluation prefers a latest-patch holdout when possible and persists the train/test patch window in metadata, so patch drift is measured explicitly instead of assumed away.
The WP model predicts from a single game-state snapshot (a Markov assumption). Momentum effects such as comeback streaks and snowball runs are not captured.
Gold Diff includes per-minute rate features (gold_diff_per_min, cs_diff_per_min) which implicitly encode momentum. Full sequence-aware models (RNN/Transformer) are on the roadmap.
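Concretely, the rate features divide cumulative diffs by game minute, so a 2000-gold lead reads differently at minute 10 than at minute 30. Field names below are illustrative, not the site's actual schema:

```python
def rate_features(state):
    """Derive per-minute rates from a single game-state snapshot.
    A steep gold_diff_per_min implicitly encodes momentum even though
    the model only ever sees one state at a time."""
    minute = state["minute"]
    return {
        "gold_diff_per_min": state["gold_diff"] / minute,
        "cs_diff_per_min": state["cs_diff"] / minute,
    }
```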
Without a clear threshold, low-sample build paths can appear as confident recommendations.
Results below the minimum sample threshold are excluded from the export. All published scores carry a sample quality label (high / moderate / low) and visible game count.
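The gating logic amounts to a threshold ladder. The cutoffs below are hypothetical, chosen only to show the shape of the rule, not the site's published thresholds:

```python
def sample_label(n_games, min_n=50, moderate_n=200, high_n=1000):
    """Return a sample quality label, or None for excluded results.
    All thresholds here are assumed values for illustration."""
    if n_games < min_n:
        return None  # below minimum sample: dropped from the export
    if n_games >= high_n:
        return "high"
    return "moderate" if n_games >= moderate_n else "low"
```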
Before any score is published
What Boris charges per stat point
These per-unit gold values come from the cheapest basic component that provides each stat in isolation. They are the foundation of every gold efficiency calculation on the site.