Research

How every weight was decided.

Each player grade is a weighted composite of 2–7 statistical components. Picking those components — and choosing what to leave out — was the hard part. This is what we did, what we rejected, and what we learned.

189+ metrics evaluated across 12 positions

53 in production formulas (28% acceptance rate)

4 screening criteria: reliability · spread · independence · validity

The framework

Four criteria a metric must survive

We score every plausible statistical metric in nflverse data against four criteria. A metric enters a player’s grade only if it makes a convincing case on all four. Together the criteria rule out random noise, stats that fail to distinguish players, redundant signals, and metrics that nobody in the football world actually rewards.

Reliability

Year over year

Does the same player tend to score similarly two years in a row? If a metric jumps around at random, it's measuring noise, not skill.

Technical: Pearson r between a player's value at season t and season t+1, averaged across all qualified player-season pairs.
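A minimal sketch of the year-over-year check, assuming the metric is stored as a mapping from (player, season) to value. The function names and toy numbers are hypothetical, and the sketch pools all consecutive-season pairs into one correlation, which is one reading of "averaged across all qualified player-season pairs".

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

def yoy_reliability(values):
    """values maps (player_id, season) -> metric value for qualified seasons.
    Pairs each season t with the same player's season t+1, then correlates."""
    this_year, next_year = [], []
    for (player, season), value in values.items():
        following = values.get((player, season + 1))
        if following is not None:  # player qualified in both seasons
            this_year.append(value)
            next_year.append(following)
    return pearson_r(this_year, next_year)

# Hypothetical toy data, not real player values.
toy = {
    ("A", 2022): 0.10, ("A", 2023): 0.12, ("A", 2024): 0.11,
    ("B", 2022): 0.30, ("B", 2023): 0.28,
    ("C", 2023): 0.20, ("C", 2024): 0.22,
}
print(round(yoy_reliability(toy), 3))
```

A player with three qualified seasons contributes two pairs here (t to t+1 and t+1 to t+2); whether the production audit counts pairs this way is an assumption.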

Discrimination

Cross-sectional spread

Does the metric actually separate players within a season? If everyone is at 95% on it, it can't differentiate elite from average.

Technical: Standard deviation of the metric within a single season, in z-units. A near-zero std means everyone scores the same.

Independence

Adds new info

If this metric is just a different way of saying what another one already says, including both is double-counting.

Technical: Maximum absolute Pearson r between this candidate's z-scores and every other candidate's z-scores. ≥0.85 = strong redundancy.
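The redundancy check can be sketched as follows, assuming every candidate metric is a list of values aligned on the same player-seasons. Metric names and numbers are hypothetical. Pearson r is unchanged by z-scoring either input, so the sketch correlates raw values directly.

```python
from math import sqrt

REDUNDANCY_THRESHOLD = 0.85  # "strong redundancy" cutoff quoted in the text

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def max_abs_r(candidates):
    """candidates maps metric name -> values aligned on the same player-seasons.
    Returns {name: (max |r|, name of the most-correlated other candidate)}."""
    result = {}
    for name, xs in candidates.items():
        best = (0.0, None)
        for other, ys in candidates.items():
            if other == name:
                continue
            r = abs(pearson_r(xs, ys))
            if r > best[0]:
                best = (r, other)
        result[name] = best
    return result

# Hypothetical toy values for three candidates over five player-seasons.
toy = {
    "target_share": [0.10, 0.15, 0.20, 0.25, 0.30],
    "target_earn":  [0.11, 0.14, 0.21, 0.24, 0.31],
    "drop_rate":    [0.05, 0.02, 0.06, 0.03, 0.04],
}
overlaps = max_abs_r(toy)
redundant = {m for m, (r, _) in overlaps.items() if r >= REDUNDANCY_THRESHOLD}
print(sorted(redundant))
```

The check only flags a redundant pair; deciding which member of the pair to keep is a separate judgment that this sketch does not make.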

Predictive validity

Pro Bowl correlation

Does what this metric measures actually look like "good football" to expert voters? An imperfect proxy, but the best public ground truth.

Technical: Pearson r between the metric value at season t and a 0/1 flag for whether the player made the Pro Bowl in season t+1.
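A sketch of the validity check under stated assumptions: Pearson r against a 0/1 flag is the point-biserial correlation, every player with a metric value in season t is treated as eligible for the season t+1 flag, and the data structures and names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation; with a 0/1 ys this is the point-biserial r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def pro_bowl_validity(metric, pro_bowl):
    """metric: (player, season) -> value.
    pro_bowl: set of (player, season) Pro Bowl selections.
    Correlates the value at season t with the 0/1 flag for season t+1."""
    values, flags = [], []
    for (player, season), value in metric.items():
        values.append(value)
        flags.append(1 if (player, season + 1) in pro_bowl else 0)
    return pearson_r(values, flags)

# Hypothetical toy data: four players, one season of metric values.
metric = {("A", 2023): 0.30, ("B", 2023): 0.25, ("C", 2023): 0.10, ("D", 2023): 0.05}
pro_bowl = {("A", 2024), ("B", 2024)}
print(round(pro_bowl_validity(metric, pro_bowl), 3))
```

Because only a small share of players make the Pro Bowl, the flag is heavily imbalanced in real data, which caps how large this correlation can get even for a good metric.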

A worked example

What the audit looks like for WR

Wide receiver got the largest metric set of any position — 22 statistics tested. Six survived. Below, all 22 grouped by what the audit decided for each.

Shipped in the formula

6 · Passed all four criteria

Metric	Weight	YoY	x-sec	max |r|	Validity
Receiving EPA / target	+0.35	+0.310	+0.270	+0.760	+0.244
YAC over expected	+0.27	+0.408	+0.780	+0.218	+0.156
Target earn rate	+0.15	+0.612	+0.054	+0.347	+0.282
Separation (NGS)	+0.10	+0.521	+0.400	+0.218	+0.039
Success rate	+0.05	+0.272	+0.060	+0.760	+0.205
Drop rate (FTN)	-0.05	+0.124	+0.020	+0.108	-0.087

Overlapped a chosen metric

13 · Either mathematically inside, or correlated > 0.5

Metric	max |r|	Overlaps with
Share of team air yards (NGS)	0.913	air yards share
Target share	0.892	target earn
Yards per target	0.853	rec epa
Intended air yards (NGS)	0.844	air yards per target
WOPR (target + air-yards composite)	0.821	target earn
First-down rate	0.812	success rate
Air yards share	0.737	wopr
YAC (raw, per reception)	0.731	yac over expected
Yards per reception	0.685	yac over expected
Red-zone target share	0.624	target earn
RACR (yards / air yards)	0.601	yac over expected
Air yards / target (NGS)	0.526	target earn
Catch percentage	0.474	drop rate

Failed audit

3 · Below noise floor, anti-skill, or too rare to grade

Metric	YoY	Why excluded
Avg cushion (NGS)	+0.314	Pre-snap CB depth — defensive scheme indicator, not WR skill.
Contested catch %	+0.084	Small samples (10-25 contested targets/year). YoY barely above zero.
Fumble rate	+0.013	Removed in v1.1 — WR fumbles are too rare per season for skill signal. Was at -0.05 in v1.

When the framework caught a flaw

WR: the most validity-driven signal was underweighted

A methodology earns trust only by showing that it changes when the data says it should. Most audit-driven changes are small, which is part of what makes them honest, and this one is representative. WR v1 leaned on Receiving EPA because EPA is the comprehensive value number; the audit surfaced two subtler issues underneath that intuition.

The problem v1 had

v1 weighted Receiving EPA most heavily (0.35) because EPA is the comprehensive value number. Target earn rate sat at 0.10 — treated as a usage marker rather than a skill signal. Success rate was at 0.08 as a secondary consistency metric.

The audit data

Target earn rate was the highest-validity WR metric

Metric	Pro Bowl validity	Year-over-year r	Weight (v1 → v1.3)
Target earn rate	+0.282	+0.612	0.10 → 0.15
Receiving EPA / target	+0.244	+0.310	0.35 (unchanged)
Success rate	+0.205	+0.272	0.08 → 0.05

Two findings the audit surfaced that weren't obvious from intuition: (1) Target earn rate had the highest Pro Bowl validity of any WR metric (+0.282) — the WRs who get targeted most aren't just being used heavily, they're being trusted by QBs because they earn it. The metric was meaningfully underweighted. (2) Success rate had a 0.760 correlation with EPA — mathematically, success rate is the share of targets that produce positive EPA, so the two metrics are partly redundant. Including both at full strength was double-counting.

The fix

Two weight changes, side-by-side

WR v1

Original — EPA-centric, target_earn as usage marker

Metric	Weight	Share of total weight
Receiving EPA / target	+0.35	37%
YAC over expected	+0.27	28%
Separation (NGS)	+0.10	11%
Target earn rate	+0.10	11%
Success rate	+0.08	8%
Drop rate (FTN)	-0.05	5%

WR v1.3

After audit — target_earn elevated, success_rate trimmed

Metric	Weight	Share of total weight
Receiving EPA / target	+0.35	36%
YAC over expected	+0.27	28%
Target earn rate	+0.15	15%
Separation (NGS)	+0.10	10%
Success rate	+0.05	5%
Drop rate (FTN)	-0.05	5%
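The percentages next to each weight appear to be that weight's share of the total absolute weight. A short check under that assumption, using the weights listed for both versions (the dictionary keys are hypothetical identifiers):

```python
# Weights as listed for each WR formula version.
WR_V1 = {
    "rec_epa_per_target": 0.35, "yac_over_expected": 0.27, "separation": 0.10,
    "target_earn_rate": 0.10, "success_rate": 0.08, "drop_rate": -0.05,
}
WR_V1_3 = {
    "rec_epa_per_target": 0.35, "yac_over_expected": 0.27, "target_earn_rate": 0.15,
    "separation": 0.10, "success_rate": 0.05, "drop_rate": -0.05,
}

def weight_shares(weights):
    """Each weight's share of the summed absolute weights, as a whole percent."""
    total = sum(abs(w) for w in weights.values())
    return {name: round(100 * abs(w) / total) for name, w in weights.items()}

print(weight_shares(WR_V1))
print(weight_shares(WR_V1_3))
```

The absolute weights sum to 0.95 in v1 and 0.97 in v1.3, which is why Receiving EPA's share slips from 37% to 36% even though its weight is unchanged.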

The result

Validity moved +0.280 → +0.300 (a ~7% relative gain). Top-5 in 2024 unchanged — Justin Jefferson, Ja'Marr Chase, CeeDee Lamb, Mike Evans, Brandon Aiyuk all stayed put. The change is in the mid-pack ordering: high-volume target earners (e.g. Drake London, Cooper Kupp) moved up modestly; lower-volume efficient WRs moved down modestly.

The takeaway

Most audit-driven changes are small. The WRs at the top were already the right WRs. But the methodology demands consistency: when the audit shows a signal is underweighted or two signals are redundant, you fix it — even if the visible top-of-leaderboard effect is modest. The framework's value isn't always producing dramatic changes; it's making sure the formula's weight distribution matches the data's signal distribution.

The result, position by position

How well does each formula predict Pro Bowl voting?

After every weight is set, we test the composite grade against next-year Pro Bowl selection. It’s an imperfect ground truth — voters have their biases — but it’s the best public expert signal we have. The chart below makes the structural ceiling on each position clear.

Position	Validity	Note
iDL	+0.475	Strongest: interior pressure stats align well with what voters reward.
EDGE	+0.424	Strong: sack and pressure stats track Pro Bowl voting closely.
(unlabeled)	+0.407	Strong: receiving stats align with voter consensus.
WR	+0.300	Moderate: EPA depends partly on QB quality.
(unlabeled)	+0.259	Moderate: rushing share is partly contextual (game script, OL).
(unlabeled)	+0.255	Moderate: INT-driven voter noise.
(unlabeled)	+0.244	Moderate: small Pro Bowl roster, surface stats matter most.
CB	+0.220	High voter noise: CB Pro Bowl voting is heavily INT-driven.
LB	+0.198	Reputation gap: voters reward LB reputation more than box score.
K	+0.153	Stats-vs-reputation gap: only 2 K Pro Bowls/year, noisy voting.
P	+0.122	Lowest: punter Pro Bowl voting is the most reputation-driven.
OL	N/A	No validity gate: there is no "All-Pro OL unit" award. Documented honestly.

Why the bottom of the chart

LB, K, and P sit lowest because their Pro Bowl voting is heavily reputation-driven. Roquan Smith routinely makes the Pro Bowl on box-score numbers that don’t scream elite. Only 2 K and 2 P slots exist per year — the smallest cohorts and noisiest voting on the board.

Why OL is N/A

There’s no “All-Pro OL unit” award. Individual Pro Bowl OL counts per team are too noisy to use as a hard validation gate (some Pro Bowls go to bad-unit veterans on reputation). We documented this honestly rather than ginning up a weak proxy.

The full audit log

What we considered, what we shipped, what we rejected

Articles in the analytics community usually show you the formula and tell you it’s good. Here’s what we did before landing on ours. The funnel below shows how many metrics were evaluated at each position and how many made it into the live grade.

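Position	Evaluated → in formula	In/Out
(unlabeled)	19→3	16%
(unlabeled)	22→7	32%
(unlabeled)	22→6	27%
(unlabeled)	22→6	27%
(unlabeled)	13→2	15%
(unlabeled)	16→4	25%
(unlabeled)	16→6	38%
EDGE	10→5	50%
iDL	10→5	50%
(unlabeled)	19→6	32%
(unlabeled)	10→1	10%
(unlabeled)	10→2	20%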

Total · 189 metrics evaluated · 53 in production · 136 rejected
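The totals can be re-derived from the twelve funnel rows. A quick arithmetic check, with the rows entered in the order the funnel lists them:

```python
# (evaluated, shipped) per position, in the order the funnel lists them.
FUNNEL = [(19, 3), (22, 7), (22, 6), (22, 6), (13, 2), (16, 4),
          (16, 6), (10, 5), (10, 5), (19, 6), (10, 1), (10, 2)]

evaluated = sum(e for e, _ in FUNNEL)
shipped = sum(s for _, s in FUNNEL)
rejected = evaluated - shipped
acceptance_pct = round(100 * shipped / evaluated)

# Per-row acceptance rates, rounded to whole percents.
row_pcts = [round(100 * s / e) for e, s in FUNNEL]

print(evaluated, shipped, rejected, acceptance_pct)  # 189 53 136 28
```

The overall 28% is the pooled rate (53 of 189), not the mean of the per-position rates, so positions with more candidates count for more.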

The rejection log

Selected rejections, by pattern

Every rejected metric has a documented reason. These are the most instructive ones: they show the patterns that recurred across positions.


Position	Candidate	Pattern	YoY	Validity	Reason
K	FG% (0-39 yards)	Anti-skill	-0.135	-0.087	Negative YoY — regression to ceiling, not skill
K	Game-winning FG %	Small sample	n/a	0.000	Pure noise (n=49)
P	Block rate	Noise	-0.046	-0.046	Snap/protection failure, not punter skill
P	EPA per punt	Subsumed	+0.269	+0.163	Doesn't dominate — context-contaminated
S	ADoT allowed	Noise	+0.235	-0.032	Scheme indicator, not skill
OL	False-start rate	Noise	+0.129	skipped	Below YoY noise floor
OL	Rush yards per carry	Subsumed	+0.364	skipped	Mixes OL with RB after-contact value
EDGE	Hit per pressure	Noise	+0.350	-0.038	Counter-intuitive negative validity
iDL	Sack per pressure	Noise	+0.008	+0.069	Pure noise at iDL sample sizes
TE	Separation (NGS)	Noise	+0.510	-0.143	NEGATIVE Pro Bowl validity at TE
WR	Avg cushion (NGS)	Noise	+0.314	-0.018	Defensive scheme indicator
WR	Contested catch %	Small sample	+0.084	+0.034	Small samples kill the signal
QB	Pressure faced rate	Context only	+0.397	+0.012	Captures OL quality, not QB skill
RB	Catch percentage (RB)	Noise	+0.184	-0.012	Removed in v1.1 — noise + redundant
LB	Forced fumbles per snap	Small sample	+0.197	+0.055	Cross-sectional std 0.00 — extremely rare


What we learned

Patterns that compounded across positions

Twelve audits surfaced four recurring lessons. They generalize beyond football grading — any composite-metric system runs into them.

Lesson 01

Isolation beats contamination

“When possible, pick the metric that strips out non-player context.”

FG% over expected (kickers) strips kick difficulty. Net average (punters) strips return-team value. Yards before contact (OL) strips RB after-contact ability. The pattern: when a raw stat mixes player skill with non-player factors, the over-expected or isolated version usually has better year-to-year reliability.

K: FGOE/att over raw FG% · P: net avg over gross avg · OL: YBC/carry over rush yards/carry

Lesson 02

Over-expected isn't always best

“The K lesson didn't fully generalize to P.”

The over-expected approach works when the baseline is well-isolated. FG distance baselines are stable (every 40-yard FG faces the same challenge). Punt EPA depends on the returner, the coverage team, the wind, the field position — non-punter variance dilutes the signal. For punters, the simpler raw rate (net average) won the audit on both YoY and validity.

K v1.1: FGOE dominates · P v1.1: net avg beats EPA per punt

Lesson 03

Document what you DIDN'T ship

“The audit log is the methodology's credibility.”

For each position we tested 10-22 candidates and shipped 2-7. The 100+ rejected candidates are documented with their YoY, validity, and the reason for rejection. A formula is only defensible if the alternatives that didn't make it are visible — "we considered X and here's why we excluded it" beats "we picked these because it felt right."

189+ candidates evaluated, 53 in production formulas

Lesson 04

Methodology has to self-correct

“When the audit catches a flaw, fix the formula.”

The WR rebalance above is one example of the framework correcting a small underweighting. There are bigger ones: iDL v1.2 swapped its primary signal entirely — TFL was 35% of the formula and got cut to 25% when the audit revealed pressure was both more reliable AND more Pro Bowl-validated. K v1 used raw FG%, which actively punished kickers for attempting long FGs; v1.1 replaced the entire formula with FG over expected within hours. P v1 included blocked_rate at small weight on "punter conceptually owns the play" grounds; v1.1 removed it when the audit signal was too weak. The framework only earns trust by showing it changes when the data says it should — at any magnitude, even when the change is unflattering to the original design.

WR v1.3 (rebalance) · iDL v1.2 (primary-signal swap) · K v1 → v1.1 (FGOE) · P v1 → v1.1 (drop blocked_rate)

Team weights

The same framework, applied one level up

Player grades aggregate into team grades through a two-stage formula: snap-weight within each position, then position-weight within each phase. Both stages of weights were derived the same way as the per-position formulas — empirically. Ridge regression of team success against the per-position team grades produces the regression coefficients below; salary-cap allocation is the market-derived second anchor. Shipped weights reconcile both anchors with sample-size humility.
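The aggregation can be sketched as below. The weights are the v1.0 shipped values quoted on this page; the function names, the 0-100 grade scale, and the final step that combines phase grades with the phase weights are assumptions for illustration, not the production code.

```python
# v1.0 shipped weights as quoted on this page.
PHASE_WEIGHTS = {"offense": 0.55, "defense": 0.40, "special_teams": 0.05}
OFFENSE_WEIGHTS = {"QB": 0.45, "OL": 0.25, "WR": 0.13, "RB": 0.09, "TE": 0.08}

def position_grade(players):
    """Stage 1: snap-weighted mean of player grades at one position.
    players is a list of (grade, snaps) tuples."""
    total_snaps = sum(snaps for _, snaps in players)
    if total_snaps == 0:
        raise ValueError("no snaps recorded at this position")
    return sum(grade * snaps for grade, snaps in players) / total_snaps

def phase_grade(position_grades, position_weights):
    """Stage 2: weighted mean of position grades within one phase.
    Dividing by the weight total keeps the result on the grade scale."""
    total = sum(position_weights[p] for p in position_grades)
    return sum(g * position_weights[p] for p, g in position_grades.items()) / total

def team_grade(phase_grades):
    """Assumed final step: combine phase grades using the phase weights."""
    return sum(PHASE_WEIGHTS[p] * g for p, g in phase_grades.items())

# Hypothetical grades: a starter with 900 snaps and a backup with 100.
qb = position_grade([(80, 900), (60, 100)])
offense = phase_grade({"QB": 90, "OL": 70, "WR": 80, "RB": 60, "TE": 50}, OFFENSE_WEIGHTS)
team = team_grade({"offense": 80, "defense": 70, "special_teams": 60})
print(qb, round(offense, 1), round(team, 1))
```

Snap weighting in stage 1 means a backup who played 10% of the snaps moves the position grade by only 10% of the gap between the two players.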

R² = 0.79 phase model fit (point diff ~ off + def + st)

222 team-seasons audited (32 teams × 7 seasons, 2018-2024)

2 empirical anchors (ridge regression + cap allocation)

Phase weights

Offense outweighs defense; ST is the small slice

Regression on three phase grades fits at R² = 0.79 against point differential — a strong signal. The original priors balanced offense and defense 0.45 / 0.45 on principle; the data says modern NFL is offense-tilted, and ST contributes closer to its salary-cap allocation (~2%) than to a gut-feel 10%.

Phase	Prior	Cap %	Reg (PD)	Reg (spread)	v1.0 shipped
Offense	0.45	0.49	0.58	0.64	0.55
Defense	0.45	0.49	0.36	0.34	0.40
S. teams	0.10	0.02	0.06	0.02	0.05
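A minimal standard-library sketch of how a ridge fit of point differential on the three phase grades could be turned into weight-like numbers. The regression columns above sum to 1.00, which suggests normalized coefficients, but the normalization step, the penalty value, and the toy data here are all assumptions rather than the audit's actual procedure.

```python
def solve(a, b):
    """Solve a @ x = b for a small square system (Gaussian elimination
    with partial pivoting). a is a list of rows, b a list of values."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_fit(X, y, lam):
    """Closed-form ridge on mean-centered data: (X'X + lam*I) beta = X'y."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    xtx = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[j] * t for r, t in zip(Xc, yc)) for j in range(p)]
    return solve(xtx, xty)

def to_weights(coefs):
    """Assumed normalization: absolute coefficients rescaled to sum to 1."""
    total = sum(abs(c) for c in coefs)
    return [abs(c) / total for c in coefs]

# Hypothetical phase grades (offense, defense, special teams) per team-season,
# with a target built from known coefficients so the fit can be checked.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [2, 1, 3]]
TRUE = [0.6, 0.3, 0.1]
y = [sum(c * v for c, v in zip(TRUE, row)) for row in X]

unpenalized = ridge_fit(X, y, lam=0.0)   # recovers TRUE on noise-free data
shrunk = ridge_fit(X, y, lam=5.0)        # penalty pulls coefficients toward 0
print([round(w, 2) for w in to_weights(unpenalized)])
```

With a positive penalty the coefficients shrink and correlated predictors share credit more evenly, which is the usual reason to prefer ridge over ordinary least squares when grades are correlated with each other.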

Position weights

Per-position contribution to each phase

Within each phase, position weights determine which positions carry the composite. QB dominates offense, EDGE and CB share top billing on defense, and ST is essentially a 50/50 K/P split. The univariate column (Pearson r against team point diff) helps spot multicollinearity — WR collapses to ~0 multivariate but still correlates strongly on its own.

R² = 0.60

Offense

Position	Prior	v1.0 shipped	Cap	Reg	Univariate r
QB	0.40	0.45	0.28	0.61	+0.74
OL	0.25	0.25	0.43	0.21	+0.51
WR	0.15	0.13	0.18	0.01	+0.52
RB	0.10	0.09	0.05	0.09	+0.42
TE	0.10	0.08	0.06	0.08	+0.49

R² = 0.35

Defense

Position	Prior	v1.0 shipped	Cap	Reg	Univariate r
EDGE	0.25	0.24	0.27	0.23	+0.32
CB	0.25	0.24	0.24	0.26	+0.39
LB	0.20	0.22	0.16	0.25	+0.30
S	0.15	0.20	0.14	0.25	+0.31
iDL	0.15	0.10	0.19	0.01	+0.12

R² = 0.04

Special teams

Position	Prior	v1.0 shipped	Cap	Reg	Univariate r
K	0.55	0.52	0.52	0.51	+0.15
P	0.45	0.48	0.48	0.49	+0.15

Headline findings

What moved from the gut-feel prior

QB is heavier than any gut-feel prior

Regression coefficient 0.61. Univariate r of 0.74 against team point differential, the highest of any position by a wide margin. The salary-cap share (0.28) undersells QB because cap allocation reflects supply scarcity (only 32 starting QBs), not on-field contribution.

Prior 0.40 → v1.0 0.45

iDL is the lightest position

Regression coefficient 0.01. Univariate r of 0.12 — lowest of any position. Trimmed from prior 0.15 to 0.10. (Kept non-zero because the iDL grade itself may under-capture interior pressure; the regression may be confirming a weak grade, not a weak position.)

Prior 0.15 → v1.0 0.10

Offense outweighs defense in modern NFL

Phase regression said 0.58 / 0.36 / 0.06. The prior 0.45 / 0.45 was symmetric on principle; the data says modern football is offense-tilted. Reconciled to 0.55 / 0.40 / 0.05 — substantial move toward regression without going all the way.

Phase: 0.45 / 0.45 / 0.10 → 0.55 / 0.40 / 0.05

WR collapses in multivariate but is real univariately

WR's multivariate weight is 0.01 — but univariate r is 0.52. The regression can't separate WR from QB because the two grades are correlated (good QBs make receivers look better). Held at 0.13 instead of dropping to 0, accepting some double-counting with QB rather than pretending WR doesn't matter.

Prior 0.15 → v1.0 0.13

The honest limitation

Team grades measure per-snap player quality, snap-weighted across a team’s available roster. They do not measure win-loss record. Teams that outperform their efficiency stats (clutch close-game wins) tend to grade lower than their record suggests; teams whose stars missed games to injury tend to grade higher (snaps from healthy stars still count fully). We documented this in ADR-0026 rather than tuning the formula to chase wins — that’s a v2 question.

Each position’s full audit doc lives in the repo under docs/grading/audits/. The corresponding ADRs (architectural decision records) at docs/adr/ have the full rationale, alternatives considered, and known limitations for each formula version.

