Each player grade is a weighted composite of 2–7 statistical components. Picking those components — and choosing what to leave out — was the hard part. This is what we did, what we rejected, and what we learned.
We score every plausible statistical metric in nflverse data against four criteria. A metric enters a player's grade only if it passes all four. Together, the criteria rule out random noise, non-distinguishing stats, redundant signals, and metrics that nobody in the football world actually rewards.
Does the same player tend to score similarly two years in a row? If a metric jumps around at random, it's measuring noise, not skill.
Technical: Pearson r between a player's value at season t and season t+1, averaged across all qualified player-season pairs.
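The year-over-year check can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual code: the `(player, season)` keying and the toy values are assumptions, and it pools all consecutive-season pairs into one correlation, which is one reading of "averaged across all qualified player-season pairs".

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

def yoy_reliability(values):
    """Correlate each qualified season with the same player's next season.

    `values` maps (player, season) -> metric value (hypothetical layout).
    Player-seasons with no qualified follow-up season are dropped.
    """
    pairs = [
        (value, values[(player, season + 1)])
        for (player, season), value in values.items()
        if (player, season + 1) in values
    ]
    this_year, next_year = zip(*pairs)
    return pearson_r(this_year, next_year)

# Hypothetical toy values, not real player data.
toy = {
    ("A", 2022): 0.10, ("A", 2023): 0.12,
    ("B", 2022): 0.30, ("B", 2023): 0.28,
    ("C", 2022): 0.20, ("C", 2023): 0.22,
}
print(round(yoy_reliability(toy), 3))
```

A value near 1 means players keep their relative standing from one season to the next; a value near 0 means the metric reshuffles at random.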
Does the metric actually separate players within a season? If everyone is at 95% on it, it can't differentiate elite from average.
Technical: Standard deviation of the metric within a single season, in z-units. A near-zero std means everyone scores the same.
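A rough sketch of the spread check follows. The text above does not spell out how the "z-units" scaling is applied, so this version simply reports the raw per-season standard deviation averaged over seasons; the data layout is a hypothetical one.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cross_sectional_spread(values):
    """Average within-season standard deviation of a metric.

    `values` maps (player, season) -> metric value (hypothetical layout).
    Seasons with fewer than two players are skipped.
    """
    by_season = defaultdict(list)
    for (_player, season), value in values.items():
        by_season[season].append(value)
    return mean(pstdev(vs) for vs in by_season.values() if len(vs) >= 2)

# Hypothetical toy values: everyone identical means zero spread.
flat = {("A", 2023): 0.95, ("B", 2023): 0.95, ("C", 2023): 0.95}
print(cross_sectional_spread(flat))
```

A result at or near zero is the "everyone is at 95%" case: the metric cannot separate elite from average.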
If this metric is just a different way of saying what another one already says, including both is double-counting.
Technical: Maximum absolute Pearson r between this candidate's z-scores and every other candidate's z-scores. ≥0.85 = strong redundancy.
Does what this metric measures actually look like "good football" to expert voters? An imperfect proxy, but the best public ground truth.
Technical: Pearson r between the metric value at season t and a 0/1 flag for whether the player made the Pro Bowl in season t+1.
Wide receiver got the largest metric set of any position: 22 statistics tested, six survivors. All 22 are listed below, grouped by the audit's decision.
| Metric | Weight | YoY | x-sec | max \|r\| | Validity |
|---|---|---|---|---|---|
| Receiving EPA / target | +0.35 | +0.310 | +0.270 | +0.760 | +0.244 |
| YAC over expected | +0.27 | +0.408 | +0.780 | +0.218 | +0.156 |
| Target earn rate | +0.15 | +0.612 | +0.054 | +0.347 | +0.282 |
| Separation (NGS) | +0.10 | +0.521 | +0.400 | +0.218 | +0.039 |
| Success rate | +0.05 | +0.272 | +0.060 | +0.760 | +0.205 |
| Drop rate (FTN) | -0.05 | +0.124 | +0.020 | +0.108 | -0.087 |
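To make the weight column concrete, here is a minimal sketch of a weighted composite over within-season z-scores. The weights come from the table above; the dictionary keys are hypothetical names, and how the raw composite is mapped onto a displayed grade is not covered here.

```python
from statistics import mean, pstdev

# Weights from the table above; key names are illustrative only.
WR_WEIGHTS = {
    "rec_epa_per_target": 0.35,
    "yac_over_expected": 0.27,
    "target_earn_rate": 0.15,
    "separation": 0.10,
    "success_rate": 0.05,
    "drop_rate": -0.05,  # negative: more drops lowers the composite
}

def zscores(values):
    """Standardize one metric across all players in a single season."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def composite(z_by_metric):
    """Weighted sum of one player's z-scores, keyed like WR_WEIGHTS."""
    return sum(WR_WEIGHTS[name] * z_by_metric[name] for name in WR_WEIGHTS)

# A hypothetical player one standard deviation above average everywhere,
# including drop rate (which counts against him).
one_sigma = {name: 1.0 for name in WR_WEIGHTS}
print(round(composite(one_sigma), 2))
```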
| Metric | max \|r\| | Overlaps with |
|---|---|---|
| Share of team air yards (NGS) | 0.913 | air yards share |
| Target share | 0.892 | target earn |
| Yards per target | 0.853 | rec epa |
| Intended air yards (NGS) | 0.844 | air yards per target |
| WOPR (target + air-yards composite) | 0.821 | target earn |
| First-down rate | 0.812 | success rate |
| Air yards share | 0.737 | wopr |
| YAC (raw, per reception) | 0.731 | yac over expected |
| Yards per reception | 0.685 | yac over expected |
| Red-zone target share | 0.624 | target earn |
| RACR (yards / air yards) | 0.601 | yac over expected |
| Air yards / target (NGS) | 0.526 | target earn |
| Catch percentage | 0.474 | drop rate |
| Metric | YoY | Why excluded |
|---|---|---|
| Avg cushion (NGS) | +0.314 | Pre-snap CB depth — defensive scheme indicator, not WR skill. |
| Contested catch % | +0.084 | Small samples (10-25 contested targets/year). YoY barely above zero. |
| Fumble rate | +0.013 | Removed in v1.1 — WR fumbles are too rare per season for skill signal. Was at -0.05 in v1. |
Methodology only earns trust by showing it changes when the data says it should. Most audit-driven changes are small, and that is what makes them honest. Here's a representative one, the WR rebalance, where the audit caught two subtle things going on underneath the original intuition.
v1 weighted Receiving EPA most heavily (0.35) because EPA is the comprehensive value number. Target earn rate sat at 0.10 — treated as a usage marker rather than a skill signal. Success rate was at 0.08 as a secondary consistency metric.
| Metric | Pro Bowl validity | Year-over-year r | Weight (v1 → v1.3) |
|---|---|---|---|
| Target earn rate | +0.282 | +0.612 | 0.10 → 0.15 |
| Receiving EPA / target | +0.244 | +0.310 | 0.35 (unchanged) |
| Success rate | +0.205 | +0.272 | 0.08 → 0.05 |
Two findings the audit surfaced that weren't obvious from intuition: (1) Target earn rate had the highest Pro Bowl validity of any WR metric (+0.282) — the WRs who get targeted most aren't just being used heavily, they're being trusted by QBs because they earn it. The metric was meaningfully underweighted. (2) Success rate had a 0.760 correlation with EPA — mathematically, success rate is the share of targets that produce positive EPA, so the two metrics are partly redundant. Including both at full strength was double-counting.
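The overlap between the two metrics is easy to see when both are computed from the same per-target EPA values. A small sketch with made-up numbers:

```python
def epa_per_target(epas):
    """Mean EPA across a receiver's targets."""
    return sum(epas) / len(epas)

def success_rate(epas):
    """Share of targets that produced positive EPA."""
    return sum(1 for e in epas if e > 0) / len(epas)

# Hypothetical per-target EPA values for one receiver, not real data.
toy_epas = [1.2, -0.4, 0.3, -0.8, 0.9]
print(round(epa_per_target(toy_epas), 2), success_rate(toy_epas))
```

Both functions read the same list: one averages it, the other counts how often it is positive. That shared input is why the two metrics move together.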
Validity moved +0.280 → +0.300 (a ~7% relative gain). Top-5 in 2024 unchanged — Justin Jefferson, Ja'Marr Chase, CeeDee Lamb, Mike Evans, Brandon Aiyuk all stayed put. The change is in the mid-pack ordering: high-volume target earners (e.g. Drake London, Cooper Kupp) moved up modestly; lower-volume efficient WRs moved down modestly.
Most audit-driven changes are small. The WRs at the top were already the right WRs. But the methodology demands consistency: when the audit shows a signal is underweighted or two signals are redundant, you fix it — even if the visible top-of-leaderboard effect is modest. The framework's value isn't always producing dramatic changes; it's making sure the formula's weight distribution matches the data's signal distribution.
After every weight is set, we test the composite grade against next-year Pro Bowl selection. It’s an imperfect ground truth — voters have their biases — but it’s the best public expert signal we have. The chart below makes the structural ceiling on each position clear.
LB, K, and P sit lowest because their Pro Bowl voting is heavily reputation-driven. Roquan Smith routinely makes the Pro Bowl on box-score numbers that don’t scream elite. Only 2 K and 2 P slots exist per year — the smallest cohorts and noisiest voting on the board.
There’s no “All-Pro OL unit” award. Individual Pro Bowl OL counts per team are too noisy to use as a hard validation gate (some Pro Bowls go to bad-unit veterans on reputation). We documented this honestly rather than ginning up a weak proxy.
Articles in the analytics community usually show you the formula and tell you it's good. Here's what we did before landing on the formula. The funnel below shows how many metrics we evaluated for each position and how many made it into the live grade.
Every rejected metric has a documented reason. These are the most instructive ones: they show the patterns that recurred across positions.
| Position | Candidate | Pattern | YoY | Validity | Reason |
|---|---|---|---|---|---|
| K | FG% (0-39 yards) | Anti-skill | -0.135 | -0.087 | Negative YoY — regression to ceiling, not skill |
| K | Game-winning FG % | Small sample | n/a | 0.000 | Pure noise (n=49) |
| P | Block rate | Noise | -0.046 | -0.046 | Snap/protection failure, not punter skill |
| P | EPA per punt | Subsumed | +0.269 | +0.163 | Doesn't dominate — context-contaminated |
| S | ADoT allowed | Noise | +0.235 | -0.032 | Scheme indicator, not skill |
| OL | False-start rate | Noise | +0.129 | skipped | Below YoY noise floor |
| OL | Rush yards per carry | Subsumed | +0.364 | skipped | Mixes OL with RB after-contact value |
| EDGE | Hit per pressure | Noise | +0.350 | -0.038 | Counter-intuitive negative validity |
| iDL | Sack per pressure | Noise | +0.008 | +0.069 | Pure noise at iDL sample sizes |
| TE | Separation (NGS) | Noise | +0.510 | -0.143 | NEGATIVE Pro Bowl validity at TE |
| WR | Avg cushion (NGS) | Noise | +0.314 | -0.018 | Defensive scheme indicator |
| WR | Contested catch % | Small sample | +0.084 | +0.034 | Small samples kill the signal |
| QB | Pressure faced rate | Context only | +0.397 | +0.012 | Captures OL quality, not QB skill |
| RB | Catch percentage (RB) | Noise | +0.184 | -0.012 | Removed in v1.1 — noise + redundant |
| LB | Forced fumbles per snap | Small sample | +0.197 | +0.055 | Cross-sectional std 0.00 — extremely rare |
Twelve audits surfaced four recurring lessons. They generalize beyond football grading — any composite-metric system runs into them.
FG% over expected (kickers) strips kick difficulty. Net average (punters) strips return-team value. Yards before contact (OL) strips RB after-contact ability. The pattern: when a raw stat mixes player skill with non-player factors, the over-expected or isolated version usually has better year-to-year reliability.
The over-expected approach works when the baseline is well-isolated. FG distance baselines are stable (every 40-yard FG faces the same challenge). Punt EPA depends on the returner, the coverage team, the wind, the field position — non-punter variance dilutes the signal. For punters, the simpler raw rate (net average) won the audit on both YoY and validity.
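An over-expected metric for kickers can be sketched as "made minus the league make rate at that distance". The 10-yard distance buckets, the tuple layout, and the toy kicks below are assumptions for illustration; a production baseline would likely be a smoother model of make probability.

```python
from collections import defaultdict

def bucket_of(distance, width=10):
    """Lower edge of the distance bucket, e.g. 35 -> 30."""
    return (distance // width) * width

def expected_make_rates(kicks, width=10):
    """League-wide make rate per distance bucket.

    `kicks` is a list of (kicker, distance_yards, made) tuples.
    """
    tried, made = defaultdict(int), defaultdict(int)
    for _kicker, distance, good in kicks:
        b = bucket_of(distance, width)
        tried[b] += 1
        made[b] += int(good)
    return {b: made[b] / tried[b] for b in tried}

def fg_over_expected(kicks, kicker, width=10):
    """Average of (made - expected make rate) over one kicker's attempts."""
    rates = expected_make_rates(kicks, width)
    diffs = [
        int(good) - rates[bucket_of(distance, width)]
        for name, distance, good in kicks
        if name == kicker
    ]
    return sum(diffs) / len(diffs)

# Hypothetical toy kicks, not real data.
toy_kicks = [
    ("K1", 35, True), ("K1", 52, True),
    ("K2", 33, True), ("K2", 55, False),
    ("K3", 38, False), ("K3", 51, False),
]
for k in ("K1", "K2", "K3"):
    print(k, round(fg_over_expected(toy_kicks, k), 3))
```

Under raw FG%, a kicker who only attempts short kicks looks better than one trusted with long ones; subtracting the distance baseline removes that advantage.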
For each position we tested 10-22 candidates and shipped 2-7. The 100+ rejected candidates are documented with their YoY, validity, and the reason for rejection. A formula is only defensible if the alternatives that didn't make it are visible — "we considered X and here's why we excluded it" beats "we picked these because it felt right."
The WR rebalance above is one example of the framework correcting a small underweighting. There are bigger ones: iDL v1.2 swapped its primary signal entirely — TFL was 35% of the formula and got cut to 25% when the audit revealed pressure was both more reliable AND more Pro Bowl-validated. K v1 used raw FG%, which actively punished kickers for attempting long FGs; v1.1 replaced the entire formula with FG over expected within hours. P v1 included blocked_rate at small weight on "punter conceptually owns the play" grounds; v1.1 removed it when the audit signal was too weak. The framework only earns trust by showing it changes when the data says it should — at any magnitude, even when the change is unflattering to the original design.
Player grades aggregate into team grades through a two-stage formula: snap-weight within each position, then position-weight within each phase. Both stages of weights were derived the same way as the per-position formulas — empirically. Ridge regression of team success against the per-position team grades produces the regression coefficients below; salary-cap allocation is the market-derived second anchor. Shipped weights reconcile both anchors with sample-size humility.
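The two stages can be sketched as two weighted averages. Function names, the position weights, and the grades below are placeholders, and renormalizing over the positions present is an assumption of this sketch rather than a documented rule.

```python
def position_grade(players):
    """Stage 1: snap-weighted average of player grades at one position.

    `players` is a list of (grade, snaps) tuples.
    """
    total_snaps = sum(snaps for _grade, snaps in players)
    return sum(grade * snaps for grade, snaps in players) / total_snaps

def phase_grade(position_grades, position_weights):
    """Stage 2: position-weighted average within one phase.

    Weights are renormalized over the positions that have a grade.
    """
    total_weight = sum(position_weights[p] for p in position_grades)
    weighted = sum(position_weights[p] * g for p, g in position_grades.items())
    return weighted / total_weight

# Placeholder numbers for illustration only.
qb = position_grade([(80.0, 900), (60.0, 100)])  # starter plus backup
offense = phase_grade({"QB": qb, "WR": 70.0}, {"QB": 0.6, "WR": 0.4})
print(round(qb, 1), round(offense, 1))
```

Snap-weighting in stage 1 means a backup who plays 10% of the snaps moves the position grade by roughly 10% of the gap between him and the starter.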
Regression on three phase grades fits at R² = 0.79 against point differential — a strong signal. The original priors balanced offense and defense 0.45 / 0.45 on principle; the data says modern NFL is offense-tilted, and ST contributes closer to its salary-cap allocation (~2%) than to a gut-feel 10%.
| Phase | Prior | Cap share | Reg (PD) | Reg (spread) | v1.0 shipped |
|---|---|---|---|---|---|
| Offense | 0.45 | 0.49 | 0.58 | 0.64 | 0.55 |
| Defense | 0.45 | 0.49 | 0.36 | 0.34 | 0.40 |
| S. teams | 0.10 | 0.02 | 0.06 | 0.02 | 0.05 |
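With the shipped weights from the last column, a team composite could be computed as a plain weighted sum of the three phase grades. This is a sketch under that linear-combination assumption; the phase grades shown are placeholders.

```python
# Shipped v1.0 phase weights from the table above.
PHASE_WEIGHTS = {"offense": 0.55, "defense": 0.40, "special_teams": 0.05}

def team_grade(phase_grades):
    """Weighted sum of phase grades, keyed like PHASE_WEIGHTS."""
    return sum(PHASE_WEIGHTS[p] * phase_grades[p] for p in PHASE_WEIGHTS)

# Placeholder phase grades, not a real team.
print(round(team_grade({"offense": 80.0, "defense": 70.0, "special_teams": 60.0}), 1))
```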
Within each phase, position weights determine which positions carry the composite. QB dominates offense, EDGE and CB share top billing on defense, and ST is essentially a 50/50 K/P split. The univariate column (Pearson r against team point diff) helps spot multicollinearity — WR collapses to ~0 multivariate but still correlates strongly on its own.
QB: regression coefficient 0.61. Univariate r of 0.74, the highest of any position by a wide margin. The salary cap (0.28) undersells QB because cap allocation reflects supply scarcity (only 32 starting QBs), not on-field contribution.
iDL: regression coefficient 0.01. Univariate r of 0.12, the lowest of any position. Trimmed from the prior 0.15 to 0.10. It is kept non-zero because the iDL grade itself may under-capture interior pressure; the regression may be confirming a weak grade, not a weak position.
Phase regression said 0.58 / 0.36 / 0.06. The prior 0.45 / 0.45 was symmetric on principle; the data says modern football is offense-tilted. Reconciled to 0.55 / 0.40 / 0.05 — substantial move toward regression without going all the way.
WR's multivariate weight is 0.01 — but univariate r is 0.52. The regression can't separate WR from QB because the two grades are correlated (good QBs make receivers look better). Held at 0.13 instead of dropping to 0, accepting some double-counting with QB rather than pretending WR doesn't matter.
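The pattern of a sizeable univariate r next to a near-zero multivariate coefficient can be reproduced on synthetic data. The sketch below fits a two-feature ridge regression in closed form. Everything in it is simulated, and it bakes in the assumption that WR grades add nothing beyond QB quality, which real data need not follow.

```python
import random
from math import sqrt

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson_r(a, b):
    a, b = center(a), center(b)
    return dot(a, b) / sqrt(dot(a, a) * dot(b, b))

def ridge_two_features(x1, x2, y, lam=1.0):
    """Solve (X'X + lam*I) beta = X'y for two centered features."""
    x1, x2, y = center(x1), center(x2), center(y)
    a, b, d = dot(x1, x1) + lam, dot(x1, x2), dot(x2, x2) + lam
    c1, c2 = dot(x1, y), dot(x2, y)
    det = a * d - b * b
    return (d * c1 - b * c2) / det, (a * c2 - b * c1) / det

random.seed(0)
n = 2000
# Synthetic "grades": WR is mostly a noisy copy of QB.
qb = [random.gauss(0, 1) for _ in range(n)]
wr = [0.8 * q + 0.6 * random.gauss(0, 1) for q in qb]
# Synthetic outcome driven by QB only.
point_diff = [q + 0.5 * random.gauss(0, 1) for q in qb]

beta_qb, beta_wr = ridge_two_features(qb, wr, point_diff)
print("univariate r (WR):", round(pearson_r(wr, point_diff), 2))
print("ridge coefficients (QB, WR):", round(beta_qb, 2), round(beta_wr, 2))
```

On its own, WR correlates strongly with the outcome because it tracks QB; once QB is in the model, the WR coefficient has little left to explain.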
Team grades measure per-snap player quality, snap-weighted across a team’s available roster. They do not measure win-loss record. Teams that outperform their efficiency stats (clutch close-game wins) tend to grade lower than their record suggests; teams whose stars missed games to injury tend to grade higher (snaps from healthy stars still count fully). We documented this in ADR-0026 rather than tuning the formula to chase wins — that’s a v2 question.
Each position’s full audit doc lives in the repo under docs/grading/audits/. The corresponding ADRs (architectural decision records) at docs/adr/ have the full rationale, alternatives considered, and known limitations for each formula version.