Design decisions
Architecture decision records (ADRs) — the technical choices behind the project, why we made each one, and what trade-offs we accepted. Append-only and numbered. If you want the consumer-facing explanation of how grades work, see How grades work.
Monorepo with shared `db/` as schema source of truth
- Status: Accepted
- Date: 2026-04-22
Context
This project has two distinct codebases:
- A Python pipeline that ingests data from `nfl_data_py` and writes per-season + career grades to Postgres.
- A Next.js web app that reads from Postgres and renders teams, depth charts, and grades.
Both touch the same database schema. We considered:
- Two repos (pipeline + web), each with its own copy of the schema.
- Monorepo with a shared `db/` directory holding SQL migrations.
- Schema-first ORM (Drizzle in TS, then introspect from Python; or SQLAlchemy in Python with TS clients consuming an OpenAPI spec).
Decision
Monorepo. SQL migrations in `db/migrations/` are the single source of
truth. Both Python and TypeScript follow that schema; neither owns it.
The TS side gets type safety via `nflgrades gen-types`, which introspects
the live DB and emits `web/src/types/db.generated.ts`. The Python side uses
raw SQL + pandas (no ORM models; see ADR 0002).
Consequences
Easier:
- One PR can include a schema change + the pipeline change that uses it + the web change that displays it. No cross-repo coordination.
- New contributors (and AI agents) see the whole system in one tree.
- `docker compose up -d` brings up Postgres with migrations auto-applied, giving both halves a working environment instantly.
Harder:
- Repo grows two ecosystems' worth of tooling (npm + pip). Mitigated by keeping each in its own directory with its own README.
- Can't independently version the two halves. We don't need to.
Explicitly given up:
- Schema-first ORMs (Drizzle, Prisma) where the ORM file generates migrations. They'd push us to TS-first thinking, which is wrong here: the data pipeline is the primary writer and the analyst-friendly layer. See ADR 0002.
Python pipeline as installable package + raw SQL
- Status: Accepted
- Date: 2026-04-22
Context
The Python side has to:
- Pull large DataFrames from `nfl_data_py`
- Compute statistical components and grades on those DataFrames
- Bulk-write results to Postgres
Two architectural questions:
- Loose scripts in `scripts/` versus an installable package with a CLI entry point.
- SQLAlchemy ORM models versus raw SQL + pandas `to_sql`.
Decision
Installable package. `pipeline/` has a `pyproject.toml` defining the
`nfl_grades` package. After `pip install -e ".[dev]"`, the user gets:
- `from nfl_grades.grading import sigmoid` works from anywhere
- `nflgrades` CLI command (defined in `nfl_grades.cli:main`)
- Tests can `import nfl_grades` without path hacks
- The package can be reused from notebooks, CI jobs, and scheduled runs
Raw SQL + pandas. No `Base = declarative_base()`, no `class Player(Base)`.
Pipeline code uses:
- `pandas.read_sql` / `df.to_sql` for bulk reads/writes
- `sqlalchemy.text("...")` + the engine from `nfl_grades.db` for one-off statements
- `nfl_grades.db.session()` context manager for transactional work
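To make the "installable package with a CLI entry point" idea concrete, here is a minimal sketch of what a `main` function behind a console-script entry such as `nflgrades = "nfl_grades.cli:main"` could look like. It assumes `argparse` (the ADR does not say which CLI library is used), and the subcommand bodies are placeholders rather than the real pipeline stages.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(prog="nflgrades")
    sub = parser.add_subparsers(dest="stage", required=True)

    ingest = sub.add_parser("ingest", help="pull source data into typed tables")
    ingest.add_argument("source")
    ingest.add_argument("--seasons", default="")

    sub.add_parser("grade", help="compute season grades")
    return parser


def parse_seasons(raw: str) -> list[int]:
    """Turn '2024,2025' into [2024, 2025]; an empty string yields []."""
    return [int(part) for part in raw.split(",") if part.strip()]


def main(argv: list[str] | None = None) -> int:
    """Dispatch to a stage; the bodies here only print what they would do."""
    args = build_parser().parse_args(argv)
    if args.stage == "ingest":
        seasons = parse_seasons(args.seasons)
        print(f"would ingest {args.source} for seasons {seasons}")
    elif args.stage == "grade":
        print("would run grading")
    return 0
```

Because `main` accepts an explicit `argv`, tests can call `main(["grade"])` directly instead of spawning a subprocess.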
Consequences
Easier:
- The CLI gives us one obvious entry point per stage (`nflgrades ingest`, `nflgrades grade`, etc.) instead of a sprawl of `python scripts/*.py`.
- Bulk DataFrame writes via `to_sql` are 10-100x faster than ORM `add_all` for the row counts we deal with (tens of millions of PBP rows).
- Schema lives in SQL only (ADR 0001). No risk of "ORM model says X, DB says Y" drift.
Harder:
- No automatic relationship traversal (`player.seasons[0].grades`). We don't need it: every analytical query is a SQL JOIN.
- No Alembic auto-generation from models. We use a tiny custom migration runner instead; see ADR 0006.
Explicitly given up:
- ORM ergonomics. We're a data-analysis pipeline, not a CRUD app.
Data tier and `qualified` flag as first-class columns
- Status: Accepted
- Date: 2026-04-22
Context
Our grades come in three data-quality tiers:
- Tier 1 (QB/RB/WR/TE): rich data, full pipeline incl. opponent adjustment
- Tier 2 (CB/S/EDGE): decent data
- Tier 3 (OL/iDL/off-ball LB/ST): proxy stats, directional only
We also have to handle players who fall below minimum-snaps thresholds: their season exists in data but the grade isn't reliable enough to display as if it were.
Options for representing both:
- Compute on read: the web app derives tier from position and `qualified` from a snap-count join.
- First-class columns on `season_grades`: `data_tier SMALLINT` and `qualified BOOLEAN` written by the pipeline, read directly.
- Separate views per tier: `season_grades_tier1`, etc.
Decision
First-class columns on `season_grades`:
- `data_tier SMALLINT NOT NULL CHECK (data_tier BETWEEN 1 AND 3)`
- `qualified BOOLEAN NOT NULL DEFAULT TRUE`
The pipeline sets both at write time. The web app reads them directly and shows a tier badge / "insufficient sample" pill without any joins or recomputation.
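A minimal sketch of how the pipeline might derive both columns at write time. The position codes, the `MIN_SNAPS` threshold of 200, and the function names are all assumptions for illustration; the real mapping may differ (ADR-0016, for example, refines the TE case).

```python
# Position -> tier, following the three tiers listed in the Context above.
# The exact position codes are assumed, not taken from the schema.
TIER_BY_POSITION = {
    "QB": 1, "RB": 1, "WR": 1, "TE": 1,
    "CB": 2, "S": 2, "EDGE": 2,
    "OL": 3, "IDL": 3, "LB": 3, "ST": 3,
}

MIN_SNAPS = 200  # hypothetical minimum-snaps threshold


def data_tier(position: str) -> int:
    """Look up the data-quality tier for a position code."""
    return TIER_BY_POSITION[position.upper()]


def is_qualified(snaps: int, min_snaps: int = MIN_SNAPS) -> bool:
    """A season qualifies when the player reached the snap threshold."""
    return snaps >= min_snaps


def grade_row(position: str, snaps: int, composite_grade: float) -> dict:
    """Build the values written to season_grades for one player-season."""
    return {
        "composite_grade": composite_grade,
        "data_tier": data_tier(position),
        "qualified": is_qualified(snaps),
    }
```

With this shape, the web app only needs `SELECT composite_grade, data_tier, qualified ...` and never repeats the mapping in TypeScript.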
Consequences
Easier:
- One query returns everything the UI needs to render a grade with full context (`SELECT composite_grade, data_tier, qualified ...`).
- Tier-mapping logic lives in one place (the pipeline) and isn't duplicated between Python and TS.
- Filtering ("only show qualified Tier 1 grades") is a trivial WHERE clause with index support.
Harder:
- Changing the tier-mapping rules requires re-running grading to refresh the column. We accept this; tiers don't change often.
- A small amount of data redundancy: tier is implied by position. We accept this for query simplicity.
Explicitly given up:
- Computed-on-read flexibility. If we ever need per-user tier overrides (we won't), we'd have to add them as a separate table.
See also
- ADR-0016: TE `role` and `data_tier_reason` (era + blocking-role merge) written alongside `data_tier` on `season_grades`.
Normalize historical team abbreviations to current
- Status: Accepted
- Date: 2026-04-22
Context
`nfl_data_py` uses the contemporary team abbreviation for each season's
data:
- 2016 Chargers are `SD`, 2017+ are `LAC`
- 2016–2019 Raiders are `OAK`, 2020+ are `LV`
- Pre-2016 Rams are `STL`; from 2016 they're `LA` (some sources use `LAR`)
- A few sources sprinkle in `WSH`, `ARZ`, `BLT`, etc.
If we naively join pbp.posteam = teams.abbr, 2016 Chargers rows silently
drop or fail FK constraints. We have to handle this somewhere.
Options:
- Store historical abbreviations as-is, display them as-of the season. ("In 2016, SD went 5-11" — but they're the Chargers, same franchise.)
- Normalize everything to the current abbreviation at ingestion time via a `team_aliases` lookup table.
- Use `nflverse-team` package mappings at query time.
Decision
Normalize to current abbreviation at ingestion. The team_aliases table
maps every historical abbr (and a few alternate spellings) to the current
team_id. Every current abbr aliases to itself, so the lookup is one
unconditional query.
The UI never displays SD or OAK. A 2016 Chargers depth chart is shown
under "Los Angeles Chargers" with a note that the team relocated.
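The self-aliasing lookup described above can be sketched as follows. This uses an in-memory SQLite database as a stand-in for Postgres, and the column names (`alias`, `team_id`, `abbr`) are assumptions; the authoritative schema lives in `db/migrations/`.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny teams + team_aliases pair with a few known relocations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE teams (
            team_id INTEGER PRIMARY KEY,
            abbr    TEXT UNIQUE NOT NULL
        );
        CREATE TABLE team_aliases (
            alias   TEXT PRIMARY KEY,
            team_id INTEGER NOT NULL REFERENCES teams(team_id)
        );
        INSERT INTO teams VALUES (1, 'LAC'), (2, 'LV'), (3, 'LA');
        -- Every current abbr aliases to itself, so no special case is needed.
        INSERT INTO team_aliases VALUES
            ('LAC', 1), ('SD', 1),
            ('LV', 2),  ('OAK', 2),
            ('LA', 3),  ('STL', 3), ('LAR', 3);
        """
    )
    return conn


def current_abbr(conn: sqlite3.Connection, abbr: str) -> str:
    """Resolve any historical or alternate abbreviation to the current one."""
    row = conn.execute(
        "SELECT t.abbr FROM team_aliases a "
        "JOIN teams t ON t.team_id = a.team_id "
        "WHERE a.alias = ?",
        (abbr,),
    ).fetchone()
    if row is None:
        # Fail loudly rather than silently dropping rows at ingest time.
        raise KeyError(f"unknown team abbreviation: {abbr}")
    return row[0]
```

Since current abbreviations are aliases too, ingest code can run the same query for every row without first checking whether the abbreviation is historical.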
Consequences
Easier:
- All FK relationships work without special-casing historical abbrs.
- Cross-season queries ("show me all Chargers QBs since 2016") return the expected rows without UNIONs or OR clauses.
- Adding a new alias (some future relocation, or a new alternate spelling found in PFR data) is one INSERT.
Harder:
- Historical "purity" lost: a 2016 game line in our DB will say `LAC`, not `SD`. We accept this; the franchise identity matters more than the city-of-record for player grading.
- Need a small chunk of UI copy when showing pre-relocation seasons ("relocated 2017 from San Diego"). Cheap.
Explicitly given up:
- Showing "as the team was named at the time." If we ever build a historical game viewer, we'd surface that there.
Hand-written TS types with codegen guardrail
- Status: Accepted
- Date: 2026-04-22
Context
The DB schema is the source of truth (ADR 0001). The Next.js side needs TypeScript types that match the schema. We considered:
1. Hand-write everything. Simple but rots silently when migrations change.
2. Switch to Drizzle/Prisma, define schema in TS, generate everything. Wrong direction: it would make TS the source of truth.
3. Auto-generate from the live DB with `kanel` / `pg-to-ts`, replace hand-written types entirely.
4. Hand-write the public types, auto-generate the raw row types as a guardrail.
Decision
Option 4. Two layers:
- `web/src/types/db.generated.ts`: auto-generated from `information_schema` by `nflgrades gen-types`. Mirrors raw table shapes one-to-one. Never edited by hand. Committed to the repo so TS compiles without a live DB.
- `web/src/types/index.ts`: hand-written. Imports the generated row types and re-exports them with curated names, narrowed string-literal unions (e.g. `"AFC" | "NFC"` instead of `string`), and view-shaped types for joins and aggregates.
In CI we'll run nflgrades gen-types --check which exits non-zero if the
generated file is stale. That's the guardrail: if you change a migration
without regenerating, CI catches it.
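The staleness check behind `--check` can be sketched like this. The helper names are hypothetical, and the step that actually introspects the database is left out: the freshly generated text is simply passed in as a string.

```python
from pathlib import Path


def is_stale(generated_path: Path, fresh_output: str) -> bool:
    """True when the committed file is missing or differs from fresh output."""
    if not generated_path.exists():
        return True
    return generated_path.read_text(encoding="utf-8") != fresh_output


def check_exit_code(generated_path: Path, fresh_output: str) -> int:
    """Exit code for CI: 0 when the committed file is current, 1 when stale."""
    return 1 if is_stale(generated_path, fresh_output) else 0
```

A CLI wrapper would pass this return value to `sys.exit`, which is what makes the CI job fail when a migration lands without a regenerated types file.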
Consequences
Easier:
- The schema can grow without TS imports breaking: add a column, run `gen-types`, decide whether to expose it in `index.ts`.
- We get string-literal unions (`Conference`, `DataTier`) where the raw Postgres type is just `text` / `smallint`. Better than what any pure generator gives us.
- Reviewers see the type changes in `index.ts` PRs and can reason about the public API surface.
Harder:
- Two files to keep mentally aligned. Mitigated by `index.ts` being short and `db.generated.ts` being mechanical.
- `gen-types` requires a live DB. Acceptable since we have docker-compose.
Explicitly given up:
- Fully automatic types. We're trading a small amount of manual work for the ability to express domain types more precisely than introspection can give us.
Forward-only migrations with `schema_migrations` tracking
- Status: Accepted
- Date: 2026-04-22
Context
We need a migration story. Options:
1. Alembic. Standard for SQLAlchemy projects. Auto-generation from ORM models. We have no ORM models (ADR 0002), so the auto-generation isn't useful.
2. Raw `psql -f` per file, no tracking. Simple but easy to apply the same migration twice or skip one.
3. A tiny custom runner that tracks applied migrations in a `schema_migrations` table and refuses to re-apply or run modified files.
Decision
Option 3. `pipeline/src/nfl_grades/migrate.py` (~80 lines) does:
- Creates `schema_migrations(filename PRIMARY KEY, sha256, applied_at)` if it doesn't exist.
- Lists `db/migrations/*.sql` lexically.
- For each file: skip if applied with matching sha; error if applied with different sha (someone edited an applied migration); apply otherwise.
- Each migration runs in its own transaction.
- Optional `--seeds` flag also runs `db/seeds/*.sql` (idempotent, re-runs every time).
Migrations are forward-only. To fix a bad migration, ship a new one
(0007_fix_bad_constraint.sql).
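The runner's core loop can be sketched as below. This is not the actual `migrate.py`: it uses SQLite in place of Postgres so it runs anywhere, omits the `--seeds` handling, and the function name `apply_migrations` is an assumption.

```python
import hashlib
import sqlite3
from pathlib import Path


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> list[str]:
    """Apply pending *.sql files in lexical order; return the names applied."""
    # The tracking table is created by the runner, not by a migration file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "filename TEXT PRIMARY KEY, "
        "sha256 TEXT NOT NULL, "
        "applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    applied = dict(conn.execute("SELECT filename, sha256 FROM schema_migrations"))

    newly_applied = []
    for path in sorted(migrations_dir.glob("*.sql")):
        sql = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if path.name in applied:
            if applied[path.name] != digest:
                raise RuntimeError(f"{path.name} was edited after being applied")
            continue  # already applied with a matching hash
        try:
            # One explicit transaction per migration file.
            conn.executescript(f"BEGIN;\n{sql}\n")
            conn.execute(
                "INSERT INTO schema_migrations (filename, sha256) VALUES (?, ?)",
                (path.name, digest),
            )
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        newly_applied.append(path.name)
    return newly_applied
```

Running it twice in a row is a no-op the second time, and editing an already-applied file raises instead of silently diverging.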
Consequences
Easier:
- Deploying to Supabase/Neon is `nflgrades migrate`. Same code as local.
- New developers' first command is obvious and safe.
- Sha tracking catches "I edited an applied migration" mistakes loudly instead of silently going out of sync.
Harder:
- No down migrations. Acceptable: in 6 years of running this kind of pipeline, down migrations have almost always been the wrong tool; you ship a forward fix instead.
- No model -> migration auto-generation. We don't want it; we'd rather hand-write SQL and review it.
Explicitly given up:
- Alembic ecosystem (branching, multiple heads, etc.). We have one head and we ship to it. If this ever stops being true, revisit.
Edge cases
- `0001_init.sql` is currently editable because nothing has been applied anywhere yet. The moment it's applied to any environment, it becomes immutable.
- The `schema_migrations` table is not itself in a migration file; the migration runner creates it on first invocation. That's intentional: bootstrapping a tracking table inside a tracked migration is a chicken-and-egg problem we don't need.
Pure-function grading math, DB I/O isolated to `ingest/`
- Status: Accepted
- Date: 2026-04-22
Context
The grading pipeline has many moving parts: empirical Bayes shrinkage, opponent adjustment, z-score within position, inverse-noise composite weighting, sigmoid mapping to 0-100, Kalman smoothing across seasons. We need to be able to:
- Tune parameters interactively in notebooks
- Unit-test math without spinning up Postgres
- Re-run grading on cached/synthetic data
- Compare two grading variants side-by-side without committing one to disk
If grading code calls into the database, all of this gets harder.
Decision
Modules under grading/, career/, components/, and adjust/ are pure
functions. They take pandas DataFrames and return pandas DataFrames. They
must not import from nfl_grades.db or nfl_grades.ingest.
DB I/O lives in two places only:
- `nfl_grades.ingest.*`: reads from `nfl_data_py`, writes to raw tables
- The CLI commands in `nfl_grades.cli`: orchestrate by reading from DB, passing DataFrames to the pure functions, writing results back
Concretely: grading/empirical_bayes.shrink(df, ...) returns a Series. The
CLI does df = pd.read_sql(...); shrunk = shrink(df); df.to_sql(...).
Consequences
Easier:
- Tests for grading math are pure-Python, no fixtures, no test DB. See `pipeline/tests/grading/test_sigmoid.py` for the pattern.
- Notebooks can iterate on math by passing in any DataFrame, including hand-constructed ones for edge cases.
- A future "grade variant comparison" feature is just calling the same pure function with two parameter sets and diffing the outputs.
Harder:
- The CLI is responsible for the orchestration glue. That code is less interesting and less tested. Acceptable; it's mostly two-liners.
Enforcement:
- ADR-only for now. If we get tempted to add a DB call inside `grading/`, the import would be the obvious red flag in code review. If this becomes a recurring problem, add an `import-linter` rule.
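Short of adopting a linter, the import rule could also be checked by a small test. The sketch below is one possible approach using the standard-library `ast` module; the function name is hypothetical and it only inspects absolute imports, so a relative import such as `from ..db import engine` would slip past it.

```python
import ast

# Modules that pure-function packages must not import (per the Decision above).
FORBIDDEN_PREFIXES = ("nfl_grades.db", "nfl_grades.ingest")


def _is_forbidden(name: str) -> bool:
    return any(name == p or name.startswith(p + ".") for p in FORBIDDEN_PREFIXES)


def forbidden_imports(source: str) -> list[str]:
    """Return the forbidden module names imported by the given source text."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if _is_forbidden(node.module):
                candidates = [node.module]
            else:
                # Covers "from nfl_grades import db".
                candidates = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        found.extend(name for name in candidates if _is_forbidden(name))
    return found
```

A test could read every `.py` file under `grading/`, `career/`, `components/`, and `adjust/` and assert that `forbidden_imports` returns an empty list for each.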
Sigmoid grade mapping with k=1.15, z=0->50, z=+2->90
- Status: Accepted
- Date: 2026-04-22
Context
After computing a composite z-score per (player, season, position), we need to map it onto the 0-100 grade scale users see. Options:
- Linear rescale: `grade = 50 + 20*z`, clipped to [0, 100]. Simple, but cliffs at the boundaries and stretches the middle.
- Percentile-based: `grade = 100 * percentile_rank(z)`. Self-rescaling year over year (a "90" never means the same thing twice).
- Sigmoid: `grade = 100 / (1 + exp(-k * (z - z0)))`. Smooth, bounded, monotonic, never rescales.
Decision
Sigmoid with k=1.15 and z0=0. Implementation in
pipeline/src/nfl_grades/grading/sigmoid.py.
Parameters chosen so that:
- z = 0 -> grade = 50
- z = +1 -> grade ~= 76
- z = +2 -> grade ~= 91
- z = -2 -> grade ~= 9
Rough interpretation: a "90" is roughly 2 standard deviations above the positional mean — about the 97th percentile of qualified players.
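The anchor points above follow directly from the formula. Here is a minimal scalar sketch (not the actual `sigmoid.py`, which may be vectorized; the names `sigmoid_grade` and `z_for_grade` are assumptions), together with its inverse for asking which z-score a given grade corresponds to.

```python
import math


def sigmoid_grade(z: float, k: float = 1.15, z0: float = 0.0) -> float:
    """Map a composite z-score to a 0-100 grade."""
    return 100.0 / (1.0 + math.exp(-k * (z - z0)))


def z_for_grade(grade: float, k: float = 1.15, z0: float = 0.0) -> float:
    """Inverse mapping: the z-score that produces a grade strictly in (0, 100)."""
    return z0 + math.log(grade / (100.0 - grade)) / k


if __name__ == "__main__":
    for z in (-2, 0, 1, 2):
        print(z, round(sigmoid_grade(z), 1))
```

With k = 1.15, `z_for_grade(90)` comes out near 1.91, which is where the "roughly 2 standard deviations" reading of a 90 comes from.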
Consequences
Easier:
- Grades are stable across seasons. A 90 in 2018 means roughly the same thing as a 90 in 2024.
- Bounded [0, 100] without clipping artifacts.
- Smooth and monotonic — small z changes produce small grade changes.
- Same mapping works for every position.
Harder:
- Not directly interpretable as a percentile. We address this by storing `percentile` alongside `composite_grade` on `season_grades`.
- Tuning k requires balancing "spread between elite players" (higher k) against "starters cluster near 50" (lower k). 1.15 is the current sweet spot from synthetic-data tuning; will be re-checked once we have real QB grades to eyeball.
Subject to revision:
- This is the v1 default. If face-validity tests after build step 2 say "the top 10 QBs are all 95+ and indistinguishable," we lower k. If they say "Mahomes is 78," we raise k. Document changes by superseding this ADR.
Raw nflverse data cached as parquet; only typed tables in Postgres
- Status: Accepted
- Date: 2026-04-23
Context
Every ingest module pulls a DataFrame from nfl_data_py (play-by-play,
rosters, depth charts, NGS passing/receiving/rushing, weekly snap counts,
schedules) and eventually has to populate our typed tables (players,
player_seasons, depth_charts, stat_components, etc.).
The question: what happens to the raw DataFrame between the network call and the typed insert? Three real options:
- Direct ETL. Pull from `nfl_data_py`, transform in memory, write typed rows. Discard the raw DataFrame.
- Raw tables in Postgres. Persist the raw DataFrame to `raw_pbp`, `raw_rosters`, etc. (text/jsonb-heavy schemas). Transform reads from those raw tables and writes to typed tables.
- Parquet on disk. Cache the raw DataFrame to `pipeline/.cache/raw/{source}/{season}.parquet`. Transform reads from parquet and writes typed rows to Postgres.
Things that matter for our project:
- PBP is large. ~50k rows × 300+ columns per season × 10 seasons is the bulk of our raw data. Most of those columns we never use.
- Iteration speed dominates. Tuning grade weights or the garbage-time filter means re-running transforms many times per session. Re-downloading PBP each time would kill the loop. `nfl_data_py.import_pbp_data([2024])` takes ~30s; across 10 seasons that's 5 minutes per iteration.
- Upstream churn happens. `nfl_data_py` corrects historical data and occasionally renames columns. A snapshot of "what we believed the schema was on date X" is valuable for debugging "why did this player's grade change?"
- Postgres is for the product, not the archive. The web app, indexes, and analytical queries all target typed tables. Mixing 100M+ raw PBP rows in the same DB blows up backups, dump sizes, and query planner headroom.
- Pure-function math (ADR 0007). Transforms take DataFrames in and return DataFrames out. They don't care whether the source was a live API call, a parquet file, or a SQL query.
Decision
Three-layer separation:
- Raw layer: parquet on disk. Every `nfl_data_py` call funnels through a `cache_or_fetch(source, season)` helper that:
  - Returns `pd.read_parquet(...)` if the file exists.
  - Otherwise calls the upstream function, writes the parquet, returns the DataFrame.
  - Path: `pipeline/.cache/raw/{source}/{season}.parquet` (already in `.gitignore`, configurable via `PIPELINE_CACHE_DIR`).
- Manifest: JSON sidecar. `pipeline/.cache/raw/manifest.json` records `{source, season, fetched_at, nfl_data_py_version, row_count, sha256}` per file. Lets us detect upstream churn without re-downloading and surfaces stale caches in `nflgrades validate`.
- Typed layer: Postgres. Only schema-defined tables live in Postgres (`db/migrations/*.sql`). No `raw_*` tables, no `jsonb` columns holding raw payloads.
CLI behavior:
- `nflgrades ingest <source> --seasons 2024,2025` uses the cache by default.
- `nflgrades ingest <source> --refresh` ignores the cache, re-fetches, and rewrites parquet + manifest.
- `nflgrades ingest --refresh-stale` re-fetches anything where the manifest shows the cached `nfl_data_py` version differs from the installed one.
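The selection step behind `--refresh-stale` amounts to a version comparison over manifest entries. A minimal sketch, assuming the manifest is loaded as a list of dicts with the fields named in the Decision; the function name and the in-memory layout are assumptions.

```python
def stale_entries(manifest: list[dict], installed_version: str) -> list[tuple[str, int]]:
    """Return (source, season) pairs fetched with a different client version."""
    return [
        (entry["source"], entry["season"])
        for entry in manifest
        if entry["nfl_data_py_version"] != installed_version
    ]
```

In a real run, the installed version would likely come from `importlib.metadata.version(...)` for the client package; it is passed in explicitly here so the comparison stays easy to test.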
Audit trail in Postgres: the existing pipeline_runs table records
each ingest invocation (stage='ingest:{source}', season, rows_written,
status). The pipeline_runs row says we ingested season X on date Y;
the parquet file holds what we actually saw.
Consequences
Easier:
- Re-running grading on new parameters costs the transform time only: no network, no waiting on `nfl_data_py`.
- Notebooks load raw with one line: `pd.read_parquet(cache_path("pbp", 2024))`.
- Reproducing a historical grade is `git checkout <sha>` + the parquet files; the database can be rebuilt from those two inputs alone.
- Postgres backups stay small (~tens of MB for the typed product) instead of carrying GBs of raw PBP we never query in SQL.
- If we ever need ad-hoc SQL over raw, DuckDB reads the parquet directly (`duckdb.sql("SELECT * FROM 'pipeline/.cache/raw/pbp/2024.parquet'")`). We don't have to commit to that now.
Harder:
- Raw isn't backed up automatically. Acceptable: raw is regenerable from `nfl_data_py` for any season we cover. The cost of a wiped cache is one slow re-ingest, not data loss.
- Two storage systems instead of one. Acceptable: the boundary is obvious. Anything inside `ingest/cache_or_fetch(...)` reads/writes parquet, everything downstream reads from typed Postgres.
- Detecting upstream column renames isn't automatic. The manifest catches fetched-with-different-version; the schema-mapping code in `ingest/` catches renamed-column loudly when it tries to access the missing key. Both are acceptable failure modes: loud and early.
Explicitly given up:
- Raw-in-DB convenience. Some teams like being able to `psql` into a `raw_pbp` table mid-debug. We're a pandas pipeline; you'd open a notebook and `pd.read_parquet` instead. If this ever becomes painful, expose raw via a DuckDB-backed FDW or a thin `raw` schema; don't migrate the primary store.
- Streaming ingest. Parquet is batch-oriented. We have no streaming use case (NFL data lands once a week); revisit if that changes.
Implementation notes (non-binding)
- `cache_or_fetch` lives in `nfl_grades.ingest._cache` and is the only module allowed to import `nfl_data_py`. Every concrete ingester (`ingest/pbp.py`, `ingest/rosters.py`, ...) calls it with its source key.
- The manifest is rewritten atomically (write to `manifest.json.tmp`, rename) so a Ctrl-C mid-update can't corrupt it.
- Parquet uses pyarrow with default compression (snappy). Don't override unless we hit a real size or speed problem.
- Cache invalidation policy: never automatic. Refresh is always an explicit CLI flag. We'd rather work on stale data than silently re-run ingest under a developer.
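The write-then-rename pattern mentioned for the manifest can be sketched as follows. The function name and the list-of-dicts layout are assumptions; the key point is that `os.replace` swaps the file in a single step, so readers see either the old manifest or the new one.

```python
import json
import os
from pathlib import Path


def write_manifest_atomic(manifest_path: Path, entries: list[dict]) -> None:
    """Write the manifest to a sibling .tmp file, then rename it into place."""
    tmp_path = manifest_path.with_name(manifest_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=2, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the swap
    # Atomic when both paths are on the same filesystem.
    os.replace(tmp_path, manifest_path)
```

If the process is interrupted before the rename, the worst case is a leftover `manifest.json.tmp`; the existing `manifest.json` is untouched.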
Use nflreadpy (official nflverse) instead of nfl_data_py
- Status: Accepted
- Date: 2026-04-23
- Supersedes: implicit choice of `nfl_data_py` in earlier scaffolding
Context
The original pipeline scaffolding picked nfl_data_py as the data-source
client (mentioned in data-sources.md, pipeline/README.md, and
pyproject.toml's [ingest] extra). This was the de-facto standard for
Python access to nflverse data for several years.
Two things forced a re-evaluation:
- Python 3.13 incompatibility. `nfl_data_py 0.3.3` (the latest release, shipped in early 2024) caps its dependencies at `numpy<2.0`. Our stack is Python 3.13 with `numpy>=2.1` (which is required for Python 3.13 wheels; there are no `numpy<2` wheels for cp313). `pip install ".[ingest]"` fails with `ResolutionImpossible`.
- `nflreadpy` exists and is the official successor. Released September 2025 by Tan Ho (nflverse maintainer), `nflreadpy` is a Python port of `nflreadr` (the canonical R package for nflverse data). It pulls from the same `nflverse-data` GitHub releases, so the actual data source is identical.
Comparison:
| Aspect | nfl_data_py 0.3.3 | nflreadpy 0.1.5 |
|---|---|---|
| Maintainer | Cooper Adams (community) | Tan Ho (nflverse core team) |
| Last release | Feb 2024 | Nov 2025 (5 releases in 3 months) |
| Python 3.13 | broken (numpy<2 pin) | supported, classifier present |
| DataFrame backend | pandas | polars (with .to_pandas() method) |
| Data source | nflverse-data releases | nflverse-data releases (same) |
| Caching | none | built-in (memory or filesystem) |
| API surface | import_pbp_data, import_seasonal_rosters, ... | load_pbp, load_rosters, ... |
| Coverage | PBP, NGS, rosters, snaps, etc. | PBP, NGS, rosters, snaps, FTN, contracts, draft, injuries, ... (superset) |
The "Beta" status warning on nflreadpy is real but the API mirrors
nflreadr exactly, so the contract is well-defined and the underlying
data files are the same we'd be reading either way.
Decision
Use `nflreadpy` for all nflverse data access. Specifically:
- `pipeline/pyproject.toml` `[ingest]` extra: `nflreadpy>=0.1.5`, `polars>=1.0`, `pyarrow>=18.0`.
- All ingest modules (`ingest/pbp.py`, `ingest/rosters.py`, etc.) call `nflreadpy.load_*` functions.
- The `cache_or_fetch` helper from ADR 0009 wraps `nflreadpy` calls and converts polars → pandas at the boundary so the rest of the pipeline stays pandas-based (we have no reason to rewrite the math layer in polars yet).
- `nflreadpy`'s built-in cache is disabled (`NFLREADPY_CACHE=off`); we control caching ourselves via parquet files + manifest per ADR 0009. Two cache layers would be redundant and the manifest needs the raw network fetch to record correctly.
- Function-name mapping is documented in `docs/data-sources.md` (`import_pbp_data` → `load_pbp`, `import_seasonal_rosters` → `load_rosters`, etc.).
Consequences
Easier:
- Python 3.13 just works. We keep the modern numpy/pandas/scipy stack without downgrading.
- We're tracking the same library as the R-side nflverse community uses, which means R-language docs and examples translate almost directly.
- Active development: bugs and data updates land in weeks, not years.
- Polars is faster than pandas for the kinds of bulk reads ingest does (10-50M PBP rows). Even though we convert to pandas, the read+parse step is faster.
Harder:
- We pull in `polars` (~50MB) and `pyarrow` (~30MB) at the ingest extra. Acceptable: ingest is a power-user/CI workload, not a thin import.
- Polars → pandas conversion at the ingest boundary is one extra `.to_pandas()` call. Effectively free (zero-copy via Arrow when possible).
- "Beta" library risk: API could shift between 0.x releases. Mitigated by pinning a minimum version and keeping the wrapper layer (`_cache`) thin enough that an API change is a one-file fix.
Explicitly given up:
- `nfl_data_py` ecosystem familiarity. Function-name muscle memory needs retraining (`import_pbp_data` → `load_pbp`). Net cost: a doc table.
- Pandas-native reads. We could keep using pandas directly via `pd.read_parquet` on nflverse parquet URLs, but then we'd be re-implementing the discovery/versioning logic that `nflreadpy` already handles. Not worth it.
What this changes in the repo
- `pipeline/pyproject.toml` `[ingest]` extra
- `docs/data-sources.md`: function-name mapping, `nflreadpy` references
- `pipeline/README.md`: replace `nfl_data_py` mentions
- `pipeline/src/nfl_grades/ingest/__init__.py` docstring
- `AGENTS.md`: convention #5 already cites ADR 0009; nothing to change beyond the data-source name
- `docs/adr/0009`: still correct (parquet caching strategy is source-agnostic); leave it alone
What this does NOT change
- The grading methodology, schema, ADRs 0001–0008.
- ADR 0009's three-layer separation. Parquet on disk, manifest sidecar, typed Postgres — all independent of which Python client we use to fetch.
Store a thin `plays` table in Postgres, not the full PBP fat table
- Status: Accepted
- Date: 2026-04-23
Context
The nflverse PBP feed (nflreadpy.load_pbp) returns ~49,500 rows × 372
columns per season. It's the input to every grading formula. ADR-0009
already decided that raw source data lives as Parquet on disk, with only
typed queryable tables in Postgres. The question now is what shape the
Postgres-side plays table takes.
Three options:
1. No plays in Postgres. Grading reads Parquet each run. Web app can never drill into individual plays.
2. Thin plays table: ~40 columns we actually use (identifiers, situation, classification, player attribution, outcomes).
3. Fat plays table: store all 372 columns.
Decision
Option 2. Create a plays table with ~40 curated columns, documented
below. The full 372-column Parquet remains the source of truth on disk
(pipeline/.cache/raw/pbp/<season>.parquet), and any analysis that needs
columns not in the table can re-read the Parquet directly.
Column selection
Columns chosen for one of four reasons:
- Required by the v1 grading formula (QB composite: EPA/db, CPOE, success rate + garbage-time filter).
- Required by likely v1.x grading expansions (RB RYOE context, WR separation context, defensive attribution).
- Required by UI drill-down ("top 10 EPA plays for player X").
- Cheap to keep and likely needed soon (penalty, air_yards, yac).
Everything else — Elias IDs, no_huddle flags, yardline strings, 200+ tracking-derived columns — stays in Parquet only.
Columns
See db/migrations/0003_create_plays.sql for the authoritative schema.
Summary:
| group | columns |
|---|---|
| identifiers (PK) | game_id, play_id |
| game context | season, season_type, week, game_date |
| teams (text abbrs, not FK) | posteam, defteam, home_team, away_team |
| situational | qtr, down, ydstogo, yardline_100, score_differential, game_seconds_remaining, half_seconds_remaining, wp |
| classification | play_type, qb_dropback, pass_attempt, rush_attempt, sack, qb_scramble, qb_spike, qb_kneel, aborted_play, two_point_attempt, penalty |
| player attribution (gsis_id text) | passer_player_id, rusher_player_id, receiver_player_id, sack_player_id, interception_player_id |
| outcomes | yards_gained, epa, wpa, cpoe, success, air_yards, yards_after_catch, complete_pass, incomplete_pass, interception, fumble_lost, pass_touchdown, rush_touchdown, touchdown |
| debugging | play_desc (renamed from nflverse desc to avoid SQL reserved-word friction) |
Total: ~42 columns.
Team and player references: strings, not FKs
- `posteam` / `defteam` stay as `TEXT` (not FK to `teams`). Historical team abbreviations (`STL`, `OAK`, `SD`, `LA` pre-rebrand) already have normalization coverage via `team_aliases`; pushing FK semantics into the plays table would force rewriting team abbrs during ingest and fight against the source.
- `*_player_id` columns store the raw `gsis_id` as `TEXT`. Joining to `players.gsis_id` is one-line SQL. Deferred advantages: we can ingest plays before rosters for that season (hasn't happened yet, but is a real recovery story if rosters breaks), and we don't have to manage FK cascades when a player is deleted.
Indexes
- `(season, season_type)`: partitions most grading queries.
- `(passer_player_id, season)`, `(rusher_player_id, season)`, `(receiver_player_id, season)`: for the "feature extraction" queries that pull one player-season's plays at a time.
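To show how the per-player index lines up with a drill-down query, here is a small sketch on an in-memory SQLite database standing in for Postgres. Only a handful of the curated columns are included, and the index names and helper functions are assumptions; the authoritative DDL is in `db/migrations/0003_create_plays.sql`.

```python
import sqlite3


def make_plays_db() -> sqlite3.Connection:
    """Create a cut-down plays table with two of the indexes listed above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE plays (
            game_id          TEXT    NOT NULL,
            play_id          INTEGER NOT NULL,
            season           INTEGER NOT NULL,
            season_type      TEXT    NOT NULL,
            passer_player_id TEXT,
            epa              REAL,
            play_desc        TEXT,
            PRIMARY KEY (game_id, play_id)
        );
        CREATE INDEX plays_season_type_idx ON plays (season, season_type);
        CREATE INDEX plays_passer_season_idx ON plays (passer_player_id, season);
        """
    )
    return conn


def top_epa_plays(conn: sqlite3.Connection, passer_id: str, season: int, limit: int = 10):
    """Return (game_id, play_id, epa) for a passer's best plays in one season."""
    return conn.execute(
        "SELECT game_id, play_id, epa FROM plays "
        "WHERE passer_player_id = ? AND season = ? AND epa IS NOT NULL "
        "ORDER BY epa DESC LIMIT ?",
        (passer_id, season, limit),
    ).fetchall()
```

The `epa IS NOT NULL` filter is deliberate: Postgres sorts NULLs first under `ORDER BY ... DESC` by default, so without it (or `NULLS LAST`) plays with no EPA would crowd the top of the list.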
Size and storage
- ~50k rows/season. 10 seasons of history = ~500k rows.
- ~40 columns, mostly nullable small numerics + a few text keys.
- Estimated ~80 MB for 10 seasons in Postgres (10x smaller than the Parquet cache, since we're dropping 330 columns).
- Well inside "don't bother partitioning" territory.
Consequences
Easier:
- Grading reads `SELECT ... FROM plays WHERE season=? AND passer_player_id=?` with no pandas overhead.
- UI player pages can show "top 10 EPA plays" with a cheap indexed query.
- New stat components for existing positions are small SQL additions, with no new ingest needed.
Harder:
- Adding a new column we later need means a new migration + a full re-ingest of affected seasons. We accept this: the column list above is conservative and covers the build plan through career grading.
- Two sources of truth for raw PBP (Parquet + Postgres). The Parquet file is canonical; if the Postgres table disagrees we re-ingest.
Explicitly given up:
- Per-play tracking fields (time-to-throw per play, pressure tags) — those live in NGS / FTN, not PBP, and are ingested separately.
- The 300 "everything else" PBP columns — fumble recovery IDs, drive numbers, kicker yards etc. Available via the Parquet cache if needed for ad-hoc analysis.
References
- ADR-0009: Raw data cached as Parquet, typed tables in Postgres.
- `docs/exploration/2026-04-23-pbp.md` (to follow this ADR): probe output that anchored this column selection.
Store NGS as three tables, not one unified fact table
- Status: Accepted
- Date: 2026-04-23
Context
Next Gen Stats (NGS) arrives via nflreadpy.load_nextgen_stats(stat_type=...)
in three flavors:
- passing (29 cols): `avg_time_to_throw`, `aggressiveness`, `completion_percentage_above_expectation` (NGS's CPOE), `avg_air_yards_to_sticks`, plus derived efficiency numbers.
- rushing (22 cols): `rush_yards_over_expected_per_att`, `efficiency`, `percent_attempts_gte_eight_defenders`, `avg_time_to_los`.
- receiving (23 cols): `avg_separation`, `avg_cushion`, `avg_yac_above_expectation`, `percent_share_of_intended_air_yards`.
Column overlap across the three: only the keys
(player_gsis_id, season, season_type, week, team_abbr) and the
"display" fields we drop. Zero substantive stat overlap.
NGS coverage: 2016 → present. Earlier seasons have no NGS data at all.
Options:
1. Three tables: `ngs_passing`, `ngs_rushing`, `ngs_receiving`, each with its native columns.
2. One unified `ngs_stats(player_id, season, week, component_name, value)` EAV table: normalizes across stat types.
3. One wide table with all 29+22+23 columns, most nullable.
Decision
Option 1. Three tables, each holding its source columns verbatim
(minus display dupes like player_first_name). Feature extraction joins
whichever table the position needs.
Rationale#
- Column overlap is zero. An EAV table would force every query to filter by component_name, losing type safety and pushing schema into strings. No analytic win.
- Query shape matches the storage shape. QB grading reads one row per passer from ngs_passing. RB grading reads one row from ngs_rushing. Nothing joins across stat types, so unifying them has no benefit.
- Size is trivial. ~600 QB-season-weeks + ~600 RB-season-weeks + ~1400 WR/TE-season-weeks × 10 seasons × ~74 columns total = well under 100 MB. Three tables don't hurt.
- Rejected Option 3 (wide table): half the row would be nulls for any given position. Ugly, misleading query surface, and no storage advantage over Option 1 once nulls are excluded.
Grain and keys#
Each table: one row per (player, season, season_type, week, team).
- week = 0 is the season summary row (nflverse convention). The grading pipeline reads WHERE week = 0 for per-season metrics.
- week > 0 is preserved for future weekly UI / trend charts.
- season_type is kept because NGS includes postseason rows (weeks 19, 20, 21, 23 on the nflverse week axis).
- team_id is part of the PK because a player traded mid-season gets separate NGS rows per team (including the season-summary row: each team segment gets its own summary).
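The grain can be illustrated with a small in-memory table. This is only a sketch: SQLite stands in for Postgres, the DDL is an assumption reduced to one stat column, and the player id, team ids, and values are made up.

```python
import sqlite3

# SQLite stand-in for the Postgres table; the column list is abbreviated
# and the DDL is assumed, not copied from db/migrations/.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ngs_passing (
        player_gsis_id    TEXT    NOT NULL,
        season            INTEGER NOT NULL,
        season_type       TEXT    NOT NULL,
        week              INTEGER NOT NULL,
        team_id           INTEGER NOT NULL,
        avg_time_to_throw REAL,
        PRIMARY KEY (player_gsis_id, season, season_type, week, team_id)
    )
    """
)

# Made-up rows: one traded passer gets a season-summary row per team.
rows = [
    ("00-0000001", 2024, "REG", 0, 1, 2.71),  # summary, first team
    ("00-0000001", 2024, "REG", 0, 2, 2.95),  # summary, second team
    ("00-0000001", 2024, "REG", 5, 1, 2.60),  # weekly row
]
conn.executemany("INSERT INTO ngs_passing VALUES (?, ?, ?, ?, ?, ?)", rows)

# Per-season metrics read only the week = 0 rows.
summary = conn.execute(
    "SELECT team_id, avg_time_to_throw FROM ngs_passing "
    "WHERE week = 0 AND season = 2024 ORDER BY team_id"
).fetchall()
```

Because team_id is in the key, both summary rows coexist; a grader that wants one number per player has to decide how to combine them.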
Team normalization#
team_abbr in the source is the contemporary abbreviation (LAR,
LAC, LV, etc.). We resolve via team_aliases at ingest time to
get team_id, same as every other ingest. See ADR-0004.
Player mapping#
player_gsis_id in NGS is the nflverse gsis id, which we already use
as the canonical identifier on players.gsis_id. No name matching
required.
Minimum season#
season >= 2016 is enforced in ingest. Earlier seasons have no NGS;
the grading pipeline handles their absence via data_tier (ADR-0003).
What we store#
Every NGS-specific column, verbatim. No pruning — NGS is small and
future formula variants may want max_air_distance or
percent_attempts_gte_eight_defenders even if v1 doesn't.
We drop: player_first_name, player_last_name, player_display_name,
player_short_name, player_position, player_jersey_number. All
already available on players / player_seasons / depth charts.
Consequences#
Good:
- Natural query shape: SELECT * FROM ngs_passing WHERE week=0 AND season=2024.
- Adding new NGS columns (if nflverse exposes them) is a single ALTER TABLE per affected stat type, with no EAV-row-count explosion.
- Type-safe columns in generated TypeScript.
Trade-offs:
- Three ingest code paths (shared via a dispatcher; see ingest/ngs.py).
- Adding a new stat type (hypothetical ngs_defense) is a new migration rather than "just insert rows".
References#
- ADR-0003: data_tier for missing historical coverage
- ADR-0004: team abbr normalization
- docs/exploration/2026-04-23-ngs.md (when populated): schema probe
QB v1 grading formula
- Status: Accepted (v1.1 revision — 2026-05-14)
- Date: 2026-04-23
- Updated: 2026-05-14 (v1.1 success_rate weight lowered — see Revision History)
- Supersedes: None
- Formalizes: docs/grading/qb-v1-proposal.md (strawman approved by the user 2026-04-23)
Context#
First concrete grading formula the pipeline needs to compute. Scope limited to the QB position so we can ship a full vertical slice (ingest → features → grades → UI) and iterate on the formula once real numbers are on screen.
Decision#
Composite#
grade = sigmoid(composite_z)
composite_z = 0.50 * z(shrunk_EPA_per_dropback)
+ 0.25 * z(shrunk_CPOE)
+ 0.25 * z(shrunk_success_rate)
- z() = within-position, within-season standardization.
- sigmoid() = existing grading/sigmoid.py, tuned so z = 0 → 50, z = +2 → ~90, z = -2 → ~10.
- z-score mean/SD computed from qualified QBs only (see below).
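A minimal sketch of the composite and the grade mapping, assuming the sigmoid is a plain logistic whose steepness is chosen so that z = +2 lands on 90. The real grading/sigmoid.py is not reproduced here and may be parameterized differently; STEEPNESS, sigmoid_grade, and composite_z are illustrative names.

```python
import math

# Assumed parameterization: a logistic with z = +2 -> 90 and z = -2 -> 10.
STEEPNESS = math.log(9) / 2

# v1 weights as listed in the composite above.
QB_V1_WEIGHTS = {
    "qb_epa_per_dropback": 0.50,
    "qb_cpoe": 0.25,
    "qb_success_rate": 0.25,
}


def sigmoid_grade(z: float) -> float:
    """Map a composite z-score onto the 0-100 grade scale."""
    return 100.0 / (1.0 + math.exp(-STEEPNESS * z))


def composite_z(component_z: dict[str, float]) -> float:
    """Weighted sum of the shrunk, standardized components."""
    return sum(w * component_z[name] for name, w in QB_V1_WEIGHTS.items())
```

A passer who is one standard deviation above the qualified mean on every component has composite_z = 1.0, which this logistic maps to a grade of 75.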
Per-component definitions (before shrinkage)#
| Component | Raw value | Sample space |
|---|---|---|
| qb_epa_per_dropback | mean of plays.epa | dropbacks (post-filter) |
| qb_cpoe | mean of plays.cpoe | pass attempts only (CPOE is null on sacks/scrambles) |
| qb_success_rate | mean of plays.success | dropbacks (post-filter) |
Filter#
A play counts toward the grade iff ALL:
plays.season_type = 'REG'
plays.qb_dropback = TRUE
plays.aborted_play = FALSE
plays.two_point_attempt = FALSE
NOT garbage_time
Garbage-time (ADR-0013 formalizes the proposal's rule):
garbage_time =
(qtr >= 4 AND ABS(score_differential) > 21)
OR (qtr = 4 AND game_seconds_remaining < 300
AND ABS(score_differential) > 14)
Chosen over the wp < 0.05 OR wp > 0.95 convention because nflverse
WP is aggressive about locking in late-game outcomes, and we'd rather
err on the side of keeping a legitimate play than dropping one.
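The rule translates directly into a predicate. A sketch in Python, with the function name chosen for illustration; the argument names mirror the plays columns used above.

```python
def is_garbage_time(
    qtr: int, score_differential: int, game_seconds_remaining: int
) -> bool:
    """Return True when a play falls under the garbage-time rule."""
    margin = abs(score_differential)
    # Fourth quarter or overtime with a margin above three touchdowns.
    blowout = qtr >= 4 and margin > 21
    # Final five minutes of the fourth quarter with a margin above 14.
    late_two_score = qtr == 4 and game_seconds_remaining < 300 and margin > 14
    return blowout or late_two_score
```

Note the asymmetry between the clauses: the first uses qtr >= 4 and so also covers overtime, while the second applies only to the fourth quarter itself.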
Empirical Bayes shrinkage#
Per component, before z-scoring:
shrunk = (n * raw + k * mu_league) / (n + k)
- n = sample size for that component (dropbacks for EPA and success rate; pass attempts for CPOE, which is null on sacks/scrambles, so we use only the plays where it's defined).
- mu_league = league mean of the raw component among all QBs, weighted by their sample size (volume-weighted, not simple average).
- k = shrinkage strength:
  - k = 150 for EPA/db and success rate
  - k = 100 for CPOE (lower variance, less shrinkage needed)
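A worked sketch of the shrinkage step, using three made-up QBs. The function names are illustrative, and the raw EPA values and dropback counts are invented for the arithmetic.

```python
def volume_weighted_mean(raws: list[float], ns: list[int]) -> float:
    """League mean weighted by each player's sample size."""
    return sum(r * n for r, n in zip(raws, ns)) / sum(ns)


def shrink(raw: float, n: int, mu_league: float, k: float) -> float:
    """Empirical Bayes shrinkage: (n * raw + k * mu_league) / (n + k)."""
    return (n * raw + k * mu_league) / (n + k)


# Hypothetical EPA per dropback and dropback counts for three passers.
raw_epa = [0.20, 0.05, -0.10]
dropbacks = [600, 450, 150]

mu = volume_weighted_mean(raw_epa, dropbacks)   # 127.5 / 1200 = 0.10625
backup = shrink(-0.10, 150, mu, k=150)          # n == k: halfway to mu
starter = shrink(0.20, 600, mu, k=150)          # keeps 80% of own signal
```

With n equal to k the shrunk value sits exactly halfway between the raw value and the league mean; the 600-dropback starter keeps 600 / 750 = 80% of the distance from the mean.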
Qualified threshold#
- qualified = TRUE iff n_dropbacks >= 200 in the regular season.
- All QBs with any dropbacks get a row in season_grades, but those below the threshold have qualified = FALSE so the UI can de-emphasize them.
- Unqualified QBs still get shrunk / z-scored so their grade is on the same 0–100 scale.
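The split between "who defines the scale" and "who gets scored" can be sketched as follows. This assumes a population standard deviation; the source does not say whether the pipeline uses population or sample SD, and the function name and example values are illustrative.

```python
import statistics


def z_scores(shrunk: dict[str, float], qualified: dict[str, bool]) -> dict[str, float]:
    """Standardize every player against the qualified players' mean and SD."""
    reference = [value for player, value in shrunk.items() if qualified[player]]
    mu = statistics.fmean(reference)
    sd = statistics.pstdev(reference)  # assumption: population SD
    return {player: (value - mu) / sd for player, value in shrunk.items()}


# Made-up shrunk EPA values; "d" is below the 200-dropback threshold.
shrunk_epa = {"a": 0.10, "b": 0.20, "c": 0.30, "d": -0.10}
is_qualified = {"a": True, "b": True, "c": True, "d": False}
z = z_scores(shrunk_epa, is_qualified)
```

Player "d" does not influence the mean or SD but still receives a z-score on the same scale, which is what keeps unqualified grades comparable.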
Position assignment#
A player grades as QB iff player_seasons.position_played = 'QB' for that
season. A player who appears at multiple positions is graded only at the
positions they actually occupied. Non-QBs with a passing play
(e.g. wildcat RB throws) don't get a QB grade, because the QB feature query
joins against player_seasons.position_played.
What opponent adjustment?#
None for v1. Deferred. The composite runs off raw EPA, no defense-strength normalization. Revisit in v2 once face-validity feedback shows whether it's missing.
Confidence#
season_grades.confidence is set to min(1, n_dropbacks / 300).
Rough proxy — 300 dropbacks is roughly half a full-season starter's
workload; anyone at/above that gets confidence = 1.
Data tier#
Per ADR-0003:
- 2016+: tier 1 (full PBP + NGS available)
- 2006–2015: tier 2 (PBP available, no NGS — not relevant to the v1 formula since we don't use NGS)
- pre-2006: tier 3 (no EPA model — cannot grade with this formula)
For now we only compute grades for seasons that have PBP ingested. The
data_tier column on season_grades records which tier the grade
belongs to.
Consequences#
Testability: Each stage (filter, shrinkage, z-score, composite, sigmoid) is a pure function on a DataFrame. Component tests verify the math, integration tests verify the top-10 list looks sane.
Iteration: If we decide CPOE is overweighted, that's a single
coefficient change in grading/qb.py. If we want to add opponent
adjustment later, it's a new column on stat_components — no schema
change. The formula is a library, not an API.
Superseded when: We add NGS-derived components (time-to-throw, aggressiveness), opponent adjustment, or a defensibly-tuned inverse-variance weighting. Those go in v2 and get their own ADR.
References#
- docs/grading/qb-v1-proposal.md: the strawman the user approved
- ADR-0003: data tiering for missing historical coverage
- ADR-0007: originally sketched inverse-noise weighting; v1 skips this intentionally for explainability
- db/migrations/0001_init.sql: stat_components and season_grades tables (pre-existing)
Revision History#
v1.1 (2026-05-14) — qb_success_rate weight lowered (correlation finding)#
Lowered qb_success_rate from +0.25 to +0.10. Kept EPA at +0.50, CPOE at +0.25. Sum |w| drops from 1.00 to 0.85; the combiner normalizes by that sum, so EPA's effective share grows from 50% to 59% of the formula, CPOE's grows from 25% to 29%, and success rate falls from 25% to 12%.
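The share arithmetic for this revision can be checked directly. A minimal sketch, assuming the combiner divides each |weight| by the sum of |weights|; the helper name effective_shares is hypothetical.

```python
def effective_shares(weights: dict[str, float]) -> dict[str, float]:
    """Each component's share of the composite after normalizing by sum |w|."""
    total = sum(abs(w) for w in weights.values())
    return {name: abs(w) / total for name, w in weights.items()}


QB_V1_1_WEIGHTS = {
    "qb_epa_per_dropback": 0.50,
    "qb_cpoe": 0.25,
    "qb_success_rate": 0.10,
}

shares = effective_shares(QB_V1_1_WEIGHTS)
percent = {name: round(100 * share) for name, share in shares.items()}
```

Taking absolute values matters for graders with a negatively signed component (such as a fumble rate), where the signed sum would understate the total weight.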
Why: Two independent audits converged on this finding.
- Cross-position pairwise correlation audit (2026-05-14): found qb_epa_per_dropback ↔ qb_success_rate at Pearson r = +0.883, the strongest redundancy in the entire grader system. The two are mathematically related (success_rate ≈ "fraction of plays with positive EPA"; EPA per dropback = mean EPA). See docs/grading/audits/2026-05-14-correlation.md.
- QB exhaustive candidate audit (2026-05-14): scored every plausible QB candidate stat (19 candidates from nflverse / NGS / PFR) against four criteria: reliability (YoY r), cross-sectional discrimination, independence from existing components, and downstream predictive validity (next-year Pro Bowl). Confirmed:
  - success_rate has the lowest validity of the three current components (Pro Bowl r = +0.130 vs EPA +0.158 and CPOE +0.146).
  - success_rate is the most redundant (+0.848 with EPA, +0.726 with CPOE).
  - No new candidates emerged as compelling adds: TD rate had highest validity (+0.260) but +0.729 with EPA; INT rate was noise; NGS metrics were style markers with weak validity; PFR bad_throw_pct duplicated CPOE; OL-conflated metrics (sack_rate, pressure_rate_faced) couldn't be cleanly attributed.

  Full audit log: docs/grading/audits/2026-05-14-exhaustive-qb.md. Article-defensible: every candidate has a documented verdict.
Validity check after re-grade: QB composite vs next-year Pro Bowl correlation improved from +0.237 to +0.244 across 2017-2023. The change is real signal, not just a redistribution.
Face-check 2024: top 5 unchanged (Lamar Jackson, Jared Goff, Joe Burrow, Tua Tagovailoa, Josh Allen). Biggest movers up are explosive-but-inconsistent QBs (Jalen Hurts +2.46, Derek Carr +3.37, Justin Herbert +2.37); biggest movers down are clean-but-unexplosive operators (Matthew Stafford −2.91, Kirk Cousins −2.39, Kyler Murray −2.72). Coherent — the formula now leans further into explosive-play EPA over consistency, matching the redundancy structure success_rate was double-counting.
Known limitation surfaced (not addressed in v1.1): the audit found qb_rush_epa_per_rush to be an independent signal (YoY 0.398, max_r −0.08 with existing) measuring mobile-QB value — a real skill not in v1 (Lamar/Allen/Hurts get incomplete credit). Weak Pro Bowl validity (+0.04) and per-rush attribution issues (scramble vs designed-rush) kept it out for now. Documented for revisit in a future revision once we have better data.
Tooling: shipped via the new preview/regrade workflow (docs/grading/iteration-workflow.md). End-to-end ~2 minutes mechanical work.
RB v1 grading formula
- Status: Accepted (v1.4 revision — 2026-05-14)
- Date: 2026-04-24
- Updated: 2026-05-14 (v1.1 catch_pct removed, v1.2 rec_EPA weight lowered, v1.3 rush_success_rate weight lowered, v1.4 yards_after_contact added; see "Revision History")
- Supersedes: None
- Companion to: ADR-0013 (QB v1). Same pipeline shape, different components and per-skill sample sizes.
Context#
Second concrete grading formula. QB v1 shipped (ADR-0013); we're extending the same architecture (extract → shrink → z → composite → sigmoid) to RB. Two things make RB harder than QB:
1. Role variation is huge. Derrick Henry (280 carries / 20 targets) and Christian McCaffrey (220 / 100) are both elite by very different profiles. A naive single-composite formula would wrongly penalize a thumper's "bad" receiving or reward a pass-catching back's "easy" rushing.
2. Raw RB stats reflect a lot of non-RB stuff (OL quality, box counts, play-action, scheme). NGS RYOE and YAC-over-expected already try to strip this out, so they deserve meaningful weight.
We choose to handle (1) with usage-aware empirical Bayes
shrinkage — a pure thumper's receiving components shrink hard
toward the league mean (because their n_targets is small relative
to the shrinkage k) and contribute close to zero to the composite.
No explicit role detection is needed.
Decision#
Composite#
grade = sigmoid(composite_z)
composite_z = 0.28 * z(shrunk_ryoe_per_attempt)
+ 0.18 * z(shrunk_rush_epa_per_attempt)
+ 0.14 * z(shrunk_rush_success_rate)
+ 0.18 * z(shrunk_rec_epa_per_target)
+ 0.12 * z(shrunk_yac_over_expected_per_rec)
+ 0.05 * z(shrunk_catch_pct)
- 0.05 * z(shrunk_fumble_rate)
Rush 60% / Rec 35% / Security 5%. Fumble rate enters with a negative sign (fumbles are bad).
- z() = within-position, within-season standardization (same helper as QB; mean/SD computed from qualified RBs only).
- sigmoid() = existing grading/sigmoid.py, tuned so z = 0 → 50, z = +2 → ~90.
Per-component definitions (before shrinkage)#
| Component | Raw value | Sample (n) | Source |
|---|---|---|---|
| rb_ryoe_per_attempt | NGS rush_yards_over_expected_per_att | carries | ngs_rushing (week=0) |
| rb_rush_epa_per_attempt | mean of plays.epa on rushes | carries | plays |
| rb_rush_success_rate | mean of plays.success on rushes | carries | plays |
| rb_rec_epa_per_target | mean of plays.epa on targets | targets | plays |
| rb_yac_over_expected_per_rec | mean of plays.yards_after_catch - plays.xyac_mean_yardage on completions | receptions scored by xYAC model (n_rec_with_xyac) | plays (nflfastR xYAC) |
| rb_catch_pct | n_receptions / n_targets (from plays, filter-matched) | targets | plays |
| rb_fumble_rate | fumble rate per touch (any fumble by ball carrier) | total touches | plays |
Pre-adjusted flag: rb_ryoe_per_attempt and
rb_yac_over_expected_per_rec are already context-adjusted by their
upstream models (NGS's RYOE model and nflfastR's xYAC model
respectively). When opponent adjustment lands in v2, these two
components must be flagged so we don't double-adjust.
Catch-% source: NGS's load_nextgen_stats("receiving") only
publishes rows for WR/TE — RBs are never included regardless of
target volume. For v1 we derive catch % directly from plays:
n_receptions / n_targets, with the same garbage-time / 2-pt
filter as the rest of the receiving components. No expected-catch
baseline is applied (none is available); we accept the limitation
because RB target diet is relatively uniform (mostly short routes)
and the component's weight is only 5%.
YAC-over-expected source: same NGS-receiving RB gap — we
instead use nflfastR's xyac_mean_yardage column published on
every completion in plays. For each RB reception with a non-null
xyac_mean_yardage, the residual yards_after_catch - xyac_mean_yardage is the RB's YAC over expected on that play. We
average across the RB's receptions (filter matches the rest of the
receiving components). Coverage on RB completions in the modern
era is >99% (≈0.9% null in 2024), so sample size effectively equals
n_receptions. See v1.1 refinement section below.
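A sketch of the per-reception residual average described above. The function name is illustrative and the input is simplified to (yards_after_catch, xyac_mean_yardage) pairs; the production query runs against plays.

```python
import math
from typing import Optional


def yac_over_expected_per_rec(
    receptions: list[tuple[float, Optional[float]]],
) -> tuple[float, int]:
    """Return (mean YAC over expected, n_rec_with_xyac).

    Receptions with a null xyac_mean_yardage are skipped, so the sample
    size counts only plays the xYAC model actually scored.
    """
    residuals = [yac - xyac for yac, xyac in receptions if xyac is not None]
    if not residuals:
        return math.nan, 0
    return sum(residuals) / len(residuals), len(residuals)


# Made-up receptions: the third has no xYAC value and is ignored.
value, n = yac_over_expected_per_rec([(8.0, 5.5), (2.0, 4.0), (6.0, None)])
```

Returning NaN with n = 0 for a back with no scored receptions lines up with the missing-data policy: the component is later neutralized rather than treated as zero yards.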
Fumble rate: computed from plays.fumble (any fumble by the
ball carrier, not just ones recovered by the defense). Counted on
both rushing and receiving plays within the same per-skill filters
as the production metrics. See v1.1 refinement section below.
Filter#
A rushing play counts toward the rushing components iff ALL:
plays.season_type = 'REG'
plays.rush_attempt = TRUE
plays.rusher_player_id IS NOT NULL
plays.qb_kneel IS NULL OR plays.qb_kneel = FALSE
plays.qb_scramble IS NULL OR plays.qb_scramble = FALSE -- scrambles aren't RB production
plays.two_point_attempt IS NULL OR plays.two_point_attempt = FALSE
NOT garbage_time
A receiving play counts toward the receiving components iff ALL:
plays.season_type = 'REG'
plays.pass_attempt = TRUE
plays.receiver_player_id IS NOT NULL
plays.two_point_attempt IS NULL OR plays.two_point_attempt = FALSE
NOT garbage_time
Garbage-time rule is identical to ADR-0013:
garbage_time =
(qtr >= 4 AND ABS(score_differential) > 21)
OR (qtr = 4 AND game_seconds_remaining < 300
AND ABS(score_differential) > 14)
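The rushing filter has one subtlety worth spelling out: the three exclusion flags are nullable, so a NULL must pass the same way FALSE does. A sketch over plain dict rows, with hypothetical function names; garbage_time is assumed to be computed separately by the rule above.

```python
def is_false_or_null(value) -> bool:
    """Mirror `col IS NULL OR col = FALSE` for a nullable boolean."""
    return value is None or value is False


def counts_as_rb_rush(play: dict, garbage_time: bool) -> bool:
    """Apply the rushing-play filter to one play row."""
    return (
        play["season_type"] == "REG"
        and play["rush_attempt"] is True
        and play["rusher_player_id"] is not None
        and is_false_or_null(play.get("qb_kneel"))
        and is_false_or_null(play.get("qb_scramble"))
        and is_false_or_null(play.get("two_point_attempt"))
        and not garbage_time
    )


# Made-up play row with two of the nullable flags missing.
example_play = {
    "season_type": "REG",
    "rush_attempt": True,
    "rusher_player_id": "00-0000001",
    "qb_kneel": None,
    "qb_scramble": False,
    "two_point_attempt": None,
}
```

A bare `qb_kneel = FALSE` comparison in SQL would silently drop rows where the flag is NULL, which is why the filter spells out both branches.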
Position assignment#
A player grades as RB iff players.position = 'RB'. We grade from
the master players table (not player_seasons.position_played) so
that a rookie who changed teams mid-season still gets one grade per
player, not one per team stint.
Non-RBs with rushes (scrambling QBs, WR jet-sweepers, gadget TEs)
don't get an RB grade — the feature query joins on
players.position = 'RB'.
Empirical Bayes shrinkage#
Per component, before z-scoring:
shrunk = (n * raw + k * mu_league) / (n + k)
where mu_league is the volume-weighted RB league mean (summed over
qualified and unqualified RBs, same convention as QB v1).
k per component (picked so n == k means "half-shrunk toward
league mean"):
| Component | n column | k |
|---|---|---|
| rb_ryoe_per_attempt | carries | 100 |
| rb_rush_epa_per_attempt | carries | 100 |
| rb_rush_success_rate | carries | 100 |
| rb_rec_epa_per_target | targets | 40 |
| rb_yac_over_expected_per_rec | receptions scored by xYAC (n_rec_with_xyac) | 30 |
| rb_catch_pct | targets | 40 |
| rb_fumble_rate | total touches | 200 |
The large k on fumble rate is deliberate — fumble rate (even with
the recovery coin-flip removed by switching from fumble_lost to
fumble) still has weak year-over-year reliability (~r=0.1-0.2),
so we shrink hard.
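To see how usage-aware shrinkage plays out, it helps to look at the fraction of a player's own signal that survives, n / (n + k). A sketch using the k values from the table and the Henry-style workload quoted in the Context section (280 carries, 20 targets); the helper name is illustrative.

```python
# Shrinkage strengths from the table above.
RB_K = {
    "rb_ryoe_per_attempt": 100,
    "rb_rush_epa_per_attempt": 100,
    "rb_rush_success_rate": 100,
    "rb_rec_epa_per_target": 40,
    "rb_yac_over_expected_per_rec": 30,
    "rb_catch_pct": 40,
    "rb_fumble_rate": 200,
}


def own_data_weight(n: int, k: int) -> float:
    """Fraction of the shrunk value that comes from the player's own data."""
    return n / (n + k)


# A thumper profile: heavy rushing volume, very few targets.
rushing_kept = own_data_weight(280, RB_K["rb_ryoe_per_attempt"])     # 280 / 380
receiving_kept = own_data_weight(20, RB_K["rb_rec_epa_per_target"])  # 20 / 60
```

The rushing components keep roughly three quarters of their own signal while the receiving components keep a third, so a poor receiving line on 20 targets moves the composite far less than the same line on 100 targets would.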
Handling missing data#
Some RBs are below NGS's volume thresholds and won't have
ngs_rushing season-summary rows. Our joins are LEFT JOINs and the
missing metrics come through as NaN with n = 0. Similarly, an RB
with no receptions has NaN receiving metrics.
Policy: before combining into the composite, any NaN component z-score is replaced with 0 (neutral). This covers three distinct "no data" cases under a single rule:
- A pure thumper with n_targets = 0 has NaN receiving z-scores.
- A pass-game specialist with 15 carries (under NGS's rushing volume threshold for RYOE) has a NaN z for RYOE even though their n_carries > 0.
- Some RBs may be missing NGS rushing rows for the season entirely (e.g. rookies whose first week was postseason).
All three collapse to "no evidence on this skill = assume league average on this skill". The alternative — renormalizing composite weights per-player to drop missing components — would re-introduce role-aware weighting, which we explicitly wanted to avoid.
The stat_components.z_score column keeps the true NaN for these
rows so the UI can render "—" rather than "0.0" and be honest about
what we don't know. Only the composite calculation substitutes 0.
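A sketch of the neutralization rule, using the original v1 weights from the composite above. The function name is illustrative; the key property is that the substitution happens only inside the sum, so the stored z-scores keep their NaN.

```python
import math

# Original v1 composite weights (fumble rate enters negatively).
RB_V1_WEIGHTS = {
    "rb_ryoe_per_attempt": 0.28,
    "rb_rush_epa_per_attempt": 0.18,
    "rb_rush_success_rate": 0.14,
    "rb_rec_epa_per_target": 0.18,
    "rb_yac_over_expected_per_rec": 0.12,
    "rb_catch_pct": 0.05,
    "rb_fumble_rate": -0.05,
}


def composite_z(component_z: dict[str, float]) -> float:
    """Weighted sum with NaN (or absent) components treated as 0."""
    total = 0.0
    for name, weight in RB_V1_WEIGHTS.items():
        z = component_z.get(name, math.nan)
        if math.isnan(z):
            z = 0.0  # no evidence on this skill: assume league average
        total += weight * z
    return total


# Made-up thumper: +1 sigma on every rushing component, no receiving data.
thumper = {
    "rb_ryoe_per_attempt": 1.0,
    "rb_rush_epa_per_attempt": 1.0,
    "rb_rush_success_rate": 1.0,
    "rb_rec_epa_per_target": math.nan,
    "rb_yac_over_expected_per_rec": math.nan,
    "rb_catch_pct": math.nan,
    "rb_fumble_rate": 0.0,
}
thumper_z = composite_z(thumper)
```

The thumper's composite is 0.60 rather than 1.0: the missing receiving skill is scored as average, not ignored, which is the deliberate difference from per-player weight renormalization.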
Qualified thresholds#
Three separate qualification concepts (four thresholds, because the sub-grade gate is applied once per skill and RBs have two skills):
| Threshold | Rule | Purpose |
|---|---|---|
| Grade at all | touches >= 30 | Excludes fringe players we can't say anything meaningful about |
| Composite qualified | touches >= 120 | "Real contributor" — appears in main leaderboard |
| Rushing sub-grade qualified | carries >= 80 | Rushing sub-grade displays; else "—" |
| Receiving sub-grade qualified | targets >= 40 | Receiving sub-grade displays; else "—" |
120 touches is roughly 7-8 touches/game over a full season — half
a full-season bell cow's workload, or all of a receiving specialist
like Ekeler. Tunable if the face-check shows too many marginal
committee backs at the top.
All backs with touches >= 30 get a season_grades row; the
qualified column distinguishes them.
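The thresholds in the table reduce to four independent comparisons. A sketch with hypothetical names; touches is taken as an input because this ADR does not spell out how it is counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RBQualification:
    graded: bool              # gets a season_grades row at all
    qualified: bool           # appears in the main leaderboard
    rushing_subgrade: bool    # rushing sub-grade displays
    receiving_subgrade: bool  # receiving sub-grade displays


def qualify_rb(touches: int, carries: int, targets: int) -> RBQualification:
    """Evaluate the four RB thresholds for one player-season."""
    return RBQualification(
        graded=touches >= 30,
        qualified=touches >= 120,
        rushing_subgrade=carries >= 80,
        receiving_subgrade=targets >= 40,
    )


# Made-up early-down back: qualifies overall and as a rusher only.
early_down_back = qualify_rb(touches=150, carries=130, targets=25)
```

Keeping the flags separate means a receiving specialist can be composite-qualified while showing no rushing sub-grade, and vice versa.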
Sub-grades#
The season_grades row holds the composite grade only.
Sub-grades (rushing / receiving) are computed at read time in the
web app by combining the already-z-scored component rows in
stat_components. No schema change.
Rushing sub-grade z =
(0.28*z_ryoe + 0.18*z_rush_epa + 0.14*z_rush_success) / (0.28 + 0.18 + 0.14) then sigmoid to 0-100.
Receiving sub-grade z =
(0.18*z_rec_epa + 0.12*z_yac_over_exp + 0.05*z_catch) / (0.18 + 0.12 + 0.05) then sigmoid to 0-100.
A sub-grade renders as "—" when the sample-size threshold for that
skill isn't met. This is purely a UI convention — the composite
grade in season_grades is unaffected.
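The sub-grade formulas above share one shape: a weighted mean of the skill's component z-scores, passed through the sigmoid. The web app computes this in TypeScript; the sketch below uses Python for consistency with the pipeline, assumes the same logistic form as elsewhere (steepness chosen so z = +2 maps to 90), and uses illustrative names.

```python
import math

STEEPNESS = math.log(9) / 2  # assumed logistic tuning, z = +2 -> 90

RUSHING_WEIGHTS = {
    "rb_ryoe_per_attempt": 0.28,
    "rb_rush_epa_per_attempt": 0.18,
    "rb_rush_success_rate": 0.14,
}
RECEIVING_WEIGHTS = {
    "rb_rec_epa_per_target": 0.18,
    "rb_yac_over_expected_per_rec": 0.12,
    "rb_catch_pct": 0.05,
}


def sub_grade(component_z: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of one skill's z-scores, mapped to 0-100."""
    z = sum(w * component_z[name] for name, w in weights.items())
    z /= sum(weights.values())
    return 100.0 / (1.0 + math.exp(-STEEPNESS * z))


# Made-up back: +1 sigma on every rushing component, average receiving.
example_z = {name: 1.0 for name in RUSHING_WEIGHTS}
example_z.update({name: 0.0 for name in RECEIVING_WEIGHTS})
rushing = sub_grade(example_z, RUSHING_WEIGHTS)
receiving = sub_grade(example_z, RECEIVING_WEIGHTS)
```

Dividing by the skill's own weight total puts each sub-grade on the same z scale as the composite, so a 75 in rushing means the same distance from average as a 75 overall. The sketch assumes the sample-size gate has already been checked and that no component is NaN.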
Confidence#
season_grades.confidence = min(1, touches / 250). 250 touches is
roughly a full-season starter's workload; anyone at/above that gets
confidence = 1.
Data tier#
Per ADR-0003:
- 2016+: tier 1 (PBP + NGS available; full formula computes).
- Pre-2016: out of scope for v1. The formula depends on NGS components (RYOE, YAC-over-expected, catch %) for 45% of weight. Backfilling a pre-NGS fallback is deferred.
Consequences#
Testability: each stage is a pure function (same as QB); unit tests verify the "n=0 → z=0" neutralization, the sub-grade threshold gating, and that dual-threat backs outrank specialists.
Web app: the existing leaderboard + player detail pages render
RBs as soon as season_grades has rows. A position switcher on the
home page is a one-component follow-up (bundle with WR/TE).
Iteration: weight and k changes are single-coefficient edits
in weights.py. Adding broken-tackle-rate from PFR is a new
component row, no schema change.
v1.1 refinement (2026-04-22)#
Two caveats from the original v1 were resolved by adding two
columns to plays (migration 0005_add_fumble_and_xyac_to_plays)
and switching the RB grader's data sources:
- Fumble rate now uses plays.fumble rather than plays.fumble_lost. Fumble-lost depends on who recovers (a near-coin-flip), making it strictly noisier than true fumble rate. The change is source-only: the weight (-0.05), the large shrinkage k (200), and the ball-carrier attribution rules are unchanged.
- YAC-over-expected now sourced from plays.xyac_mean_yardage (nflfastR's xYAC model output on each completion) rather than ngs_receiving.avg_yac_above_expectation. Root cause: NGS's receiving product publishes zero RB rows regardless of target volume, so the NGS-based component collapsed to a NaN-then-neutralized 0 for effectively every RB, silently wasting its 12% composite weight. The xYAC column covers >99% of modern-era RB completions, so the component is now active signal.
Both changes preserve the existing composite weights, shrinkage
constants, qualification thresholds, and pre_adjusted flags: the
data sources change, the formula does not. Pre-adjusted remains
True for the YAC component (xYAC is still a per-play,
context-aware model, so opponent adjustment in v2 must still skip
this component).
The stat_components.component_name strings remain the same
(rb_fumble_rate, rb_yac_over_expected_per_rec), preserving the
public contract with the web app.
Deferred#
- Opponent adjustment: same deferral as QB v1. When added, the RYOE and YAC-over-expected components must be flagged as pre_adjusted: True to avoid double-adjustment.
- Broken-tackle rate from PFR: valuable skill signal, but reliability needs cross-year validation before we weight it.
- Red-zone / goal-line efficiency: small sample, mostly usage-driven, skipped.
- Two-point conversion efficiency: same reasoning.
- 20+ yard breakaway rate: potentially distinct signal from EPA, but correlation is high enough that we're dropping it for v1. Revisit if breakaway-archetype backs grade unfairly low.
- Route participation / target share as a graded input: no routes-run data ingested yet.
- Forced-fumble attribution, recoveries-in-pileups: deferred to a defensive-grading pass.
- Usage labels ("Feature / Committee / Specialist") derived from snap share. Nice UI add, not a grading change. v1.5.
References#
- ADR-0013: QB v1 grading formula (same architecture)
- ADR-0003: data tiering
- ADR-0011: thin plays table (updated by migration 0005 to include fumble and xyac_mean_yardage)
- ADR-0012: NGS three-table layout (rushing used; receiving intentionally not joined for RB grading)
Revision History#
v1.1 (2026-05-14) — rb_catch_pct removed as noise#
Removed rb_catch_pct (+0.05). Bumped rb_yac_over_expected_per_rec from +0.12 to +0.15 to absorb the receiving-side weight slot. Sum |weights| now 0.98 (vs 1.00).
Why catch_pct out:
- YoY r across 2020-2024 oscillated around zero: −0.015, +0.035, +0.120, −0.322 (mean ≈ −0.05). Same shape as WR fumble rate that was removed in the parallel WR v1.1 audit.
- Catch_pct correlates 0.61 with rb_rush_success_rate, a partial redundancy when it has signal at all. Backs who run consistently also catch checkdowns consistently; the metric was mostly stylistic, not a stable skill.
- Range 0.45-1.00 with std 0.099 in 2024: variation exists but doesn't persist year-over-year, so it's measuring noise, not skill.
Why YAC-OE bumped (not redistributed elsewhere):
- Receiving share would have dropped from 35% to 30% with a pure removal. Bumping YAC-OE by 0.03 keeps the receiving / rushing balance approximately intact (~33% receiving).
- YAC-OE has stronger YoY stability and lower overlap with other components than catch_pct did, so the weight is going to a higher-quality signal.
What we audited and rejected:
- NGS rushing additions (efficiency, rush_pct_over_expected, avg_time_to_los, percent_attempts_gte_eight_defenders): all either redundant with RYOE (|r|>0.74 in two cases) or non-skill usage markers. None added independent signal.
- MTF / Missed Tackles Forced (proposed by external grading discussion): not available in nflverse. FTN charting only has passing-play flags (is_drop, is_catchable_ball, etc.), not rushing-play forced-tackles. PFF-only stat.
- Stacked-box adjustment as a coefficient: rejected because NGS's RYOE expected-yards model already accounts for defender count in the box; layering an explicit box-count adjustment would double-correct.
- Removing rb_fumble_rate: considered (YoY r ≈ +0.07, weak) but kept. RB fumble distribution is genuinely discriminating (median 2, max 7, 54% have 2+) and within-season correlation with composite is meaningful (−0.27). Different from WR fumbles, which were pure noise.
Face-check after revision: 2024 top 10 essentially unchanged at the top (Henry, Saquon, Gibbs). Josh Jacobs moved into the top 10 (previously sat just outside). 2023 top 10 with Christian McCaffrey #1 — unchanged.
Audit methodology: Skill-tree mapping → correlation analysis on 2024 qualified RBs (n=41) → YoY noise check across 2020-2024 → face-check. Same playbook as WR v1.1. See memory: project_rb_v1_1_research.md and reference_grading_methodology.md for the full process.
v1.2 (2026-05-14) — rb_rec_epa_per_target lowered (noise), rb_yac_over_expected_per_rec bumped#
Lowered rb_rec_epa_per_target from +0.18 → +0.05; bumped rb_yac_over_expected_per_rec from +0.15 → +0.28. Sum |weights| unchanged at 0.98. Receiving share of the formula stays at 33% — just rebalanced within receiving.
Why rec_EPA weight slashed (not removed): The cross-position YoY audit run after WR v1.2 / TE v1.1 found rb_rec_epa_per_target had mean YoY r = +0.027 across 2016-2025 — the lowest of any component in the entire system. RB per-target EPA is largely a function of QB choice + game state (RBs are checkdown receivers) rather than RB skill. At +0.18 weight, this was the worst noise-to-weight ratio in any shipped grader.
Kept at +0.05 (not removed entirely) because (a) it still captures some outcome signal on the rare deep-target RB plays, (b) removing entirely would require schema changes to stat_components which are out of scope for a weight-only revision, and (c) at +0.05 the noise contribution to the composite is bounded.
Why YAC-OE absorbs the freed weight (not RYOE): YAC-OE has YoY r = 0.205 — modest but real signal, ~7× the rec_EPA signal. Moving the weight here preserves the formula's 60/33/5 rush/receive/security shape; the alternative (moving to RYOE) would have shifted the formula to rushing-heavy and disadvantaged pass-catching backs (McCaffrey-archetype). Within-receiving rebalance is the more conservative choice.
Face-check: 2024 top 5 essentially unchanged (Henry/Gibbs/Saquon/Bucky/Bijan). Biggest movers down: De'Von Achane (−5.5), Chase Brown (−5.4), James Cook (−4.3) — all receiving-EPA-heavy backs. Biggest movers up: Jonathan Taylor (+7.2), Josh Jacobs (+5.1), Joe Mixon (+3.9), James Conner (+3.2) — all YAC-OE-strong workhorses. Story is coherent: weight shifted from noise to signal, and players sort accordingly.
Audit methodology + tooling note: This was the first revision shipped via the new preview + regrade workflow. End-to-end shipping time (preview → weights.py edit → web sync → regrade 10 seasons → face-check) was ~30 seconds. See memory/reference_formula_iteration_workflow.md. Audit data in memory/project_cross_position_yoy_audit.md.
v1.3 (2026-05-14) — rush_success_rate lowered (exhaustive audit)#
Lowered rb_rush_success_rate from +0.14 → +0.05. Sum |w| 0.98 → 0.89. Combiner renormalizes — surviving components gain a few percentage points of effective share.
Why: RB exhaustive candidate audit (docs/grading/audits/2026-05-14-exhaustive-rb.md) scored 19 candidates. Two key findings:
- Same EPA-vs-success-rate redundancy as QB and WR. rb_rush_success_rate correlated +0.713 with rb_rush_epa_per_attempt (structurally: success_rate ≈ fraction of plays with positive EPA). Validity (+0.079) was the lowest of any current RB component. Combination of high redundancy + low validity → reduce weight.
- rb_pfr_yards_after_contact emerged as the strongest candidate in any audit so far. Validity +0.192, higher than ANY current component (max was RYOE at +0.130). Modest YoY (+0.313), moderate overlap with RYOE (+0.596). Real RB skill (post-contact yardage / breaking tackles / falling forward) not in the current formula.
The success_rate reduction shipped immediately (pure weight tweak via Path A workflow). The yards_after_contact addition is queued as v1.4 because it requires a new ingest module for pfr_advstats rush data (a Path B schema change). Tracked in docs/grading/pending.md.
Validity gate: RB composite vs next-year Pro Bowl correlation +0.243 → +0.247 post-regrade. Modest improvement (smaller than QB/WR's improvements because v1.2's rec_EPA reduction already partially mitigated the redundancy). Other positions unchanged.
Face-check 2024: Top 4 unchanged (Henry, Gibbs, Saquon, Bucky Irving). Same explosive-vs-consistent reshuffling pattern as QB v1.1 and WR v1.3: Joe Mixon +4.17 and Najee Harris +2.90 (explosive) rose; Bijan Robinson −3.29, Montgomery −2.97, Allgeier −3.16 (consistent operators) dropped.
What the audit confirmed about RB v1.2 decisions: rb_rec_epa_per_target had YoY r = +0.010 in this audit — even worse than the +0.027 we measured in the cross-position YoY audit that drove v1.2's reduction. The v1.2 decision to lower this from 0.18 → 0.05 was validated.
What the audit confirmed about NGS rushing: all 5 NGS rushing candidates (efficiency, time_to_los, rush_pct_over_expected, pct_eight_defenders, ryoe_per_att) rejected. Either duplicate of RYOE (ngs_ryoe_per_att at +0.961), inverse of RYOE (ngs_efficiency at −0.653), or style markers with zero validity. Confirmed all the v1.1 research findings using the new validity criterion.
v1.4 (2026-05-14) — added rb_yards_after_contact_per_carry (Path B schema change)#
Added new component: rb_yards_after_contact_per_carry at weight +0.10. New sum |w| = 0.99.
| Component | v1.3 | v1.4 | Share v1.3 | Share v1.4 |
|---|---|---|---|---|
| rb_ryoe_per_attempt | 0.28 | 0.28 | 32% | 28% |
| rb_rush_epa_per_attempt | 0.18 | 0.18 | 20% | 18% |
| rb_rush_success_rate | 0.05 | 0.05 | 6% | 5% |
| rb_rec_epa_per_target | 0.05 | 0.05 | 6% | 5% |
| rb_yac_over_expected_per_rec | 0.28 | 0.28 | 32% | 28% |
| rb_yards_after_contact_per_carry (NEW) | — | 0.10 | — | 10% |
| rb_fumble_rate | −0.05 | −0.05 | 6% | 5% |
Why: the RB exhaustive audit identified this as the highest-validity candidate of any audit so far (validity +0.192 — higher than any current RB component, higher than the best QB candidate, comparable to WR's strongest signal). YoY r +0.313 (modest, comparable to RYOE +0.246). Moderate overlap with RYOE (+0.596 — RYOE includes pre-contact OL yards; yards_after_contact isolates the post-contact RB-skill portion). Real new dimension.
Schema change required (Path B):
- New migration: db/migrations/0015_pfr_rb_rush.sql, creating the pfr_rb_rush table (carries, yards_after_contact, yards_before_contact, broken_tackles per player-season; 2018+).
- New ingest source: pfr_advstats_rush registered in pipeline/.../ingest/_cache.py.
- New ingest module: pipeline/.../ingest/pfr_rush.py, same pattern as pfr_dl.py.
- New CLI: nflgrades ingest pfr-rb-rush --season YYYY.
- Updated rb.py grader: added pfr_rush_agg CTE that LEFT JOINs pfr_rb_rush, computes yards_after_contact_per_carry = yards_after_contact / pfr_carries.
Pre-2018 handling: PFR rush data starts 2018. For RB seasons 2016-2017 the yards_after_contact_per_carry component is NULL → NaN-neutralized to 0 in composite (same pattern as WR drop_rate's pre-2022 handling).
Validity gate passed: RB composite vs next-year Pro Bowl correlation +0.247 → +0.259 post-ship (+0.012 improvement on top of v1.3's +0.004 improvement; total +0.016 since v1.2). The strongest validity gain of any single weight change in any audit. Other positions unchanged.
Face-check 2024 top 10: Henry (YAC 2.80), Gibbs (2.40), Bucky Irving (2.69, rose 4→3), Saquon Barkley (1.97, dropped 3→4 — surprisingly low YAC for the OPOY candidate, his value was explosive pre-contact runs not second-effort yardage), Jacobs, Cook, Bijan, Conner, Mason, Allgeier (2.60). Top names unchanged; Bucky Irving's elite YAC is rewarded; Saquon's grade drops slightly because his style produces fewer post-contact yards than Henry's power running or Irving's tackle-breaking.
Audit log: docs/grading/audits/2026-05-14-exhaustive-rb.md — full candidate table including the v1.4 candidate's four-criterion scores and the reasoning for shipping.
This is the first Path B ship that emerged from the exhaustive audit framework. Demonstrates that the methodology can find new components worth adding (not just weight tweaks on existing ones). The lesson for the broader project: the exhaustive audit is creating value beyond redundancy diagnostics — it surfaces missing skills.
WR v1 grading formula
- Status: Accepted (v1.3 revision — 2026-05-14)
- Date: 2026-04-22
- Supersedes: None
- Companion to: ADR-0013 (QB v1), ADR-0014 (RB v1). Same pipeline shape (extract -> shrink -> z -> composite -> sigmoid), different components, filters, and qualification thresholds.
Context#
Third concrete grading formula. QB v1 and RB v1 shipped; we're extending the same architecture to WR. Three things distinguish WR grading from the prior two:
- WRs have one skill, not two. There's no RB-style dual-skill split (rushing + receiving), so there's one composite and no sub-grades in v1. "Route runner vs YAC monster" is interesting UI data viz but not a separate qualification bucket.
- NGS receiving publishes WRs cleanly (unlike RBs, which NGS excludes). We get `avg_separation` and `avg_yac_above_expectation` on essentially all qualified WRs from 2016+.
- Target earn rate is a real signal for WRs (unlike for RBs, where carries are decreed by scheme). WRs partly earn their targets by winning routes and forcing the QB's eye. This is a new component with no RB analog.
The grade is meant to answer "how well did this WR play the receiving role this season?" — separated from usage-driven accumulators (total yards, touchdowns, target share as a volume stat).
Decision#
Composite#
grade = sigmoid(composite_z)
composite_z = 0.35 * z(shrunk_rec_epa_per_target)
+ 0.27 * z(shrunk_yac_over_expected_per_rec)
+ 0.10 * z(shrunk_separation)
+ 0.10 * z(shrunk_target_earn_rate)
+ 0.08 * z(shrunk_success_rate_per_target)
- 0.05 * z(shrunk_fumble_rate)
Sum of magnitudes = 0.95. The composite combiner normalizes by
sum of magnitudes (not signed sum); fumble contributes at its
designed 5.3% share (0.05 / 0.95). This invariant is locked by
test_signed_weights_normalize_by_magnitude in
pipeline/tests/grading/test_composite.py and further reinforced
by test_wr_v1_weights_example which uses the exact
WR_V1_WEIGHTS dict.
Rough shape:
- 62% outcome-based: EPA/target 35% + YAC-over-expected 27%
- 28% process + usage: separation 10% + target earn rate 10% + success rate 8%
- 5% ball security: fumble rate (negative)

- `z()` = within-position, within-season standardization against qualified WRs only (same helper as QB and RB).
- `sigmoid()` = `grading/sigmoid.py`, z=0 -> 50, z=+2 -> ~90.
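The two anchor points quoted for `sigmoid()` (z=0 -> 50, z=+2 -> ~90) are consistent with a logistic curve scaled to 0-100. The sketch below is one such curve, not a copy of `grading/sigmoid.py`: the logistic form, the `SLOPE` constant, and the function name `sigmoid_grade` are all assumptions made for illustration.

```python
import math

# Assumption: a plain logistic scaled to 0-100. SLOPE is chosen so that
# z=+2 lands on 90; the real grading/sigmoid.py may use a different curve.
SLOPE = math.log(9) / 2  # about 1.0986


def sigmoid_grade(z: float) -> float:
    """Map a composite z-score to a 0-100 grade (hypothetical helper)."""
    return 100.0 / (1.0 + math.exp(-SLOPE * z))


if __name__ == "__main__":
    for z in (-2.0, 0.0, 1.0, 2.0):
        print(f"z={z:+.1f} -> grade {sigmoid_grade(z):.1f}")
```

Under this assumed slope the curve is symmetric around 50, so a z of -2 maps to 10.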
Why these weights#
- EPA at 35%, not 40%. A single metric at 40% gives any systematic bias (QB quality, scripted touches, YAC-heavy offense) too much leverage. 35% keeps EPA the biggest contributor without dominating the composite.
- YAC at 27%. Highest-reliability WR signal after EPA. xYAC pre-adjusts for coverage state at the catch, so this is close to pure WR skill.
- Target earn rate at 10%, not 22%. Target share is structurally correlated with team environment (top QB, pass-heavy scheme, weak WR2 competition, weak TE/RB pass game). These confounds don't wash out across a season; they persist for players in stable situations. 10% captures the "QB looks at you" signal without letting offensive environment drive a fifth of the grade.
- Separation at 10%, not 15%. Process metric, not outcome; inflated by easy targets (screens, hitches); NGS measures at-catch rather than at-throw. Keep it modest.
- Success rate at 8%. Diversifies efficiency measurement away from pure EPA, but it's partly role-contaminated (slot checkdowns on 3rd-and-medium have a different success-rate baseline than outside verticals on 1st-and-10). 8% is a compromise — not 5% (which underweights a second efficiency lens), not 10% (which overweights a role-biased metric). Flagged as a face-check watch item: if slot specialists systematically outgrade deep threats, dial this back first.
- Catch-rate-over-expected dropped entirely. Every version of this from public data is either QB-contaminated (aggregated `plays.cpoe` per receiver rewards pairing with accurate QBs) or role-contaminated (raw NGS catch % punishes deep threats and rewards screen/flat receivers). Omitting a component is an honesty signal — PFF has proprietary charting for catchable targets; we don't. Surface raw catch % on the player page as context, keep it out of the composite.
- Fumble rate at -5%. Same rationale as RB v1.1: rare event, low YoY reliability, shrink hard.
Per-component definitions (before shrinkage)#
| Component | Raw value | Sample (n) | Source | Pre-adjusted |
|---|---|---|---|---|
| `wr_rec_epa_per_target` | mean of plays.epa on targets | targets | plays | No |
| `wr_yac_over_expected_per_rec` | mean of plays.yards_after_catch - plays.xyac_mean_yardage on completions with non-null xYAC | n_rec_with_xyac | plays (nflfastR xYAC) | Yes |
| `wr_separation` | avg_separation | targets | ngs_receiving (week=0) | Yes |
| `wr_target_earn_rate` | n_targets / n_team_pass_att_active | team pass attempts while active | plays | No |
| `wr_success_rate_per_target` | mean of plays.success on targets | targets | plays | No |
| `wr_fumble_rate` | rate of plays.fumble per reception | receptions | plays | No |
Target earn rate denominator: n_team_pass_att_active is the
sum of posteam's regular-season pass attempts across the set of
(posteam, game_id) pairs that appear in the WR's own target
plays. This handles mid-season trades cleanly — each game's
denominator is its correct team's pass volume. The "had >=1 target"
proxy for active may slightly under-count games where the WR
played but wasn't targeted; for qualified WRs this is rare.
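The denominator rule can be sketched in a few lines of plain Python. This is an illustration only: the `Play` tuple and `target_earn_rate` function are hypothetical names, and `plays` is assumed to already be restricted to pass attempts that pass the receiving filter.

```python
from collections import namedtuple

# Hypothetical row shape: one filtered pass attempt.
Play = namedtuple("Play", "game_id posteam receiver_id")


def target_earn_rate(plays, receiver_id):
    """n_targets / team pass attempts over the games where the WR was targeted."""
    targets = [p for p in plays if p.receiver_id == receiver_id]
    # "Active" = (posteam, game_id) pairs that appear in the WR's own targets.
    active = {(p.posteam, p.game_id) for p in targets}
    team_attempts = sum(1 for p in plays if (p.posteam, p.game_id) in active)
    return len(targets) / team_attempts if team_attempts else float("nan")
```

Because the active set is keyed on `(posteam, game_id)`, a receiver traded mid-season picks up each game's denominator from the team he actually played for, and games played by his old team after the trade never enter the sum.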
Fumble denominator = receptions (not targets): WRs only touch the ball on completions. Keeps fumble rate comparable across possession WRs and deep threats.
Pre-adjusted flag: wr_yac_over_expected_per_rec and
wr_separation are already context-adjusted by their upstream
models. When opponent adjustment lands in v2, these components
must be flagged so we don't double-adjust.
Filter#
A receiving play counts toward WR components iff ALL:
plays.season_type = 'REG'
plays.pass_attempt = TRUE
plays.receiver_player_id IS NOT NULL
plays.two_point_attempt IS NULL OR plays.two_point_attempt = FALSE
NOT garbage_time
Identical to the RB v1 receiving filter — reused verbatim from
grading/filters.py::RB_REC_FILTER_SQL. Garbage-time rule is the
one defined in ADR-0013.
The team-pass-attempts aggregate for the earn-rate denominator uses the same filter so numerator and denominator are consistent (both count REG-season, non-garbage, non-2pt pass attempts).
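For readers more at home in Python than SQL, the five conditions can be mirrored as a predicate. This is a sketch, not the contents of `grading/filters.py`: the function name is hypothetical, the play is assumed to arrive as a dict, and `garbage_time` is assumed to be precomputed by the ADR-0013 rule.

```python
def counts_as_receiving_play(play: dict) -> bool:
    """Python mirror of the five filter conditions (hypothetical helper)."""
    return (
        play["season_type"] == "REG"
        and play["pass_attempt"] is True
        and play["receiver_player_id"] is not None
        # SQL NULL arrives as None; both None and False must pass.
        and not play["two_point_attempt"]
        and not play["garbage_time"]
    )
```

The explicit `IS NULL OR ... = FALSE` in the SQL matters because a bare `two_point_attempt = FALSE` would evaluate to NULL, and therefore drop the row, whenever the column is NULL.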
Position assignment#
A WR grade is issued iff players.position = 'WR'. A WR running a
jet sweep doesn't get rushing credit — this is a receiving grade
only. A TE/RB running routes out of the backfield doesn't get a WR
grade; they belong in their own position's pipeline.
Empirical Bayes shrinkage#
Per component, before z-scoring:
shrunk = (n * raw + k * mu_league) / (n + k)
where mu_league is the volume-weighted WR league mean (summed
over qualified and unqualified WRs, same convention as QB/RB v1).
k per component:
| Component | n units | k |
|---|---|---|
| EPA per target | targets | 50 |
| YAC over expected per rec | receptions scored by xYAC | 30 |
| Separation | targets | 40 |
| Target earn rate | team pass attempts while active | 200 |
| Success rate per target | targets | 50 |
| Fumble rate | receptions | 100 |
Separation's k (40) is slightly below the other per-target components (50) because NGS separation has higher year-over-year reliability than raw per-play efficiency metrics. Target earn rate uses its natural denominator (team pass attempts) rather than games — the EB formulation shrinks toward league-mean target share weighted by the number of observations, which is the correct statistical framing. k=200 team pass attempts is roughly 35% of a team's regular-season pass volume.
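A minimal sketch of the shrinkage step, using the k values from the table above. The function name `eb_shrink`, the `WR_K` dict, and the example numbers (raw EPA of 0.40, league mean of 0.10) are illustrative assumptions rather than values from the pipeline.

```python
# k per component, copied from the table above (dict name is hypothetical).
WR_K = {
    "wr_rec_epa_per_target": 50,
    "wr_yac_over_expected_per_rec": 30,
    "wr_separation": 40,
    "wr_target_earn_rate": 200,
    "wr_success_rate_per_target": 50,
    "wr_fumble_rate": 100,
}


def eb_shrink(raw: float, n: float, k: float, mu_league: float) -> float:
    """shrunk = (n * raw + k * mu_league) / (n + k)."""
    return (n * raw + k * mu_league) / (n + k)


if __name__ == "__main__":
    k = WR_K["wr_rec_epa_per_target"]
    # Same raw EPA/target at two sample sizes; made-up league mean of 0.10.
    print(eb_shrink(0.40, n=25, k=k, mu_league=0.10))   # pulled hard toward 0.10
    print(eb_shrink(0.40, n=150, k=k, mu_league=0.10))  # stays close to 0.40
```

With k = 50, a 25-target sample keeps only a third of its distance from the league mean, while a 150-target sample keeps three quarters of it.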
Handling missing data#
Same policy as RB v1 (see ADR-0014 "Handling missing data"): any
NaN component z-score is replaced with 0 (neutral) before entering
the composite. stat_components.z_score keeps the true NaN so the
UI can render "-" rather than "0.0".
Practically, this matters most for:
- WRs under NGS's separation volume threshold (rookies with partial seasons, or below the volume NGS publishes). Separation is NaN; z is NaN; composite substitutes 0.
- A WR with 0 completions (only happens at the extreme low-volume end) has NaN YAC and NaN fumble rate.
The alternative — renormalizing composite weights per-player to drop missing components — would re-introduce role-aware weighting, which we explicitly want to avoid.
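The policy amounts to keeping two views of each z-score: the stored value (NaN preserved) and the value fed to the composite (NaN replaced by 0). A small sketch, with hypothetical helper names `neutralize` and `display`:

```python
import math


def neutralize(z_scores: dict) -> dict:
    """Return a copy for the composite with NaN z-scores replaced by 0 (neutral)."""
    return {name: 0.0 if math.isnan(z) else z for name, z in z_scores.items()}


def display(z: float) -> str:
    """What a UI could render from the stored value: '-' for missing, else one decimal."""
    return "-" if math.isnan(z) else f"{z:.1f}"


if __name__ == "__main__":
    stored = {"wr_separation": float("nan"), "wr_rec_epa_per_target": 1.2}
    print(neutralize(stored))                # composite input: separation counts as 0
    print(display(stored["wr_separation"]))  # stored value still renders as "-"
```

Since the composite keeps dividing by the full sum of weight magnitudes, a missing component pulls the player toward average on that dimension instead of rescaling the remaining weights.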
Weight normalization invariant#
The composite combiner normalizes by sum of magnitudes
(sum(abs(w))), not signed sum. A player at z=+1 on every
component (including fumble rate — where z=+1 means "fumbles a
lot") gets composite_z = (0.35 + 0.27 + 0.10 + 0.10 + 0.08 -
0.05) / 0.95 ≈ 0.895, and fumble penalizes at exactly its
designed 5.3% share rather than being amplified by a smaller
signed-sum denominator.
This is locked by test_signed_weights_normalize_by_magnitude
(added during RB v1.1) and by the new
test_wr_v1_weights_example which exercises the actual
WR_V1_WEIGHTS dict.
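The invariant can be reproduced by hand. The sketch below assumes the combiner is a weighted sum divided by the sum of absolute weights, as described above; the function name `composite_z` and the dict keys are assumptions, and the weights are the original v1 values (before the later revisions).

```python
# v1 weights as listed in the Composite section (keys are assumed names).
WR_V1_WEIGHTS = {
    "wr_rec_epa_per_target": 0.35,
    "wr_yac_over_expected_per_rec": 0.27,
    "wr_separation": 0.10,
    "wr_target_earn_rate": 0.10,
    "wr_success_rate_per_target": 0.08,
    "wr_fumble_rate": -0.05,
}


def composite_z(z: dict, weights: dict) -> float:
    """Signed weighted sum, normalized by the sum of weight magnitudes."""
    norm = sum(abs(w) for w in weights.values())
    return sum(w * z[name] for name, w in weights.items()) / norm


if __name__ == "__main__":
    all_plus_one = {name: 1.0 for name in WR_V1_WEIGHTS}
    print(composite_z(all_plus_one, WR_V1_WEIGHTS))  # 0.85 / 0.95, about 0.8947
```

Dividing by the signed sum (0.85) instead would return exactly 1.0 for this player and inflate the fumble term's effective share from 0.05 / 0.95 to 0.05 / 0.85.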
Qualification thresholds#
Two qualification concepts:
| Threshold | Rule | Purpose |
|---|---|---|
| Grade at all | targets >= 20 | Excludes fringe WRs we can't say anything meaningful about |
| Composite qualified | targets >= 50 | Rotational WR3 or better; appears in main leaderboard; defines z-score population |
50 targets is ~3/game over a full season, roughly the floor for "this player got real route time." Tunable if face-check shows too many marginal WR3s at the top or too many clear WR1s falling below.
All WRs with targets >= 20 get a season_grades row; the
qualified column distinguishes them.
Confidence#
season_grades.confidence = min(1, targets / 100). 100 targets is
~6/game — "real starter usage" rather than WR1 workload
(which would be ~120-140+). Pegging full confidence here gives
most healthy starters confidence = 1 and reserves the fractional
band for genuine part-season / rotational players.
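Taken together, the two thresholds and the confidence rule reduce to a small function. This is a sketch: `wr_grade_status` and the returned dict shape are hypothetical, while the numbers (20, 50, 100) come from the text above.

```python
from typing import Optional


def wr_grade_status(targets: int) -> Optional[dict]:
    """Return None below the grade-at-all floor, else qualified flag and confidence."""
    if targets < 20:
        return None  # no season_grades row at all
    return {
        "qualified": targets >= 50,               # defines the z-score population
        "confidence": min(1.0, targets / 100),    # full confidence at 100+ targets
    }


if __name__ == "__main__":
    for t in (15, 35, 60, 140):
        print(t, wr_grade_status(t))
```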
Data tier#
Per ADR-0003:
- 2016+: tier 1 (PBP + NGS available; full formula computes).
- Pre-2016: out of scope for v1. The formula depends on NGS components (separation, xYAC availability) for 37% of weight. A pre-NGS fallback is deferred; call it a v2 concern.
Validation expectations#
Expect WR composite year-over-year Pearson r on 2+-season
samples in the band 0.45 - 0.60.
Interpretation triggers:
- Below 0.45 — methodology problem. Most likely a process component (separation or success rate) dominating noise over EPA/YAC. Investigate weight distribution and per-component reliability.
- 0.45 - 0.60 — the expected band. WR production is genuinely more defense-dependent than QB production, and we don't have CB matchup adjustment in v1.
- Above 0.65 — suspicious. Likely means we're accidentally measuring usage (target volume, team context) rather than skill. Investigate whether target earn rate is pulling the stability or whether separation's metric-stability is doing more work than intended.
QB v1 for comparison was in the 0.60 - 0.70 band; WR's lower ceiling is a data limit (no CB matchup data), not a grading failure. Don't chase the QB number by tuning weights.
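A sketch of how the check might be scripted: compute Pearson r over paired consecutive-season grades, then map it to the triggers above. The function names are hypothetical. The text leaves the 0.60-0.65 range unassigned, so the label used for it here is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation over paired season grades."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def yoy_band(r: float) -> str:
    """Map a YoY r to the interpretation triggers."""
    if r < 0.45:
        return "below band: check weight distribution and component reliability"
    if r <= 0.60:
        return "expected band"
    if r <= 0.65:
        # Assumption: the text does not classify 0.60-0.65.
        return "above band, under the suspicion trigger"
    return "suspicious: may be measuring usage rather than skill"
```

Usage would be along the lines of `yoy_band(pearson_r(grades_year_n, grades_year_n_plus_1))`, with the two lists aligned by player.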
Consequences#
Testability: each stage is a pure function, same as prior
positions. Unit tests verify NaN neutralization, that a pure
separator outranks a non-separator with the same efficiency,
that the fumble penalty actually subtracts, and that the
composite normalization constant matches the hand-computed value
from WR_V1_WEIGHTS.
Web app: the existing leaderboard + player detail pages
render WRs as soon as season_grades has rows. A position
switcher on the home page is a separate follow-up (currently
hardcoded to QB; RB and WR both pending surfacing).
Iteration: weight and k changes are single-coefficient
edits in weights.py. Adding a new component (say, separation
at-throw once it becomes publicly available) is a new SQL CTE
and a new row in the weights dicts; no schema change.
Deferred (v1.1+)#
- Target-per-route-run — the clean v1.5 upgrade to target earn rate, replaces the "team pass attempts while active" proxy with a true "routes run" denominator. Requires routes-run data (PFF/FTN); not ingested.
- Team-context-adjusted target earn rate — regress target share on team pass volume + QB EPA, grade on the residual. ~30 lines of code, a v1.1 candidate if face-check shows earn rate rewarding bad-team-WR1s too generously.
- Drop rate — `plays` can't cleanly isolate drops from defended passes. Requires explicit drop charting.
- Slot vs outside split — no alignment data ingested. Face-check will tell us if the one-scale approach systematically biases one archetype.
- Contested catch rate — not available in public tracking data.
- Red-zone / goal-line efficiency — small sample, mostly role-driven.
- Opponent adjustment, team-level — same deferral as QB/RB v1. `wr_yac_over_expected_per_rec` and `wr_separation` must be flagged `pre_adjusted=True` to avoid double-adjustment.
- CB matchup adjustment — the v2+ work that would push YoY `r` from the 0.45-0.60 band toward QB-level 0.60-0.70. Requires per-target defender charting.
References#
- ADR-0013 — QB v1 grading formula (same pipeline architecture)
- ADR-0014 — RB v1 grading formula (shares receiving machinery, same NaN neutralization policy, same xYAC source for YAC-over-expected)
- ADR-0012 — NGS three-table layout (receiving table used for `avg_separation`)
- ADR-0011 — thin `plays` table (with `fumble` and `xyac_mean_yardage` added by migration 0005)
- ADR-0003 — data tiering
Revision History#
v1.1 (2026-05-14) — drop_rate in, fumble_rate out#
Replaced wr_fumble_rate (−0.05) with wr_drop_rate (−0.08). Sum |weights| now 0.98.
Why fumble out: YoY r for WR fumble rate across 2020-2024 oscillated around zero (−0.26, +0.09, −0.40, +0.27, mean ≈ −0.07). 56% of qualified WRs had 0 fumbles in 2024, 90% had ≤1 — sample too small for meaningful grading. Confirmed noise. Fumbles still penalized implicitly via rec_epa_per_target (a fumble play has negative EPA).
Why drops in: Drop rate is the only WR-skill gap our v1 didn't measure (deferred at v1 release because "plays can't cleanly isolate drops from defended passes"). FTN charting (now ingested as ftn_receiving_charting, available 2022+) flags is_drop and is_catchable_ball per play, joined to PBP receiver_player_id. Correlation audit (2024 qualified WRs, n=89) showed drop_rate has max |r|=0.21 against every other component — fully independent signal. 2024 face-check matched consensus (best hands: McLaurin, Shakir, Kupp, Addison, Hopkins, ARSB; worst hands: George Pickens 6 drops, Allen Lazard 7, Xavier Legette 5).
Weight sizing: −0.08 chosen because drops have known data-quality caveats (FTN more conservative than PFF; some "0 drops on 40 catchable" entries are borderline). Bigger than the prior fumble weight because drops are ~10× more frequent and meaningfully discriminating; not so large that FTN's noise overwhelms the rest of the formula.
Pre-2022 seasons: No FTN data, so the wr_drop_rate component is NaN-neutralized to 0 contribution. 2016-2021 WR grades are effectively the v1 formula with fumble removed. This still works because z-scoring happens within-season and the player's grade is determined by the 5 remaining components.
New schema: Migration 0014_ftn_receiving_charting.sql creates ftn_receiving_charting (player_id, season, catchable_balls, drops, contested_balls, created_receptions). Ingest module: pipeline/src/nfl_grades/ingest/ftn_receiving.py.
Research notes: Audit also considered WOPR (correlated 0.95 with target_share — redundant), RACR (target-depth artifact, not skill), NGS avg_cushion and avg_intended_air_yards (usage markers, not skills), and contested catch rate (correlated −0.71 with separation). None of these added meaningful independent signal. YPRR and CROE were considered but neither has source data in nflverse.
v1.2 (2026-05-14) — lower drop_rate weight from −0.08 to −0.05#
Triggered by the TE v1.1 audit. When auditing whether to add te_drop_rate for TE v1.1, we ran the YoY noise check on WR drop_rate after the fact — which v1.1 had skipped. Result across 2022-2025 qualified WRs (catchable_balls ≥ 50):
| Pair | n | r |
|---|---|---|
| 2022→2023 | 42 | +0.27 |
| 2023→2024 | 40 | −0.12 |
| 2024→2025 | 37 | +0.10 |
Mean YoY r = +0.09 — statistically indistinguishable from the WR fumble rate we removed (mean −0.07). By the methodology's own rule (reference_grading_methodology.md Step 3: |r| < 0.20 → "weight tiny ≤ 0.05 or remove"), the v1.1 weight of −0.08 was over-weighted. The original v1.1 justification leaned on correlation independence + face-check, both of which still hold — but those only justify inclusion at light weight, not heavy weight.
Why not remove entirely: the metric still has real cross-sectional discrimination (std ~3%, max ~16%), captures a skill no other component covers, and face-checks correctly. At small per-player denominators (catchable median ~75), YoY r is mechanically depressed by measurement error even if the underlying skill is stable — so the face-check is stronger evidence than YoY r here. Light weight (−0.05) captures the real signal without overclaiming.
Sum |weights| changes 0.98 → 0.95. Other components unchanged. Re-graded WRs 2016-2025 on Neon. Expected impact: ≤1 grade-point shift per player for most WRs; no major rank shuffles. The biggest deltas land on extreme outliers (heavy droppers ranked slightly higher; clean-hands WRs ranked slightly lower).
Symmetric with TE v1.1, which also lands te_drop_rate at −0.05 for the same reason. See ADR-0016 and memory/project_te_v1_1_research.md for the full self-audit.
Follow-up: before any more positions ship, run the YoY noise check across every component in every shipped position. The WR drop_rate gap means other components added without YoY verification may also be over-weighted. Tracked in memory/project_pending_audits.md.
v1.3 (2026-05-14) — target_earn_rate bumped, success_rate lowered (exhaustive audit)#
Changes:
- Bumped `wr_target_earn_rate` from +0.10 → +0.15 (effective share 11% → 15%).
- Lowered `wr_success_rate_per_target` from +0.08 → +0.05 (effective share 8% → 5%).
Sum |weights| changes 0.95 → 0.97. Other components unchanged.
Why: WR exhaustive candidate audit (docs/grading/audits/2026-05-14-exhaustive-wr.md) scored 22 plausible WR candidates against the four-criterion framework (reliability + cross-sectional discrimination + independence + predictive validity). Two clear findings:
- `wr_target_earn_rate` is the strongest signal in the formula. Pro Bowl validity r = +0.285 (highest of any candidate, current or proposed). YoY r = +0.682 (highest among current components, second-highest in the audit). At v1.2's 0.10 weight (11% of formula), it was underweighted relative to its signal strength.
- `wr_success_rate_per_target` has the same EPA-vs-success-rate redundancy as QB. max |r| = +0.746 with `wr_rec_epa_per_target` — same mathematical relationship that drove the QB v1.1 reduction earlier the same day (success_rate ≈ fraction of plays with positive EPA; EPA per target = mean). Validity moderate (+0.159) but redundant. Bounded at 0.05 weight.
Validity gate passed: WR composite vs next-year Pro Bowl correlation improved from +0.280 → +0.300 post-regrade (the audit's recommended decision criterion). The shift toward earn_rate (highest-validity signal) measurably aligns the formula more closely with consensus-elite recognition.
Face-check 2024: Top 5 unchanged (A.J. Brown, Chris Godwin, Marvin Mims, Khalil Shakir, Puka Nacua). Biggest movers up are alpha-target receivers: Malik Nabers +4.13, Keenan Allen +2.90, Cooper Kupp +2.83, George Pickens +2.78, CeeDee Lamb +2.45, DJ Moore +2.40, Davante Adams +2.38. Biggest movers down are rotational role players or deep threats with low target share: Devaughn Vele −2.68, Tutu Atwell −2.14. Coherent.
What the audit also found and rejected (per the article-defensible methodology — documented in the audit log):
- `wr_separation` has near-zero Pro Bowl validity (+0.003) despite strong YoY (+0.560). Either it's universal at qualified-WR level, or Pro Bowl voters reward production not process. Kept at current weight (don't reverse-engineer validity).
- `wr_pfr_broken_tackle_per_rec` — measures YAC-skill via tackle-breaking, independent of existing components (max_r +0.40 with YAC-OE), modest YoY (+0.298), modest validity (+0.144). Real WR skill not in the formula but signal is weak — documented as future consideration (same shape as the `qb_rush_epa_per_rush` mobile-QB gap).
- 18 other candidates rejected with documented reasoning — duplicates (NGS YAC-OE, target_share, WOPR), style markers (intended_air_yards), defense-driven metrics (NGS cushion), or noise (PFR drop_pct). Full table in the audit log.
Tooling: shipped via the validity-gated preview/regrade workflow (iteration-workflow.md). End-to-end ~2 minutes mechanical work plus the ~half day of audit analysis.
TE v1 grading formula
- Status: Accepted (v1.2 revision — 2026-05-14)
- Date: 2026-04-23
- Companion to: ADR-0013 (QB), 0014 (RB), 0015 (WR); ADR-0003 (data tier); ADR-0009 (parquet cache)
Context#
TE grades must reflect receiving only in v1: public data does not support a
repeatable blocking grade (no PFF-style charting). Role labels and
data_tier_reason communicate what the number measures (see Role and
data_tier below).
Decision — composite (tier 1, full six components)#
Same structure as WR v1 with separation at 7% (WR uses 10%). NGS separation is WR-coverage-geometry calibrated; TE-vs-LB/S matchups are noisier in the same metric — downweight, do not drop.
| Component | Weight |
|---|---|
| `te_rec_epa_per_target` | 0.35 |
| `te_yac_over_expected_per_rec` | 0.27 |
| `te_separation` | 0.07 |
| `te_target_earn_rate` | 0.10 |
| `te_success_rate_per_target` | 0.08 |
| `te_drop_rate` (v1.1) | -0.05 |
Sum of magnitudes |w| = 0.92 (signed sum 0.82; composite normalizer uses
sum of absolute weights — see test_signed_weights_normalize_by_magnitude
and TE tests in test_composite.py). v1 used te_fumble_rate at −0.05 in
the slot now held by te_drop_rate; same magnitude, different component.
See "Revision history" below.
The earlier "0.95" figure in this ADR was a copy-paste artifact from WR v1
(WR has separation at 0.10 → WR |w| = 0.95); TE separation is downweighted
to 0.07 for NGS-calibration reasons, giving |w| = 0.92.
YAC weight = WR (27%): do not increase TE YAC weight on intuition alone; if TE YAC YoY correlation meaningfully exceeds WR YAC in validation, consider v1.1 weight shift with evidence.
Tier 2 — role = blocking_te#
Target earn rate is role-dominated for Y-heavy TEs. Omit earn from the
composite; redistribute 0.10 to EPA and YAC in proportion 0.35∶0.27
(→ 0.406 and 0.314). Other components unchanged. The component row for
te_target_earn_rate is still written with raw / shrunk / z;
stat_components.used_in_composite = false for that row.
Because the redistribution preserves magnitude, tier-2 has the same
|w| = 0.92 and signed sum 0.82 as tier-1 — on an all-z=1 TE the two
dicts both produce 0.82 / 0.92 ≈ 0.8913. The dicts differ by where the
earn mass lands, not by total weight.
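The redistribution arithmetic can be checked with a short sketch. The function name `blocking_weights` and the dict keys are assumptions; the sketch drops the earn key from the returned dict, whereas the pipeline still writes the component row with `used_in_composite = false`.

```python
# Tier-1 weights from the table above (keys are assumed names).
TE_V1_WEIGHTS = {
    "te_rec_epa_per_target": 0.35,
    "te_yac_over_expected_per_rec": 0.27,
    "te_separation": 0.07,
    "te_target_earn_rate": 0.10,
    "te_success_rate_per_target": 0.08,
    "te_drop_rate": -0.05,
}

EPA, YAC, EARN = (
    "te_rec_epa_per_target",
    "te_yac_over_expected_per_rec",
    "te_target_earn_rate",
)


def blocking_weights(weights: dict) -> dict:
    """Move the earn-rate weight onto EPA and YAC in proportion to their weights."""
    earn, epa, yac = weights[EARN], weights[EPA], weights[YAC]
    out = {name: w for name, w in weights.items() if name != EARN}
    out[EPA] = epa + earn * epa / (epa + yac)
    out[YAC] = yac + earn * yac / (epa + yac)
    return out


if __name__ == "__main__":
    tier2 = blocking_weights(TE_V1_WEIGHTS)
    print(round(tier2[EPA], 3), round(tier2[YAC], 3))          # 0.406 0.314
    print(round(sum(abs(w) for w in tier2.values()), 2))       # 0.92
```

Because the 0.10 is split rather than discarded, the sum of magnitudes is unchanged, which is why both dicts share the same normalizer.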
Filters, features#
- Receiving filter: same as WR/RB receiving (`RB_REC_FILTER_SQL`).
- Features: plays + `ngs_receiving` (week=0) for separation; `plays` for xYAC-based YAC-over-expected; `player_seasons` summed `snaps_offense` for role.
- Fumble denominator: receptions.
Qualification#
- 15 targets minimum to emit a grade row.
- 40 targets for `qualified`.
- Confidence = `min(1, targets / 70)`.
Shrinkage (per-position k)#
TE target earn k = 100 team pass attempts (vs WR 200) — smaller
cross-player dispersion in earn rate. Other components align with WR (EPA 50,
YAC 30, separation 40, success 50, fumble 100).
Role buckets#
- `receiving_te`: target share ≥ 0.10 (targets / offensive snaps, season).
- `balanced_te`: 0.05 ≤ share < 0.10, or low-snap / low-rate catch-alls.
- `blocking_te`: share < 0.05 and offensive snaps ≥ 200.
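One way to read the three buckets as code is shown below. The function name `te_role` is hypothetical, and treating `balanced_te` as the fallthrough for every case the other two rules do not claim (including sub-200-snap, low-share TEs) is an interpretation of "low-snap / low-rate catch-alls", not something the ADR spells out.

```python
def te_role(targets: int, offensive_snaps: int) -> str:
    """Bucket a TE season by target share (targets / offensive snaps)."""
    share = targets / offensive_snaps if offensive_snaps else 0.0
    if share >= 0.10:
        return "receiving_te"
    if share < 0.05 and offensive_snaps >= 200:
        return "blocking_te"
    # Assumption: everything else, including low-snap seasons, is balanced_te.
    return "balanced_te"


if __name__ == "__main__":
    print(te_role(80, 600))  # share 0.133
    print(te_role(40, 600))  # share 0.067
    print(te_role(20, 600))  # share 0.033 with enough snaps
    print(te_role(5, 150))   # share 0.033 but under 200 snaps
```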
data_tier and data_tier_reason#
Era leg: _era_tier_for_season in grading/era_tier.py → (tier, reason) with
reason = era_pre_ngs when tier ≥ 2 from era alone.
TE merge (grading-only):
- If `role == blocking_te` and era tier 1 → `data_tier = 2`, `data_tier_reason = role_blocking_te`.
- If `role == blocking_te` and era tier ≥ 2 → keep era tier, `data_tier_reason = era_and_role`.
- Else → era `(tier, reason)` only.
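The merge rules translate directly into a small function. This sketch assumes the era leg has already produced a `(tier, reason)` tuple and that the reason is `None` for tier 1; the name `te_data_tier` is hypothetical.

```python
from typing import Optional, Tuple


def te_data_tier(
    role: Optional[str], era_tier: int, era_reason: Optional[str]
) -> Tuple[int, Optional[str]]:
    """Merge the era tuple with the TE role into (data_tier, data_tier_reason)."""
    if role == "blocking_te":
        if era_tier == 1:
            return 2, "role_blocking_te"   # role alone demotes a tier-1 season
        return era_tier, "era_and_role"    # era already demoted; record both causes
    return era_tier, era_reason            # non-blocking TEs and non-TE positions


if __name__ == "__main__":
    print(te_data_tier("blocking_te", 1, None))
    print(te_data_tier("blocking_te", 2, "era_pre_ngs"))
    print(te_data_tier("receiving_te", 2, "era_pre_ngs"))
```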
Non-TE positions: role NULL; data_tier / data_tier_reason from era tuple
only.
Schema (migration 0006)#
season_grades.role, season_grades.data_tier_reason,
stat_components.used_in_composite.
Pure blocking TEs (< 15 targets)#
No season_grades row. Team/roster UI must not hide these players when built
(see plan / UX note).
Validation#
Target TE YoY r band 0.40–0.55 (slightly below WR); interpret like ADR-0015.
Deferred#
Blocking grade, alignment splits, red-zone split, target-per-route earn rate, CB matchup, etc.
References#
- `pipeline/src/nfl_grades/grading/te.py`
- `pipeline/src/nfl_grades/grading/era_tier.py`
- `docs/adr/0003-data-tier-and-qualified-as-first-class-columns.md`
Revision history#
v1.1 (2026-05-14) — drop_rate in, fumble_rate out#
Replaced te_fumble_rate (−0.05) with te_drop_rate (−0.05). Same magnitude in the same slot. Sum |weights| unchanged at 0.92.
Why fumble out: YoY r for TE fumble rate across 2020-2025 oscillated around zero (+0.01, +0.20, +0.07, −0.25, +0.36, mean ≈ +0.08). ~50% of qualified TEs had 0 fumbles in a season; max 3. Same noise pattern as WR fumble rate. Fumbles still penalized implicitly via te_rec_epa_per_target (a fumble play has negative EPA).
Why drops in: Drops are the only TE-skill gap v1 didn't measure. FTN charting (ftn_receiving_charting, 2022+) already integrated for WR v1.1. TE drop_rate YoY r across 2022-2025: +0.33, +0.02, +0.04 (mean +0.13). Modest signal, just below the 0.20 threshold for "meaningful." Correlation with other TE components (2024, n=34) all below |r|=0.40 — independent skill dimension. Face-check passes: 2024 best hands Hooper/Akins/Moreau/Likely (0 drops); worst Cade Otton 7/59 (11.9%), David Njoku 7/74 (9.5%).
Why −0.05, not heavier: Initially considered −0.10 on the structural argument that TE separation is downweighted vs WR (0.07 vs 0.10) and on the position-emphasis intuition that hands matter more for TEs. Pulled back to −0.05 after a self-audit found that WR v1.1 had added wr_drop_rate at −0.08 without running the YoY noise check on it. When run after the fact, WR drop_rate YoY mean r ≈ +0.09 — indistinguishable from the fumble rate we removed. By the methodology's own threshold (|r| < 0.20 → "weight tiny ≤0.05 or remove"), the WR weight was over-weighted. Applying the rule consistently across positions: both TE and WR drop_rate land at −0.05. WR re-shipped as v1.2 (ADR-0015) the same day. See memory/project_te_v1_1_research.md for the full audit, including the measurement-error-suppression caveat that justifies inclusion at light weight despite weak YoY.
Pre-2022 seasons: FTN data starts 2022. For 2016-2021 TE seasons, te_drop_rate is NaN-neutralized to 0 composite contribution. Grade is computed from the remaining 5 components only.
Blocking-TE tier-2: Drop_rate stays at −0.05 in TE_V1_BLOCKING_WEIGHTS (same as TE_V1_WEIGHTS). Target-earn redistribution (→ EPA 0.406, YAC 0.314) unchanged.
Schema: No migration needed — ftn_receiving_charting already exists from WR v1.1 (migration 0014). TE grader joins it via player_id + season.
Follow-up: the WR drop_rate gap exposed a methodology hole — additions skipped the YoY noise check that removals applied. A cross-position audit (every component × every shipped position) is queued before any more positions ship. See memory/project_pending_audits.md.
v1.2 (2026-05-14) — target_earn_rate bumped, success_rate lowered (exhaustive audit)#
Two weight changes:
- `te_target_earn_rate` bumped from +0.10 → +0.15 (effective share 11% → 16%).
- `te_success_rate_per_target` lowered from +0.08 → +0.05 (effective share 9% → 5%).
Sum |weights| 0.92 → 0.94. Other components unchanged.
For the blocking_te tier-2 path, the redistribution of target_earn_rate weight scales: 0.15 redistributed to EPA + YAC in 0.35:0.27 proportion → EPA = 0.435, YAC = 0.335, separation = 0.07, success_rate = 0.05, drop_rate = −0.05.
Why: TE exhaustive candidate audit (docs/grading/audits/2026-05-14-exhaustive-te.md) scored 22 plausible TE candidates. Two key findings:
- `te_target_earn_rate` is the strongest signal in the formula — Pro Bowl validity r = +0.301 (highest of any candidate, current or proposed) and YoY r = +0.610 (also highest among current components). At v1.1's 0.10 weight (11% share), it was meaningfully underweighted. Same finding pattern as WR v1.3.
- EPA-vs-success-rate redundancy at TE — max |r| = +0.723 with `te_rec_epa_per_target`. Same mathematical pattern now confirmed at FOUR positions (QB 0.88, WR 0.76, RB 0.71, TE 0.72). Bounded at 0.05.
Validity gate passed (strongest Path A gain in any audit so far): TE composite vs next-year Pro Bowl correlation +0.384 → +0.407 (+0.023 improvement). TE was already the strongest offensive position in pre-audit baseline; now it's even stronger.
Face-check 2024:
| Rank | Player | v1.1 grade | v1.2 grade | Δ |
|---|---|---|---|---|
| 1 | George Kittle | 89.7 | 88.9 | −0.7 |
| 2 | Tucker Kraft | 86.2 | 85.2 | −1.0 |
| 3 | Jonnu Smith | 72.1 | 71.8 | −0.3 |
| 4 | Isaiah Likely | 73.1 | 71.5 | −1.6 |
| 5 | Mark Andrews | 72.6 | 70.7 | −1.9 |
| 6 | Trey McBride | 60.6 | 63.3 | +2.7 |
Brock Bowers rises 18 → 13 (+2.98). One of the known face-check misses — Bowers was undergraded by v1.1's efficiency-heavy formula despite elite target volume (153 targets as a rookie). The earn_rate bump partially corrects this.
Audit also found and documented:
- `te_separation` has NEGATIVE Pro Bowl validity (−0.053). Interpretation: TE voters reward tight-window catchers (Kelce/Andrews/Kittle) over open-route runners. Strong YoY (+0.413) says we're measuring real skill. Kept at 0.07 — don't reverse-engineer validity.
- `te_pfr_broken_tackle_per_rec`: independent signal (max_r +0.43), modest YoY (+0.419), weak validity (+0.117). Same YAC-skill gap as the QB rush/WR broken-tackle/RB-pre-v1.4 patterns. Documented as future consideration; not shipped — validity isn't strong enough to justify a Path B schema change (vs RB yards_after_contact which had +0.192).
- 18 other candidates rejected with documented reasoning. Full table in the audit log.
Pattern across offensive positions (all 4 now audited):
| Position | EPA↔success r | success_rate change | target_earn change |
|---|---|---|---|
| QB | +0.883 | 0.25 → 0.10 | n/a |
| RB | +0.713 | 0.14 → 0.05 | n/a |
| WR | +0.763 | 0.08 → 0.05 | 0.10 → 0.15 |
| TE | +0.723 | 0.08 → 0.05 | 0.10 → 0.15 |
Consistent application of the methodology. The EPA-vs-success-rate redundancy is a structural pattern, not a coincidence.
v1 face-check: offense-context contamination in high-volume receiver grades
- Status: Accepted (v1 limitation, documented; fix deferred to v1.5)
- Date: 2026-04-24
- Companion to: ADR-0014 (RB v1), ADR-0015 (WR v1), ADR-0016 (TE v1)
Context#
After shipping WR v1 and TE v1 and running both against the 2024/2025 seasons, a face-check surfaced a recurring pattern: several high-volume receivers on bad offenses graded notably lower than their tape/production would suggest. The prompting case was Brock Bowers (LV, 2024) — the rookie-target-record holder at 153 targets who landed at grade 50.4 / rank 14 of 34 qualified TEs.
The open question was whether v1's grader has a systematic bias (treat all bad-offense receivers as underrated) or something narrower. We ran a pre-check on the 2024 data before picking a direction; the data shows the confound is narrower than "all bad-offense receivers" and also real enough to need written disclosure before declaring v1 done.
Finding#
Affected WRs — 2024, top-15 by targets#
| Name | Team | Targets | Grade | Rank (of 84) | Team EPA rank | Top QB grade |
|---|---|---|---|---|---|---|
| Garrett Wilson | NYJ | 154 | 43.3 | 50 | 17 | 33.8 |
| Jerry Jeudy | CLE | 148 | 55.1 | 32 | 32 | 28.8 |
| Malik Nabers | NYG | 172 | 55.2 | 31 | 28 | 45.4 |
- Wilson: 1,100+ yds despite Rodgers' worst NFL season; ranked in the bottom 40% of qualified WRs.
- Jeudy: 1,229 yds on the league's worst offense (CLE, −0.183 EPA/play); ranked #32 is defensible but feels light.
- Nabers: rookie target record, but only rank 31 of 84 by grade (roughly the 63rd percentile).
Affected TEs — 2024, top-10 by targets#
| Name | Team | Targets | Grade | Rank (of 34) | Team EPA rank | Top QB grade |
|---|---|---|---|---|---|---|
| David Njoku | CLE | 99 | 21.2 | 34 | 32 | 28.8 |
| Dalton Schultz | HOU | 93 | 30.0 | 31 | 22 | 31.7 |
| Brock Bowers | LV | 153 | 50.4 | 14 | 31 | 29.5 |
- Njoku: last among all qualified TEs despite 1,000+ snaps, solid reputation. Strongest single data point for offense contamination.
- Schultz: rank 31/34 with 93 targets on the Stroud-injured/Young HOU offense.
- Bowers: mid-pack grade for the highest TE target volume in 2024.
Six players across the two positions, all on offenses with top-QB grade below ~46. Matches the "bad QB play × high receiver volume" pattern.
What v1 handles correctly#
The methodology is not uniformly biased against receivers on weak offenses. Two cases prove the grader distinguishes efficient play from volume-only play inside a bad offensive environment:
Brian Thomas Jr. — 2024 WR, JAX#
- 135 targets, team EPA rank #18, top QB grade 44.9 (Lawrence's rough season)
- Grade 73.9, rank 10 / 84 — top-12 WR by grade despite the weak passing context.
A naive "bad offense → underrate" bias would predict Thomas below the WR median. He's in the top 12%.
Jonnu Smith — 2024 TE, MIA#
- 111 targets, team EPA rank #21, top QB grade 80.0
- Grade 71.4, rank 4 / 34 — top-5 TE.
MIA wasn't great offensively (below-average EPA), yet Smith's per-target efficiency was high enough to surface a top-5 grade.
Zach Ertz (WAS, 2024) is the inverse counter-example worth noting: WAS was a top-4 offense by EPA (top QB 78.7), yet Ertz ranked 24/34. A strong offense did not lift a clearly declining player. The grade was right.
These three cases together show the grader is responsive to per-target efficiency rather than team context as such.
The specific confound#
The failure mode is narrower than "bad-offense receivers underrated". It is specifically:
High-volume receivers whose targets are forced by their role on a team with below-replacement QB play.
Mechanics:
- `wr_rec_epa_per_target` and `te_rec_epa_per_target` carry ~35% of the composite. EPA is QB-dependent: the same route/catch generates less EPA when the QB throws late, off-platform, or low-completion.
- `wr_yac_over_expected_per_rec` / `te_yac_over_expected_per_rec` carry ~27%. xYAC is calibrated on league-average receptions; on a bad-QB offense, contested catches and off-schedule throws reduce real YAC relative to xYAC without the receiver doing anything wrong.
- `wr_target_earn_rate` / `te_target_earn_rate` carries only ~10% and is a volume-adjacent signal. It helps, but not enough to outweigh the 62%+ from EPA and YAC-over-expected when both are QB-suppressed.
So a receiver who is forced to absorb record target volume on a team whose QB depresses EPA/target and YAC-over-expected across the board gets dinged twice (two big components each running 0.5–1.0 z below true skill) and credited once (one small component at +1.5 to +2.0 z for volume). Net: 5–15 composite points below a reasonable estimate.
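The "dinged twice, credited once" arithmetic can be sketched in composite z units. This is a minimal illustration that uses only the approximate shares quoted above (~35%, ~27%, ~10%); the function name is hypothetical. The mapping from composite z to grade points is not stated in this ADR, so the sketch does not try to reproduce the 5–15 point figure.

```python
# Illustrative arithmetic only, using the approximate composite shares quoted above.
EPA_SHARE = 0.35    # *_rec_epa_per_target
YAC_SHARE = 0.27    # *_yac_over_expected_per_rec
EARN_SHARE = 0.10   # *_target_earn_rate


def composite_z_shift(suppression_z: float, volume_z: float) -> tuple:
    """Return (penalty, credit) in composite z units.

    penalty: loss from EPA and YAC each sitting `suppression_z` below true skill.
    credit:  contribution of the target-earn component sitting at `volume_z`.
    """
    penalty = -(EPA_SHARE + YAC_SHARE) * suppression_z
    credit = EARN_SHARE * volume_z
    return penalty, credit


# The two ends of the ranges given in the text.
for suppression_z, volume_z in [(0.5, 1.5), (1.0, 2.0)]:
    penalty, credit = composite_z_shift(suppression_z, volume_z)
    print(
        f"suppression {suppression_z:.1f} z: penalty {penalty:+.2f}, "
        f"volume credit {credit:+.2f}, net {penalty + credit:+.2f}"
    )
```

At both ends of the range the suppression penalty is larger than the volume credit, which is the asymmetry the paragraph describes.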
The Thomas / Jonnu Smith counter-examples work because their per-target efficiency was high enough in absolute terms to offset the QB context — they weren't just surviving on forced volume.
Why naive offense adjustment is wrong#
The intuitive "residualize components by team offensive EPA" would:
- Over-correct Thomas and Jonnu Smith — they already showed the efficiency needed; an additional boost for "bad offense" makes their grades unjustifiably high and distorts the top of the leaderboard.
- Under-correct Bowers / Njoku relative to what they actually need — their issue is specifically per-target efficiency suppression from QB play, not general offense-level depression. Team EPA mixes run game + line play + YAC culture, so a team-EPA adjustment would dilute the QB-specific signal.
- Create new problems on good offenses — a good-offense receiver who's actually mediocre (Ertz 2024) would get a negative context adjustment and drop below where he belongs.
The right fix is usage-conditional and QB-specific: adjust per-target efficiency components for the QB quality the receiver was playing with, but only for the portion of targets that are "forced" (high target share on bad QB), and leave already-efficient-despite-bad-QB players unadjusted.
That is not a hotfix. It is a methodology change.
Decision#
Ship v1 as-is. Document the confound here. Do not modify weights, thresholds, or components. Do not layer a naive offense adjustment on top of v1.
Defer the real fix to v1.5.
v1.5 plan candidates (do not pick now; analyze first)#
1. QB-quality-conditional z-scoring: when z-scoring `*_rec_epa_per_target` and `*_yac_over_expected_per_rec`, condition on the receiver's primary-QB composite grade (or a CPOE-derived QB quality score). Requires a second regression pass over historical seasons to calibrate.
2. Usage-residualized volume: add a "forced target share" signal and partially upweight it when the receiver's QB is below a threshold. Functions as a compensating positive weight only for the high-volume-on-bad-QB cell.
3. Combination: (1) corrects the EPA/YAC depression, (2) credits the fact that absorbing forced volume is itself a skill signal.
All three need a validation pass against multi-season data before picking. Historical backfill of 2016–2023 (already flagged as the other major pending work) is a prerequisite — single-season analysis can't separate noise from true context effects.
UI mitigation for v1#
On player pages, display alongside the composite grade:
- Team offensive EPA/play and its league rank that season.
- Top QB grade on the player's team that season.
- If the player is a receiver (WR/TE/RB) with top-15 volume and their team's top QB grade is below ~45, a small inline note: "grade may be suppressed by QB context — see ADR-0017."
This does not change the grade. It surfaces the context the grade doesn't fully capture, so a user reading Bowers' 50.4 sees "Raiders offense #31, top QB 29.5" next to it and understands what they're looking at.
The note trigger is deliberately narrow (top-volume + bad QB) so it doesn't fire on every bad-offense receiver — that would dilute its meaning and contradict what the data actually shows (see Thomas / Smith).
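The trigger logic can be sketched as a small predicate. This is a sketch of the rule only, written in Python for consistency with the pipeline rather than as the web app's actual code. The `PlayerSeasonContext` type and its field names are hypothetical, and the exact cutoff is an assumption, since the rule above says "below ~45".

```python
from dataclasses import dataclass

RECEIVER_POSITIONS = {"WR", "TE", "RB"}
TOP_VOLUME_RANK = 15        # "top-15 volume" from the rule above
QB_GRADE_THRESHOLD = 45.0   # the rule says "below ~45"; the exact cutoff is an assumption


@dataclass(frozen=True)
class PlayerSeasonContext:
    position: str
    target_rank: int      # hypothetical field: rank by targets within the position
    top_qb_grade: float   # top QB grade on the player's team that season


def show_qb_context_note(ctx: PlayerSeasonContext) -> bool:
    """True only for the narrow cell: receiver + top volume + weak QB play."""
    return (
        ctx.position in RECEIVER_POSITIONS
        and ctx.target_rank <= TOP_VOLUME_RANK
        and ctx.top_qb_grade < QB_GRADE_THRESHOLD
    )


# Bowers 2024: highest TE target volume, top QB grade 29.5, so the note fires.
print(show_qb_context_note(PlayerSeasonContext("TE", 1, 29.5)))
# A receiver with a top QB grade of 80.0 never triggers it, whatever the volume rank.
print(show_qb_context_note(PlayerSeasonContext("TE", 5, 80.0)))
```

Keeping the three conditions in one conjunction makes the narrowness explicit: relaxing any single one widens the note to players the data does not support flagging.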
Consequences#
Easier:
- v1 ships with a known, bounded limitation instead of an unfinished methodology fix. The boundary is written down and visible to users.
- v1.5 has a clear mandate backed by specific player cases to validate against (Wilson, Jeudy, Nabers, Njoku, Schultz, Bowers; counter-examples Thomas, Jonnu Smith, Ertz).
Harder:
- Until v1.5 lands, six named players per season carry visibly suppressed grades and users have to read the context panel to interpret them correctly. Acceptable for an MVP; not acceptable long-term.
- The UI has to carry context columns that wouldn't be needed if the grade self-adjusted.
Explicitly given up:
- Claiming v1 is "context-neutral". It isn't. It is "per-target efficiency-weighted within the population", which is adjacent but not the same. The /about page and the ADR index should both reflect that honestly.
References#
- 2024 face-check data (throwaway query, not committed) — results inlined above in §Finding and §What v1 handles correctly.
- ADR-0015 §Validation — the WR YoY-r band that would inform v1.5 calibration.
- ADR-0016 §Validation — TE YoY-r band.
- Pending: multi-season backfill (2016–2023) to enable usage-conditional z-scoring without overfitting to one season.
ADR-0018: CB v1 Grading Formula
- Status: Accepted (v1.2 revision, 2026-05-14; see Revision History)
- Date: 2026-05-12
Context#
CB grading is the first defensive position in the system. The core challenge is
data: nflverse play-by-play records which defender broke up or intercepted a pass
(pass_defense_1_player_id, interception_player_id), but it does not record
which CB was in coverage on completions. This makes PBP-only CB metrics severely
biased — we can count PBUs and INTs, but not completions allowed or yards surrendered.
Data sources:
- Coverage stats (targets, completions, yards, YAC, TDs, INTs): PFR Advanced Defensive Stats via `nflreadpy.load_pfr_advstats(stat_type="def")`. The only free, publicly available source with full coverage-side metrics per CB per season.
- Pass breakups (PBU): nflverse weekly player stats via `nflreadpy.load_player_stats()`, column `def_pass_defended`. PFR's advstats `bats` column is batted passes at the line of scrimmage (a pass-rush stat), not coverage PBUs; this was confirmed by inspection. nflverse box-score stats have ~95%+ of CB starters with non-zero PBU totals.
- Defensive snap counts: `player_seasons.snaps_defense`, populated by the snap-counts ingest. Used as the denominator for target rate.
Coverage: 2018+ only. PFR began publishing per-CB target/completion data in 2018. Seasons 2016–2017 have no CB grades.
Decision#
Metric Set (v1.1 passer-rating revision, 2026-05-14)#
| Component | Weight | k (shrinkage) | Direction | Rationale |
|---|---|---|---|---|
| `cb_passer_rating_allowed` | −0.35 | 40 targets | Lower is better | NFL passer rating allowed when targeted. Industry-standard coverage damage metric combining comp%, yards per attempt, TDs allowed, and INTs into one number. Replaces separate `cb_comp_pct_allowed` and `cb_int_rate` components. The single cleanest CB skill signal in our dataset (2024 top 10 by this metric = consensus elite CBs: Stingley, Surtain, Humphrey, Wiggins, McDuffie, Gonzalez). k=40 because passer rating swings 25+ points off a single TD or INT in a 50-target sample. |
| `cb_yac_per_rec_allowed` | −0.15 | 50 targets | Lower is better | Post-catch YAC reflects cushion allowed and tackling quality near the catch point, a distinct skill from preventing the catch. Distinct from passer rating allowed (which measures yards on the throw, not yards-after-catch). PFR publishes this for most seasons; missing values are NaN-neutralized. |
| `cb_target_rate` | −0.08 | 150 snaps | Lower is better | Targets per defensive snap. Elite CBs get avoided: QBs scheme away from them before the ball is snapped, independent of what happens when they do throw. Denominator is defensive snaps (not coverage snaps, which are unavailable in public data), so the metric conflates avoidance with role depth; modest weight reflects this limitation. |
| `cb_pbu_rate` | +0.12 | 80 targets | Higher is better | Pass breakups per target. Active defense that breaks up the catch. INTs are now captured inside `cb_passer_rating_allowed`, so this is PBU-only (vs v1 which counted PBU and INT separately). Sourced from nflverse `def_pass_defended`. |
Weight magnitudes: passer rating 50% + YAC 21% + PBU 17% + target rate 11% = 100% (combiner normalizes by sum of |weights| = 0.70 — same total as v1).
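The normalization can be sketched as follows. This is a minimal illustration under one assumption: that the composite is the signed weighted sum of component z-scores divided by the sum of absolute weights. The function names are hypothetical and are not the pipeline's actual combiner.

```python
# Signed design weights from the v1.1 metric table above.
CB_WEIGHTS_V1_1 = {
    "cb_passer_rating_allowed": -0.35,
    "cb_yac_per_rec_allowed": -0.15,
    "cb_target_rate": -0.08,
    "cb_pbu_rate": 0.12,
}


def weight_shares(weights: dict) -> dict:
    """Share of each component by |weight|, summing to 1.0."""
    total = sum(abs(w) for w in weights.values())
    return {name: abs(w) / total for name, w in weights.items()}


def combine(z_scores: dict, weights: dict) -> float:
    """Assumed combiner: signed weighted sum of z-scores over sum of |weights|."""
    total = sum(abs(w) for w in weights.values())
    return sum(weights[name] * z_scores[name] for name in weights) / total


for name, share in weight_shares(CB_WEIGHTS_V1_1).items():
    print(f"{name}: {share:.1%}")

# A CB one standard deviation "good" on every component: negative z where lower
# is better, positive z where higher is better.
uniformly_good = {
    "cb_passer_rating_allowed": -1.0,
    "cb_yac_per_rec_allowed": -1.0,
    "cb_target_rate": -1.0,
    "cb_pbu_rate": 1.0,
}
print(combine(uniformly_good, CB_WEIGHTS_V1_1))
```

Dividing by the sum of absolute weights means the composite stays on the z scale even when the weights do not sum to 1, which is why the total can change between revisions without rescaling grades.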
Why no tackling component?#
Tackling is what happens when coverage fails — a CB who tackles well after allowing a completion is still worse than one who didn't allow it. Comp% and YAC already penalize the underlying event. Adding tackling would partially reward CBs for the failure that led to the tackle opportunity. There is also a role confound: slot CBs make more tackles than outside CBs by position geography, not skill.
Why no TD rate component? (removed in v1.1)#
TD rate was in the original v1 formula at −0.07. It was dropped because:
- Rarity: a CB allows 2–5 TDs per season. Even with k=80 shrinkage, this is dominated by noise (r<0.15 YoY).
- Redundancy: a CB who allows TDs is already penalized via comp% (the catch happened) and YAC (it went to the end zone). The formula double-counted bad outcomes in a noisy, rare-event-driven way.
- At 7% weight, the noise contribution exceeded the signal contribution for any CB with fewer than ~8 TDs allowed (essentially all of them).
The 0.07 was reallocated: +0.03 to cb_pbu_rate (0.09→0.12) and the remaining
0.04 absorbed by the new cb_target_rate component.
Why comp% weight held at −0.22?#
The original v1.0 formula had comp% at −0.22 and no target rate. When target rate was added (−0.08), comp% could have been trimmed to avoid over-weighting coverage outcome signals. However: comp% and target rate measure different things (what happens on targeted plays vs. how often the QB throws his way at all), so they are not redundant. Keeping comp% at −0.22 preserves its role as the dominant single signal while target rate adds orthogonal avoidance information.
Qualification#
- Minimum targets to appear: 25 (appears in the system with "low volume" badge).
- Qualified threshold: 30 targets (included in the percentile pool).
- Confidence full at: 60 targets (~4 targets per game for a full-season starter).
Role Classification#
CBs are classified based on slot_pct from PFR:
| Role | Condition |
|---|---|
| `outside_cb` | slot_pct < 35% |
| `hybrid_cb` | 35% ≤ slot_pct ≤ 65% |
| `slot_cb` | slot_pct > 65% |
Role is label-only — z-scores are computed against the full CB pool, not within role cohorts. With ~30–60 qualified CBs per season, splitting further would make z-scores unstable. The role label lets fans understand why a Patrick Surtain grade looks different from a Darius Slay grade.
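The role table translates directly into a small classifier. This sketch assumes `slot_pct` is stored on a 0–100 scale; returning `None` for a missing value is also an assumption, not documented behavior.

```python
import math
from typing import Optional


def classify_cb_role(slot_pct: Optional[float]) -> Optional[str]:
    """Label a CB by slot usage (assumed 0-100 scale). Label-only: no effect on z-scores."""
    if slot_pct is None or math.isnan(slot_pct):
        return None  # assumption: no role label when PFR has no slot_pct
    if slot_pct < 35.0:
        return "outside_cb"
    if slot_pct <= 65.0:
        return "hybrid_cb"
    return "slot_cb"


for pct in (10.0, 35.0, 65.0, 80.0):
    print(pct, classify_cb_role(pct))
```

Both boundaries (35% and 65%) fall in `hybrid_cb`, matching the inclusive inequalities in the table.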
Shrinkage k rationale#
- comp% and YAC (k=50 targets): Moderate shrinkage. After ~50 targets (~3 games of coverage), a CB's comp% is reliable enough that the empirical prior carries half the weight.
- INT and PBU rates (k=80 targets): High shrinkage because these are rare/noisy events (r<0.25 YoY). A CB with 30 targets and 3 INTs looks elite but may just have been lucky; k=80 pulls that toward the mean.
- target_rate (k=150 snaps): QB avoidance is more stable than rate stats (scheme-driven, not event-driven), so less shrinkage is warranted. However, the snap denominator includes snaps where the CB was in run defense or box alignment, not purely in coverage — k=150 provides modest pull toward the mean for low-snap players where this noise is largest.
Empirical Bayes shrinkage denominator#
The "sample size" for EB shrinkage is:
- targets for comp%, YAC, INT rate, PBU rate — the natural denominator for these per-target rates.
- snaps_defense for target_rate — the natural denominator for a per-snap rate.
YAC's rate denominator is completions, but its EB denominator is targets — this ensures YAC shrinks at the same rate as comp% for a given number of targets.
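The shrinkage described above can be sketched with the standard empirical Bayes form, in which the prior mean is weighted as if it were `k` pseudo-observations. This matches the statement that the prior carries half the weight at n = k, but the pipeline's exact estimator is an assumption here, and the prior means used in the example are hypothetical.

```python
def shrink(rate: float, n: float, prior_mean: float, k: float) -> float:
    """Pull an observed rate toward the prior mean.

    n: sample size in the EB denominator (targets or snaps, per the list above).
    k: shrinkage constant; at n == k the observed rate and the prior weigh equally.
    """
    return (n * rate + k * prior_mean) / (n + k)


# Per-target rate with k=50: at 50 targets the result is halfway to the prior.
print(shrink(rate=0.50, n=50, prior_mean=0.60, k=50))

# The "30 targets and 3 INTs" case with k=80 and a hypothetical 2% prior:
# the observed 10% rate is pulled most of the way back toward the mean.
print(shrink(rate=3 / 30, n=30, prior_mean=0.02, k=80))
```

Passing targets as `n` for YAC, even though its rate denominator is completions, is what makes YAC shrink in step with the other per-target components.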
NaN Handling#
Standard NaN-neutralization policy (ADR-0015): if a component's z-score is NaN
(due to missing source data), it is replaced with 0.0 before entering the composite.
The raw NULL is preserved in stat_components.z_score so the UI renders "—".
Known NaN sources:
- `cb_yac_per_rec_allowed`: NULL in `pfr_def_coverage.yac` for some seasons.
- `cb_pbu_rate`: NULL for CBs absent from nflverse player_stats (edge cases).
- `cb_target_rate`: NULL if `snaps_defense = 0` (player_seasons not yet populated).
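The neutralization policy can be sketched as follows. The function name is hypothetical and plain dictionaries stand in for whatever frame the pipeline actually uses. The point is that the zero-filled copy feeds the composite while the raw values stay untouched for storage.

```python
import math
from typing import Dict, Optional


def neutralize(z_scores: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Return a copy with missing z-scores replaced by 0.0 (the population mean)."""
    cleaned = {}
    for name, z in z_scores.items():
        missing = z is None or math.isnan(z)
        cleaned[name] = 0.0 if missing else z
    return cleaned


raw = {
    "cb_passer_rating_allowed": -1.2,
    "cb_yac_per_rec_allowed": None,   # e.g. YAC missing for the season
    "cb_target_rate": float("nan"),   # e.g. snaps_defense not populated
    "cb_pbu_rate": 0.8,
}
for_composite = neutralize(raw)
print(for_composite)
print(raw["cb_yac_per_rec_allowed"])  # still None, so the UI can render a dash
```

A z-score of 0.0 means "league average", so a missing component neither helps nor hurts the player.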
Alternatives Considered#
PBP pass_defense_1_player_id for PBU: Only captures ~31% of incompletions
(drops, overthrows, and throwaways get no credit). Too noisy and biased against
CBs who contest uncredited balls. Rejected.
PFR bats column for PBU: Confirmed via data inspection to be batted passes
at the line of scrimmage (pass-rush stat), not coverage PBUs. Most safeties and
CBs have 0 bats. Rejected — using nflverse def_pass_defended instead.
Yards per target instead of comp% + YAC: Simpler, but merges two different skills into one opaque number. They have different YoY reliability and warrant different k values. The decomposed form also gives the player profile page more granular insight per component.
TD rate in the formula: Included in v1.0; removed in v1.1. See rationale above.
Targets per coverage snap (not defensive snap): More precise than targets per defensive snap, but coverage snap counts are not available in any free public dataset. The full PFR advstats feed lacks a coverage-snaps column; participation data identifies FS/SS designations per play but does not cleanly distinguish "in coverage" from "in the box." Deferred to v2 if data becomes available.
Role-bucketed z-scoring (outside vs. slot): Correct in principle — an outside CB's comp% should be compared to other outside CBs. Deferred because with ~30–60 qualified CBs per season, splitting further produces unstable z-scores. The role label provides context without distorting the z-score distribution.
Consequences#
- CB grades available from 2018–present.
- Pipeline requires three inputs: `load_pfr_advstats(stat_type="def")` for coverage stats, `load_player_stats()` for PBU (`def_pass_defended`), and `player_seasons.snaps_defense` (populated by the snap-counts ingest) for target rate. The first two are nflreadpy sources; the third is a database column.
- Historical seasons 2016–2017 return no CB grades.
- YAC component may be absent in some early seasons (2018–2019). PBU component may be absent for edge-case CBs not in the nflverse player_stats source. Target rate component is absent if snap-counts ingest has not been run for a season. All are NaN-neutralized gracefully.
- v1.1 formula changes require re-running `nflgrades grade --position CB` for all 2018–2025 seasons to update stat_components and season_grades.
Revision History#
2026-05-14 (passer-rating revision): Replaced two components — cb_comp_pct_allowed (−0.22) and cb_int_rate (+0.10) — with a single cb_passer_rating_allowed component at weight −0.35. Computed season-long from comp/targets/yards/TDs/INTs using the standard NFL passer rating formula. YAC weight reduced slightly (−0.18 → −0.15) to keep total |weights| at 0.70 (unchanged from v1).
Why: Passer rating allowed is the industry-standard NFL coverage metric. It captures all four sub-stats (comp%, yards/attempt, TDs, INTs) in one number with proper weighting — and critically, it penalizes TDs allowed (v1 didn't) while rewarding INTs (v1 did separately). 2024 face-check confirmed Marlon Humphrey moved #13 → #4, Christian Gonzalez #14 → #10; consensus elite CBs (Stingley, Surtain, Wiggins, McDuffie, Q. Mitchell) all in the top 10.
No data backfill needed — pfr_def_coverage already stored TDs allowed; v1 just didn't use them. Re-graded all 2018-2025 seasons.
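For reference, the standard NFL passer rating formula applied to coverage stats looks like the sketch below. Targets stand in for attempts, per the description above. The function name is hypothetical, and returning NaN for zero targets is an assumption rather than documented pipeline behavior.

```python
import math


def _clamp(x: float) -> float:
    """Each of the four terms is bounded to [0, 2.375] in the NFL formula."""
    return max(0.0, min(x, 2.375))


def passer_rating_allowed(
    targets: int, completions: int, yards: float, tds: int, ints: int
) -> float:
    """Season-long NFL passer rating on throws into this defender's coverage."""
    if targets <= 0:
        return math.nan  # assumption: no rating without targets
    a = _clamp((completions / targets - 0.3) * 5)   # completion percentage term
    b = _clamp((yards / targets - 3) * 0.25)        # yards per attempt term
    c = _clamp(tds / targets * 20)                  # touchdown rate term
    d = _clamp(2.375 - ints / targets * 25)         # interception rate term
    return (a + b + c + d) / 6 * 100


# Bounds of the scale: every term maxed out, and all incompletions.
print(round(passer_rating_allowed(10, 10, 125, 2, 0), 1))
print(round(passer_rating_allowed(10, 0, 0, 0, 0), 1))
```

Because the rating is lower-is-better for a defender, the component enters the composite with a negative weight.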
v1.2 (2026-05-14) — cb_target_rate weight lowered (exhaustive audit)#
Lowered cb_target_rate from −0.08 → −0.05. Sum |w| 0.70 → 0.67. Other components unchanged.
Why: CB exhaustive candidate audit (docs/grading/audits/2026-05-14-exhaustive-cb.md) scored 11 candidates (4 current + 7 new from pfr_advstats_def). Key finding:
- `cb_target_rate` validity vs next-year Pro Bowl = +0.013 (essentially zero), and the sign disagrees with the design weight direction. We model "elite CBs get avoided" with a negative weight, but at the qualified-CB level the validity SIGN is positive, meaning top corners actually face similar target volumes (all matched on WR1s). The avoidance effect exists at the league-wide level but doesn't differentiate within the qualified cohort.
- Per methodology: when validity is near zero, weight should be ≤0.05. Bounded.
Validity gate: CB composite vs next-year Pro Bowl correlation +0.219 → +0.220 (essentially unchanged). Expected — target_rate was barely contributing to the composite because its validity was near zero. This is methodology cleanup (don't over-weight near-zero-validity signal), not a validity gain.
Face-check 2024: Top 4 unchanged (Derek Stingley Jr., Pat Surtain II, Nate Wiggins, Marlon Humphrey). Top 10 mostly the same; minor reshuffles at #5-10 (Paulson Adebo rose 54→41, +3.89, biggest mover).
What the audit confirmed about v1.1: the passer-rating-allowed consolidation was correct. All four PR sub-components (comp%, yards/att, TDs, INTs) either correlate ≥+0.62 with PR_allowed (subsumed) or are noise standalone. No reason to break PR_allowed back into pieces.
Honest take on CB validity ceiling: CB has structurally weak Pro Bowl validity (baseline +0.219, second-weakest after LB). Pro Bowl CB voting rewards narrative + interceptions + "shutdown" reputation more than per-target efficiency. This is position limitation, not formula bug. No realistic weight change will move CB validity from 0.22 → 0.40 — the ceiling is set by voter behavior.
What was REJECTED with documented reasoning:
- `cb_int_rate`: highest standalone validity (+0.165) but mathematically inside passer_rating_allowed (INT events drop PR allowed by ~25 pts). Adding would double-count. The auto-verdict's "STRONG ADD" flag is a false positive: it doesn't know about mathematical containment. Documented.
- `cb_comp_pct_allowed`, `cb_td_rate_allowed`: also inside PR_allowed.
- `cb_missed_tackle_rate`: YoY +0.272, validity zero. Pro Bowl voters don't differentiate CBs on tackling.
- `cb_tackles_per_snap`: strongest YoY in audit (+0.490) but validity −0.107. Measures zone vs press style, not skill.
- `cb_adot_allowed`: scheme indicator (depth of targets), zero validity.
No new components added. The v1.1 4-component shape is structurally right; only weight tweak is the target_rate shrink.
ADR-0019: Safety v1 Grading Formula
- Status: Accepted (v1.2 target_rate cleanup, 2026-05-14)
- Date: 2026-05-13
Context#
Safety is the second defensive position graded by the system. The core challenge is that safeties play two distinct roles — deep coverage and run support — and no single metric captures both. PBP data records which defender made a tackle or PBU but does not reliably identify the covering defender on completions (the same problem as CB, resolved the same way: use PFR's per-player coverage stats).
Data sources:
- Coverage stats (targets, completions, yards, INTs): PFR Advanced Defensive Stats via `nflreadpy.load_pfr_advstats(stat_type="def")`, same source as CB.
- Pass breakups (PBU): nflverse weekly player stats via `nflreadpy.load_player_stats()`, column `def_pass_defended`.
- Tackle stats (combined, TFL, sacks): nflverse player_stats columns `def_tackles_solo + def_tackle_assists`, `def_tackles_for_loss`, `def_sacks`.
- Missed tackle count: attempted from `pfr_advstats_def` (multiple column name variants). Stored as NULL if not found; the component is NaN-neutralized.
- Defensive snap counts: `player_seasons.snaps_defense` (snap-counts ingest).
Coverage: 2018+ only. PFR per-defender coverage data begins in 2018.
Decision#
Metric Set (v1.2 target_rate cleanup, 2026-05-14)#
| Component | Weight | Direction | Rationale |
|---|---|---|---|
| `s_passer_rating_allowed` | −0.30 | Lower = better | NFL passer rating allowed when targeted. Industry-standard coverage damage metric combining comp%, yards per attempt, TDs allowed, and INTs into one number. Replaces separate `s_comp_pct_allowed`, `s_yards_per_target_allowed`, and `s_int_rate` components. 2024 face-check confirmed Kerby Joseph (9 INTs, All-Pro) #1, McKinney #2, Derwin James #3, Brian Branch #5; consensus elites all in top 5. |
| `s_pbu_rate` | +0.12 | Higher = better | Pass breakups per target. Active play that breaks up the catch. INTs now captured inside passer rating allowed; this is PBU-only (down from v1 PBU+INT bundle at 0.15). |
| `s_target_rate` | −0.05 | Lower = better | Targets per defensive snap. QB avoidance signal. v1.2 audit confirmed near-zero validity (r=−0.006 vs next-year Pro Bowl, sign disagrees with design weight). Lowered from −0.08 to a residual weight that keeps the skill-tree slot without overstating the signal. Denominator is snaps_defense (not coverage snaps, unavailable in public data), so it conflates avoidance with scheme role. |
| `s_tackles_per_snap` | +0.07 | Higher = better | Combined tackles per snap. Run support and box coverage both require reliable tackling. |
| `s_missed_tackle_rate` | −0.09 | Lower = better | Missed tackles / tackle attempts. Open-field technique matters most for safeties: a miss in space typically becomes a big gain. |
| `s_backfield_disruption_per_snap` | +0.09 | Higher = better | (TFL + sacks) / snaps_defense. Measures pass-rush versatility from depth. Combined into one metric because TFL and sacks measure the same skill; combining doubles the event count and improves stability. |
Weight breakdown:
Coverage (65%): |−0.30| + |0.12| + |−0.05| = 0.47
Tackling (35%): |0.07| + |−0.09| + |0.09| = 0.25
Sum |abs| = 0.72
Why yards/target instead of YAC/rec?#
For CBs, the YAC decomposition (separate from comp%) captures a distinct skill
(cushion at the catch point + tackling quality). For safeties, who typically
defend deeper routes and assist in run support, the cleaner split is less
meaningful: a safety targeted on a post catches the ball in stride. YAC on those
routes reflects route design as much as coverage. yards_per_target collapses
the two signals into one, keeping the formula simpler without meaningful
information loss.
Why combined TFL + sacks?#
TFLs and sacks measure the same underlying outcome: stopping the play behind the line of scrimmage. Separating them at low sample sizes (1–3 sacks/season for most safeties) would produce two noisy, near-zero components. Combining them creates a more stable metric with k=300 snaps of shrinkage.
Qualification#
Snap-based, not target-based (unlike CB). Safeties can appear in 400+ snaps with very few coverage targets depending on scheme.
| Threshold | Value |
|---|---|
| Minimum snaps to appear | 200 |
| Qualified (percentile pool) | 400 |
| Full confidence | 700 |
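The three thresholds can be sketched as a tier lookup. The `Thresholds` type and the tier labels are hypothetical names chosen for illustration. The numbers are the safety snap thresholds from the table above and the CB target thresholds from ADR-0018.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    minimum: int          # below this the player does not appear
    qualified: int        # enters the percentile pool
    full_confidence: int  # confidence is full at or above this


SAFETY_SNAPS = Thresholds(minimum=200, qualified=400, full_confidence=700)
CB_TARGETS = Thresholds(minimum=25, qualified=30, full_confidence=60)


def volume_tier(volume: int, t: Thresholds) -> str:
    """Map a snap or target count to a (hypothetically named) qualification tier."""
    if volume < t.minimum:
        return "not_graded"
    if volume < t.qualified:
        return "low_volume"
    if volume < t.full_confidence:
        return "qualified"
    return "full_confidence"


print(volume_tier(450, SAFETY_SNAPS))  # safety qualification is snap-based
print(volume_tier(27, CB_TARGETS))     # CB qualification is target-based
```

Only the unit of the volume argument changes between positions, so the same function covers both the snap-based and the target-based rules.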
Shrinkage k rationale#
| Component | k | Denominator | Rationale |
|---|---|---|---|
| `comp_pct_allowed`, `yards_per_target` | 50 targets | targets | Moderate shrinkage; after ~50 targets reliability is sufficient. |
| `pbu_rate`, `int_rate` | 80 targets | targets | Heavy shrinkage; rare events (r<0.25 YoY). |
| `target_rate` | 150 snaps | snaps | Scheme-driven; less volatile than event rates. |
| `tackles_per_snap` | 200 snaps | snaps | Stable over time; role is consistent across weeks. |
| `missed_tackle_rate` | 100 tackle attempts | tackle_attempts | Technique is a real skill but angle/bounce introduces noise. |
| `backfield_disruption` | 300 snaps | snaps | TFL+sacks rare; heavy shrinkage prevents overweighting hot starts. |
NaN Handling#
Standard NaN-neutralization (ADR-0015): if a component's z-score is NaN (missing
source data), it is replaced with 0.0 before entering the composite. The raw NULL
is preserved in stat_components.z_score so the UI renders "—".
Known NaN sources:
- `s_missed_tackle_rate`: if `pfr_advstats_def` does not include a missed-tackle column for a given release (column names vary). All players in that season will have NULL missed_tackles; the entire component is NaN-neutralized.
- `s_pbu_rate`: NULL when a safety is absent from nflverse player_stats (edge cases: players without a registered gsis_id in our DB).
- `s_target_rate`, `s_tackles_per_snap`, `s_backfield_disruption_per_snap`: NULL if snap-counts ingest has not been run for the season.
Alternatives Considered#
Target-based qualification (like CB): Rejected. Safeties in zone coverage can play 600+ snaps with very few direct targets. A target-based minimum (e.g. 25) would exclude most split-safety schemes and heavily penalize traditional free safeties. Snap-based qualification is position-appropriate.
Role-bucketed z-scoring (FS vs. SS): Correct in principle. Rejected for v1
for the same reason as CB role-bucketing: with ~30–50 qualified safeties per
season, splitting further produces unstable z-scores. Role labels (if added in v2)
would use pfr_advstats_def's alignment data.
Separate TFL and sacks components: Rejected due to sample-size instability. Most safeties record 0–1 sacks per season. At k=300 snaps, both a 0.0 and a 0.5 rate shrink heavily toward the mean — the combined metric is more stable with no meaningful information cost.
Coverage-only formula: Rejected. Tackling is a core job requirement for safeties in a way it is not for CBs. A safety who excels in coverage but misses tackles in space is not an elite player. The roughly one-third tackling weight reflects real positional value.
Angles/context for missed tackles: PFR does not publish angle or distance data for missed tackles. The raw rate is accepted as-is.
Consequences#
- Safety grades available from 2018–present.
- Pipeline requires: `pfr_advstats_def`, `nflvs_player_stats`, and `player_seasons.snaps_defense`.
- Missed tackle data availability depends on the pfr_advstats_def column release; the component may be NaN-neutralized for some seasons. This will be revisited in v1.1 once data availability is confirmed for all seasons.
- Seasons 2016–2017 return no Safety grades (same PFR limitation as CB).
- To regenerate grades: `nflgrades grade --position S --season <year>` for all seasons 2018–2025.
Revision History#
2026-05-14 (v1.2 target_rate cleanup): Lowered s_target_rate from −0.08 to −0.05 after the exhaustive candidate audit (audits/2026-05-14-exhaustive-s.md). 16 candidates were scored against four criteria (YoY reliability, cross-sectional discrimination, independence, predictive validity vs next-year Pro Bowl). s_target_rate returned validity r=−0.006 (essentially zero) with the sign disagreeing with the design weight direction — the same finding as CB v1.2. At the qualified-S level, top safeties face similar target volumes; voters do not reward target avoidance. The weight is kept (skill-tree placement: avoidance is part of coverage), just reduced.
Why: Methodology cleanup, not validity gain. Validity gate moved +0.253 → +0.255 (essentially unchanged), which matches the CB pattern: defensive-back validity is structurally capped by Pro Bowl voter noise (INT-driven), not by formula error. Both DB positions converged on the same target_rate finding.
No new components added. All four PFR passer-rating sub-components (s_comp_pct_allowed, s_yards_per_target_allowed, s_int_rate, s_td_rate_allowed) were rejected for subsumption — they correlate +0.55 to +0.62 with s_passer_rating_allowed, which already mathematically incorporates them. The nflvs aggregate event rates (s_tfl_per_snap, s_sack_per_snap, s_forced_fumble_per_snap, s_int_per_snap) were rejected for rare-event noise or redundancy with s_backfield_disruption_per_snap.
Face-check 2024: Top 5 unchanged (Kerby Joseph #1, Derwin James, Xavier McKinney, Brian Branch, Calen Bullock). Biggest movers small (max ±4.8).
Weight totals: v1.1 sum |abs| = 0.75 → v1.2 sum |abs| = 0.72.
2026-05-14 (passer-rating revision): Replaced three components — s_comp_pct_allowed (−0.13), s_yards_per_target_allowed (−0.08), and s_int_rate (+0.13) — with a single s_passer_rating_allowed component at weight −0.30. Reduced s_pbu_rate from +0.15 (PBU+INT bundle) to +0.12 (PBU-only) since INTs are now inside passer rating allowed. Tackling components unchanged. Required schema migration 0013_safety_tds_allowed.sql to add tds_allowed to pfr_def_coverage_s (CB table already had it).
Why: Passer rating allowed is the industry-standard NFL coverage metric and is the single cleanest safety skill signal we have. It penalizes TDs allowed (v1 didn't capture this at all) while still rewarding INTs and forced incompletions. 2024 face-check confirmed Kerby Joseph (9 INTs, First-Team All-Pro) #1, Xavier McKinney (8 INTs, Pro Bowl) #2, Derwin James #3, Brian Branch #5.
Known limitation: Kyle Hamilton (universally regarded top-3 safety) grades #13 in 2024 because his disguised-coverage style produces fewer direct target events. This is the same "stats vs film" gap noted for LB v1.1.
Weight totals: v1 sum |abs| = 0.82 → v1.1 sum |abs| = 0.75. Same coverage/tackling proportion (~67/33).
ADR-0020 — EDGE v1 Grading Formula
- Status: Accepted (v1.2 tackle-volume add, 2026-05-14)
- Date: 2026-05-13
Context#
EDGE rushers are the primary pass-rush specialists on the defensive line. Grading them requires quantifying pass-rush production (pressures, sacks) and run-stop ability (TFLs), normalized by opportunity (defensive snaps).
Data Sources#
| Source | Columns | Coverage |
|---|---|---|
| pfr_advstats_def → pfr_def_pass_rush | pressures, sacks, QB hits, hurries, comb_tackles, missed_tackles | 2018+ |
| nflvs_player_stats → pfr_def_pass_rush | tfl (def_tackles_for_loss, sacks excluded) | 2018+ |
| player_seasons | snaps_defense | 2016+ |
TFL double-count confirmation: nflvs_player_stats.def_tackles_for_loss is confirmed to NOT include sacks. Verified empirically: Dexter Lawrence (2024) had 9.0 sacks but only 8 TFL, proving the two fields are reported separately. No overlap between edge_sack_rate and edge_tfl_rate.
Components (v1.2, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| edge_pressure_rate | pressures / snaps_defense | +0.35 | higher = better |
| edge_sack_rate | sacks / snaps_defense | +0.30 | higher = better |
| edge_tfl_rate | tfl / snaps_defense | +0.15 | higher = better |
| edge_tackles_per_snap | comb_tackles / snaps_defense | +0.05 | higher = better |
| edge_missed_tackle_rate | missed / (comb + missed) | −0.10 | lower = better |
Sum |weights| = 0.95. Normalized dynamically by composite.combine.
Relative shares: pressure 37%, sack 32%, TFL 16%, tackles 5%, missed tackles −11%.
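As a quick check on the relative shares above, each share is the component's signed weight divided by the sum of absolute weights. The sketch below is illustrative only; the dictionary and function names are hypothetical and are not taken from the pipeline's weights.py.

```python
# EDGE v1.2 weights, copied from the Components table in this ADR.
EDGE_WEIGHTS = {
    "edge_pressure_rate": 0.35,
    "edge_sack_rate": 0.30,
    "edge_tfl_rate": 0.15,
    "edge_tackles_per_snap": 0.05,
    "edge_missed_tackle_rate": -0.10,
}


def relative_shares(weights):
    """Return each component's signed share of sum(|w|), rounded to a whole percent."""
    total = sum(abs(w) for w in weights.values())
    return {name: round(100 * w / total) for name, w in weights.items()}


shares = relative_shares(EDGE_WEIGHTS)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The same helper applies to any position's weight table, which makes it easy to re-derive the share figures quoted in the rationale after a weight change.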
Qualification (snap-based)#
| Threshold | Snaps |
|---|---|
| MIN to grade | 200 |
| QUALIFIED (main leaderboard) | 400 |
| Full confidence | 700 |
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| pressure_rate | 200 snaps | Moderate stability (r ≈ 0.5 YoY) |
| sack_rate | 350 snaps | Rarer events; heavier pull toward mean |
| tfl_rate | 300 snaps | Low per-snap frequency; needs shrinkage |
| tackles_per_snap | 200 snaps | Stable signal (YoY +0.520); matches pressure_rate's stability tier |
| missed_tackle_rate | 100 tackle_attempts | Real skill signal; moderate shrinkage |
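This ADR does not spell out the shrinkage formula itself. A minimal sketch, assuming the common form in which k behaves like k pseudo-observations at the position mean; the function name, the position mean, and the sample player are all hypothetical.

```python
def shrink_rate(events, opportunities, position_mean, k):
    """Pull an observed rate toward the position mean.

    Assumed form (not confirmed by the ADR):
        (events + k * position_mean) / (opportunities + k)
    A larger k means a heavier pull toward the mean.
    """
    if opportunities < 0 or k <= 0:
        raise ValueError("opportunities must be >= 0 and k must be > 0")
    return (events + k * position_mean) / (opportunities + k)


# Hypothetical EDGE: 12 sacks on 700 defensive snaps, with an assumed
# position-mean sack rate of 1% and the sack_rate k of 350 from the table.
observed = 12 / 700
shrunk = shrink_rate(events=12, opportunities=700, position_mean=0.01, k=350)
print(f"observed={observed:.4f} shrunk={shrunk:.4f}")
```

Under this form a player with zero opportunities sits exactly at the position mean, and a player whose opportunities equal k sits halfway between the observed rate and the mean.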
Design Rationale#
Pressure rate dominant (37%): Total pressures (sacks + QB hits + hurries) is the most complete per-snap pass-rush signal available without pass-rush snap denominators. Weighting it most heavily captures the full spectrum of a rusher's production.
Sack rate separate (32%): Intentional partial overlap with pressure rate. Sacks are the premium outcome, so the separate sack weight rewards players who convert pressure into sacks at a higher rate. Trey Hendrickson (54 pressures, 17.5 sacks, 2024) grades higher than a player with 54 pressures and 7 sacks, as intended.
TFL rate included (16%): EDGE rushers set the edge on run plays. Elite rushers like Myles Garrett (22 TFL, 2023) generate meaningful run-stop production beyond just pass rush. Excluding TFL would undervalue complete DEs.
Tackles per snap added (5%, v1.2): Captures activity level — chase tackles, RB at the catch point, screen-pass tackles — that the 89%-behind-LOS rest of the formula misses. The exhaustive audit confirmed this is an independent signal (max correlation +0.468 with TFL) with real validity (+0.216) and strong reliability (YoY +0.520). Voters reward EDGEs who show up across the box score, not only on premium-event plays. Weight kept small so it diversifies the signal without diluting the pressure-and-sack core.
Missed tackle rate penalty (−11%): Technique matters for edge rushers who must disengage and pursue in space. Weight kept modest (−0.10, lower than Safety's −0.15) because edge rushers make fewer total tackles and the metric is noisier per position.
Component Overlap (intentional)#
The three positive components (pressure_rate, sack_rate, tfl_rate) correlate strongly by design. Confirmed empirically by the 2026-05-14 pairwise correlation audit (qualified EDGE-seasons pooled 2018-2025, z-score correlation):
| Pair | Pearson r |
|---|---|
| pressure_rate ↔ sack_rate | +0.728 |
| sack_rate ↔ tfl_rate | +0.778 |
| pressure_rate ↔ tfl_rate | +0.599 |
The 0.80 of total positive weight (pressure 0.35 + sack 0.30 + tfl 0.15) carries roughly 0.50–0.60 worth of independent signal once redundancy is netted out. This is intentional, not a flaw. Sack rate is layered on pressure rate as a premium-event multiplier (the Hendrickson example in "Sack rate separate" above); tfl_rate adds run-stop credit that pure pass-rush weighting would miss. Edge rushers who make plays in the backfield tend to do all three — they're variants of the same underlying skill, weighted separately so each play-type contributes.
This block is documented so a future audit doesn't try to "fix" the redundancy by dropping one of the components. The correlation is a design feature.
See ../grading/audits/2026-05-14-correlation.md for the cross-position context (iDL has the same pattern; CB/S/LB do not).
Known Limitations#
OLB gap (3-4 schemes): Original v1 limitation — closed in v1.1 (2026-05-14). The EDGE grader now reads from both pfr_def_pass_rush (EDGE-tagged) and pfr_def_lb (LB-tagged pass-rush OLBs with ≥25 pressures and target rate <3.5%). See Revision History.
No pass-rush snap denominator: Total defensive snaps is used as denominator. This conflates run-defense snaps (where pressure rate is irrelevant) with pass-rush snaps. Elite rushers who are subbed out on early downs may be slightly penalized. Pass-rush snap data is not available in public data sources.
Data begins 2018: PFR per-player advanced stats start in 2018. Seasons 2016–2017 cannot be graded.
Alternatives Considered#
Pressure rate only (no sack split): Simpler but loses information about conversion efficiency. Rejected because sack rate is meaningfully independent — two players with identical pressure rates can have very different sack counts.
Equalizing pressure and sack weights (0.35 / 0.35): Reviewed per external feedback. Rejected because pressure rate captures more signal than sack rate alone (higher volume, more stable YoY), so it warrants a higher weight.
Excluding TFL from EDGE: Initial proposal excluded TFL. Added after review — elite edge rushers do generate real TFL volume on run downs, and excluding it understates their defensive value.
Revision History#
v1.2 (2026-05-14) — tackle-volume add from exhaustive audit#
Added edge_tackles_per_snap at +0.05 after the exhaustive candidate audit (../grading/audits/2026-05-14-exhaustive-edge.md). Ten candidates were scored against four criteria (YoY reliability, cross-sectional discrimination, independence, predictive validity vs next-year Pro Bowl). edge_tackles_per_snap returned YoY r=+0.520, validity r=+0.216, and max correlation with existing components +0.468 (with edge_tfl_rate). It is an independent signal capturing chase-tackles / ahead-of-LOS plays that the existing 89%-behind-LOS formula misses. Voter mechanism: elite EDGEs show up in the box score as tackle volume too, not only as pressure events.
Rejected candidates (documented in audit doc):
- edge_qb_hits_per_snap (+0.735 correlation with pressure_rate; sub-component, would double-count)
- edge_hurries_per_snap (+0.706 correlation with pressure_rate; sub-component)
- edge_sack_per_pressure (+0.689 correlation with sack_rate; finishing already captured)
- edge_hit_per_pressure (validity −0.038; near-zero, voters don't reward this slice)
- edge_forced_fumble_per_snap (validity +0.141, rare-event noise: typical EDGE has 1-2 FF/season)
Existing components confirmed: All four returned correct-sign, real-magnitude validity (pressure +0.291, sack +0.330, TFL +0.285, missed_tackle −0.056). No rebalance needed.
Validity gate: EDGE composite vs next-year Pro Bowl correlation +0.420 → +0.424. Modest gain, expected for a +0.05-weight add to an already-strong formula.
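The validity gate is a correlation between a season's composite grade and a next-year Pro Bowl indicator. A minimal sketch of that calculation, assuming a plain Pearson correlation against a 0/1 flag (the ADR does not name the exact estimator); the function names and the six sample rows are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def validity(grades, made_next_year_pro_bowl):
    """Correlate composite grades with a 0/1 next-year Pro Bowl flag."""
    flags = [1.0 if made else 0.0 for made in made_next_year_pro_bowl]
    return pearson(grades, flags)


# Hypothetical player-seasons, not real audit data.
sample_grades = [88.0, 81.5, 74.0, 69.0, 62.5, 55.0]
sample_flags = [True, True, False, True, False, False]
r = validity(sample_grades, sample_flags)
print(f"validity r = {r:+.3f}")
```

Comparing this value before and after a weight change is what the "+0.420 → +0.424" style figures in the revision notes report.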
Face-check 2024: Top 5 unchanged in spirit — Trey Hendrickson (#1, 17.5 sacks, 1st Team All-Pro), Myles Garrett, Will Anderson Jr., Micah Parsons, Nik Bonitto. T.J. Watt at #15 reflects his down-sack year (11.5 vs career norm).
Weight totals: v1.1 sum |abs| = 0.90 → v1.2 sum |abs| = 0.95.
v1.1 (2026-05-14) — OLB-gap closure#
Closed the original v1 limitation where nflverse-classified LB pass rushers (T.J. Watt, Micah Parsons, Brian Burns, Nik Bonitto, Jared Verse, Josh Sweat, etc.) received no grades at any position. They failed the LB grader's target-rate filter (≥3.5%) for being pass-rushers, and the EDGE grader didn't see them because their position_played tag was LB. ~15-30 elite edge rushers per season were missing from the system.
Fix: The EDGE feature SQL now UNIONs two branches:
- EDGE-tagged players from pfr_def_pass_rush (original v1 source).
- LB-tagged pass-rush OLBs from pfr_def_lb, filtered to:
  - position_played = 'LB'
  - pressures ≥ 25 (real pass-rush production; separates them from blitz-heavy MLBs)
  - target_rate < 0.035 (matches the LB grader's exclusion threshold; no player is graded in both)
Both branches feed the same EDGE composite formula. pfr_def_lb and pfr_def_pass_rush have the same column shape for the components EDGE uses (pressures, sacks, comb_tackles, missed_tackles, tfl), so no other code changes were needed.
Verification: No player appears in both LB and EDGE for any season post-fix (the filter thresholds are designed to be mutually exclusive: LB requires target rate ≥3.5%, EDGE-via-OLB-branch requires target rate <3.5%).
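The two branches of the union can be sketched as a single predicate. This is a Python illustration of the filter logic described above, not the actual feature SQL; the function name and its return values are hypothetical.

```python
def edge_source_branch(position_played, pressures, targets, snaps_defense):
    """Return the table an EDGE row would come from, or None if not EDGE-eligible.

    Mirrors the documented rules: EDGE-tagged rows come from pfr_def_pass_rush;
    LB-tagged rows qualify via pfr_def_lb only with >= 25 pressures and a
    target rate below 3.5%.
    """
    if position_played == "EDGE":
        return "pfr_def_pass_rush"
    if position_played == "LB" and snaps_defense > 0:
        target_rate = targets / snaps_defense
        if pressures >= 25 and target_rate < 0.035:
            return "pfr_def_lb"
    return None


# Andrew Van Ginkel 2024 figures quoted in ADR-0022: 22 targets, 27 pressures, 922 snaps.
print(edge_source_branch("LB", pressures=27, targets=22, snaps_defense=922))
# Hypothetical off-ball LB: high target rate, few pressures.
print(edge_source_branch("LB", pressures=8, targets=70, snaps_defense=1000))
```

Because the LB grader requires a target rate of at least 3.5%, any LB-tagged row that reaches the second branch here is one the LB grader has already excluded.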
Face-check after fix:
- Micah Parsons now graded all 5 seasons (2021 LB 83.9, 2022-2025 EDGE 70.5/81.9/86.8/85.6) instead of just 2.
- 2025 EDGE top 6: Garrett, Parsons, Sweat, Muhammad, Bonitto, Burns.
- 2024 EDGE top 5: Hendrickson, Garrett, Anderson, Parsons, Bonitto.
- 2023 EDGE top 5: Bryce Huff, T.J. Watt (DPOY runner-up), Hendrickson, Hines-Allen, Greenard.
All consensus elite pass rushers now appear in the EDGE leaderboard.
Why this works data-side without new ingest: pfr_def_lb was already populated for all LB-tagged players with PFR pass-rush data starting in 2018. The fix is purely a query-side change to the EDGE grader. No migration, no re-ingest, ~30 lines of SQL added.
ADR-0021 — iDL v1 Grading Formula
Status: Accepted (v1.2 rebalance + tackle-volume add — 2026-05-14) Date: 2026-05-14
Context#
Interior defensive linemen (iDL) are the run-stuffers and interior pass rushers on the defensive line. Their primary value is stopping the run at the line of scrimmage (TFLs) and collapsing the pocket from the inside. Grading them requires weighting run-stop production more heavily than for EDGE rushers, while still capturing pass-rush impact.
Data Sources#
| Source | Columns | Coverage |
|---|---|---|
| pfr_advstats_def → pfr_def_pass_rush | pressures, sacks, QB hits, hurries, comb_tackles, missed_tackles | 2018+ |
| nflvs_player_stats → pfr_def_pass_rush | tfl (def_tackles_for_loss, sacks excluded) | 2018+ |
| player_seasons | snaps_defense | 2016+ |
Same pfr_def_pass_rush table as EDGE — the ingest covers all DL (both iDL and EDGE position codes) and the grader filters by player_seasons.position_played = 'iDL'.
TFL double-count: nflvs_player_stats.def_tackles_for_loss does NOT include sacks (same confirmation as EDGE — see ADR-0020). No overlap between idl_sack_rate and idl_tfl_rate.
Components (v1.2, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| idl_pressure_rate | pressures / snaps_defense | +0.35 | higher = better |
| idl_tfl_rate | tfl / snaps_defense | +0.25 | higher = better |
| idl_sack_rate | sacks / snaps_defense | +0.20 | higher = better |
| idl_tackles_per_snap | comb_tackles / snaps_defense | +0.05 | higher = better |
| idl_missed_tackle_rate | missed / (comb + missed) | −0.05 | lower = better |
Sum |weights| = 0.90. Normalized dynamically by composite.combine.
Relative shares: pressure 39%, TFL 28%, sack 22%, tackles 6%, missed tackles −6%.
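The ADR says only that composite.combine normalizes the weights dynamically. One plausible reading, sketched below as an assumption rather than the actual implementation, is a weighted sum of per-component z-scores divided by the sum of absolute weights of the components that are present. The function body, the z-score inputs, and the handling of missing components are all hypothetical.

```python
# iDL v1.2 weights, copied from the Components table in this ADR.
IDL_WEIGHTS = {
    "idl_pressure_rate": 0.35,
    "idl_tfl_rate": 0.25,
    "idl_sack_rate": 0.20,
    "idl_tackles_per_snap": 0.05,
    "idl_missed_tackle_rate": -0.05,
}


def combine(z_scores, weights):
    """Weighted sum of z-scores, renormalized by sum(|w|) over present components.

    Assumption: a component with no z-score (None or absent) is skipped and
    the remaining weights are rescaled so they still sum to 1 in absolute terms.
    """
    present = [name for name in weights if z_scores.get(name) is not None]
    total = sum(abs(weights[name]) for name in present)
    if total == 0:
        return None
    return sum(weights[name] * z_scores[name] for name in present) / total


# Hypothetical player one standard deviation better than average everywhere
# (for missed tackles, "better" means a z-score of -1).
z = {
    "idl_pressure_rate": 1.0,
    "idl_tfl_rate": 1.0,
    "idl_sack_rate": 1.0,
    "idl_tackles_per_snap": 1.0,
    "idl_missed_tackle_rate": -1.0,
}
composite = combine(z, IDL_WEIGHTS)
print(round(composite, 3))
```

Under this reading, the raw weight total (0.90 here) never needs to equal 1.0, which is consistent with the weight totals moving between revisions without a rescale step.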
Qualification (snap-based)#
| Threshold | Snaps |
|---|---|
| MIN to grade | 200 |
| QUALIFIED (main leaderboard) | 400 |
| Full confidence | 700 |
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| tfl_rate | 300 snaps | Low per-snap frequency (~1–2%); needs heavy pull toward mean |
| pressure_rate | 200 snaps | Moderate stability (r ≈ 0.69 YoY — strong) |
| sack_rate | 350 snaps | Rarer events; heavier pull toward mean |
| tackles_per_snap | 200 snaps | Stable signal (YoY +0.516); matches pressure_rate's stability tier |
| missed_tackle_rate | 100 tackle_attempts | Real skill signal; moderate shrinkage |
Design Rationale (v1.2)#
Pressure rate primary (39%): The exhaustive audit (2026-05-14) revealed pressure_rate is both the most reliable iDL signal (YoY r = +0.689 vs TFL's +0.371) AND the most predictive of Pro Bowl voting (validity +0.460 vs TFL's +0.260). The original v1 design assumed "iDL = run-stop primarily," but modern Pro Bowl voting rewards the interior pass-rush archetype (Aaron Donald → Chris Jones → Quinnen Williams → Dexter Lawrence). v1.2 elevates pressure to the primary signal to match both reliability and voter consensus.
TFL rate secondary (28%): Still a meaningful iDL signal — elite interior players DO generate TFLs at well-above-average rates, and run-stop is a genuine part of the job. Just not the most reliable or most voter-rewarded skill. Kept at substantial weight (28% vs EDGE's 16%) to preserve the design principle that iDL run-stop matters more than EDGE run-stop.
Sack rate third (22%): Validity audit returned +0.394, the second-highest in the formula. v1.2 raised its weight from 0.15 to 0.20 (a 22% share) to reflect this. Interior sacks remain rarer than EDGE sacks structurally, but the play, when it happens, is a premium signal of elite interior pass rush.
Tackles per snap (+0.05, v1.2 add): Captures activity / chase-tackles that pressure/sack/TFL miss. The exhaustive audit found this is an independent signal (max correlation +0.532 with pressure_rate) with real validity (+0.281) and strong reliability (YoY +0.516). Voters reward iDLs who show up across the box score. Same finding as EDGE v1.2.
Missed tackle rate penalty (−6%): Lowered from −0.15 → −0.05 in v1.1 (cross-position YoY audit) because YoY r = +0.080, barely above noise. v1.2 audit confirms validity is weak (−0.125, sign correct). Kept at −0.05 on skill-tree grounds.
iDL vs EDGE weighting difference: In v1.2 the two DL formulas have converged in structure (pressure-dominant) but diverge in TFL share: iDL at 28% vs EDGE at 16%. This is the right amount of differentiation — iDL run-stop matters more, just not enough to be the primary signal.
Component Overlap (intentional)#
The three positive components (tfl_rate, pressure_rate, sack_rate) measure overlapping aspects of "backfield disruption" and correlate strongly. Confirmed empirically by the 2026-05-14 pairwise correlation audit (qualified iDL-seasons pooled 2018-2025, z-score correlation):
| Pair | Pearson r |
|---|---|
| tfl_rate ↔ sack_rate | +0.737 |
| pressure_rate ↔ sack_rate | +0.778 |
| tfl_rate ↔ pressure_rate | +0.574 |
The 0.80 of total positive weight (pressure 0.35 + tfl 0.25 + sack 0.20 in v1.2) carries roughly 0.50–0.60 worth of independent signal. Interior linemen who beat blocks tend to do all three: make TFLs on runs, generate pressures on passes, and convert some pressures into sacks. The formula weights them separately so each play-type contributes to the grade, but the underlying skill they tap is largely shared.
This is intentional: weighting only one (say, TFL) would undercount pass-rush interior penetration; weighting only pressure would miss run-stop production. Documenting this here so a future audit doesn't try to "fix" the correlation by dropping a component.
Same pattern as EDGE (see ADR-0020 § Component Overlap). CB/S/LB formulas do not have this pattern — their components were designed to be more independent. See ../grading/audits/2026-05-14-correlation.md for cross-position context.
Known Limitations#
No pass-rush snap denominator: Total defensive snaps is used. iDL players may be subbed out in passing situations less often than EDGE rushers, so this conflation is less severe for iDL than for EDGE.
Position classification: Uses player_seasons.position_played = 'iDL' from our roster data. Nose tackles (NT) in 3-4 schemes are classified as iDL and included — they typically have lower pressure rates but may have high TFL rates on run downs.
Data begins 2018: PFR per-player advanced stats start in 2018. Seasons 2016–2017 cannot be graded.
Alternatives Considered#
Equal weights (pressure ≈ TFL ≈ 0.30): Reviewed for v1. Rejected at the time because TFL was treated as the primary iDL differentiator: a harder play for an interior lineman to make and a more direct measure of the iDL skill set. The v1.2 audit later reversed that ordering (see Revision History).
Using EDGE weights for iDL: Rejected. Applying the EDGE formula unchanged to iDL undersells interior run-stopping and would rank players more similarly to EDGE rushers than their actual role warrants. v1.2 is also pressure-dominant but keeps a larger TFL share (28% vs EDGE's 16%).
Revision History#
v1.2 (2026-05-14) — exhaustive audit rebalance + tackle-volume add#
Two-part change driven by the exhaustive candidate audit (../grading/audits/2026-05-14-exhaustive-idl.md). Ten candidates were scored against four criteria.
(a) Rebalance of existing positive weights. The audit revealed the v1.1 weights were MIS-ORDERED relative to both reliability and predictive validity:
| v1.1 rank | Component (v1.1 weight) | Validity r | YoY r |
|---|---|---|---|
| Primary | tfl_rate (0.35) | +0.260 | +0.371 |
| Secondary | pressure_rate (0.30) | +0.460 | +0.689 |
| Tertiary | sack_rate (0.15) | +0.394 | +0.450 |
The v1 design assumption "iDL = primarily run-stop TFL" reflected an older positional archetype. Modern Pro Bowl voting (Donald → Jones → Quinnen Williams → Dexter Lawrence) rewards interior pressure more, and the YoY data confirms pressure is also the more reliable signal. v1.2 reorders:
- idl_pressure_rate: 0.30 → 0.35 (now primary)
- idl_tfl_rate: 0.35 → 0.25 (de-emphasized but still meaningful)
- idl_sack_rate: 0.15 → 0.20 (validity-justified bump)
(b) Add idl_tackles_per_snap at +0.05. Independent signal (max correlation +0.532 with pressure_rate), real validity (+0.281), strong reliability (YoY +0.516). Same finding as EDGE v1.2 — tackle volume captures activity / chase-tackles that pressure/sack/TFL miss. Path B add (no new ingest — comb_tackles was already pulled by the iDL grader as the missed_tackle denominator); just added tackles_per_snap = comb_tackles / snaps_defense to extract_features.
Rejected candidates (documented in audit doc):
- idl_qb_hits_per_snap (+0.779 correlation with pressure_rate; sub-component)
- idl_hurries_per_snap (+0.709 correlation with pressure_rate; sub-component)
- idl_sack_per_pressure (YoY r = +0.008, pure noise at iDL sample sizes; differs from EDGE where this had +0.122 YoY but was still rejected for subsumption)
- idl_hit_per_pressure (validity −0.052, near-zero)
- idl_forced_fumble_per_snap (YoY +0.096 below noise threshold; validity mostly co-occurrence with sack)
Validity gate: iDL composite vs next-year Pro Bowl correlation +0.457 → +0.475 (+0.018). Biggest validity gain from any defensive audit so far. The rebalance was the right call — voters reward what the data says they reward.
Face-check 2024: Top 8 are all 2024 Pro Bowl / All-Pro caliber — Leonard Williams #1 (career year, 11 sacks), Dexter Lawrence (1st Team All-Pro), Chris Jones, Braden Fiske (DROY runner-up), DeForest Buckner, Cameron Heyward, Vita Vea, Quinnen Williams. Coherent.
Weight totals: v1.1 sum |abs| = 0.85 → v1.2 sum |abs| = 0.90.
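The weight totals quoted across the iDL revisions can be re-derived from the per-version weights stated in this ADR. The sketch below is a bookkeeping check only; the dictionary layout is hypothetical, and the v1.0 entry is reconstructed from the v1.1 and v1.2 revision notes rather than from a separate table.

```python
# Per-version iDL weights as described in this ADR's revision history.
IDL_WEIGHTS_BY_VERSION = {
    "v1.0": {"tfl_rate": 0.35, "pressure_rate": 0.30, "sack_rate": 0.15,
             "missed_tackle_rate": -0.15},
    "v1.1": {"tfl_rate": 0.35, "pressure_rate": 0.30, "sack_rate": 0.15,
             "missed_tackle_rate": -0.05},
    "v1.2": {"pressure_rate": 0.35, "tfl_rate": 0.25, "sack_rate": 0.20,
             "tackles_per_snap": 0.05, "missed_tackle_rate": -0.05},
}

# Sum of absolute weights per version, rounded to avoid float noise.
totals = {
    version: round(sum(abs(w) for w in weights.values()), 2)
    for version, weights in IDL_WEIGHTS_BY_VERSION.items()
}
print(totals)
```

A check like this is cheap to run whenever a weight is edited and catches a stale "sum |abs|" figure before it reaches the revision notes.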
v1.1 (2026-05-14) — idl_missed_tackle_rate weight lowered (noise)#
Lowered idl_missed_tackle_rate from −0.15 → −0.05. Sum |w| drops 0.95 → 0.85; combiner normalizes so the three signal-strong positive components (tfl_rate, pressure_rate, sack_rate) get more effective weight.
Why: Cross-position YoY audit (2026-05-14) found mean YoY r = 0.080 across 2018-2025 for iDL missed_tackle_rate — one of the lowest signals in the entire grader system, below even the WR/TE drop_rate components at ~0.13. At −0.15 weight this was disproportionate noise contribution. Light weight (−0.05) preserves the technique-penalty direction without overweighting noise.
Why not removed entirely: Schema-stable change is preferred (pure weight tweak via the new preview/regrade workflow, per memory/reference_formula_iteration_workflow.md). Mean r 0.080 isn't zero — there's some in-season signal, just weak YoY. Light weight bounds the noise contribution while keeping the component available if we want to revisit later.
Face-check 2024: Top 3 unchanged (Leonard Williams, Chris Jones, Dexter Lawrence). Biggest movers up are interior linemen who'd been penalized for high missed-tackle rates: Quinnen Williams #16 → #9 (+7.84), Solomon Thomas #37 → #28 (+9.23), Jalen Carter #23 → #13 (+7.76). Coherent — these are players whose technique reputation isn't "missed tackler" but our noisy metric was treating them as such.
Audit data: memory/project_cross_position_yoy_audit.md. Shipped via nflgrades preview → edit weights.py → sync_weights_to_web.py → nflgrades regrade per season (the new workflow). End-to-end ~30 seconds.
ADR-0022 — LB v1 Grading Formula
Status: Accepted (v1.2 rebalance from exhaustive audit — 2026-05-14) Date: 2026-05-14
Context#
Off-ball linebackers are a multi-skill position — run defense, coverage, and situational pass rush. Unlike EDGE/iDL (pass-rush primary) or CB/S (coverage primary), LBs are graded across all three phases.
The biggest risk is misclassification: nflverse roster data classifies 3-4 OLB pass rushers (T.J. Watt, Micah Parsons, Haason Reddick, Andrew Van Ginkel) as LB rather than EDGE. Without a filter, those players would dominate the LB leaderboard with high TFL/pressure rates from pass-rush work, not LB skill.
Data Sources#
| Source | Columns | Coverage |
|---|---|---|
| pfr_advstats_def → pfr_def_lb | tackles, missed tackles, pressures, sacks, targets, completions allowed, yards allowed, TDs allowed, INTs | 2018+ |
| nflvs_player_stats → pfr_def_lb | TFL, PBU (pass defended), fumbles forced | 2018+ |
| player_seasons | snaps_defense | 2016+ |
PBU data confirmed populated for off-ball LBs (Fred Warner 5-12 PBUs/yr, Roquan Smith 3-8 PBUs/yr; median qualified LB has 3 PBUs/year, only 7-14% have zero).
Components (v1.2, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| lb_tfl_rate | tfl / snaps_defense | +0.20 | higher = better |
| lb_passer_rating_allowed | NFL passer rating on targeted throws | −0.15 | lower = better |
| lb_missed_tackle_rate | missed / (comb + missed) | −0.15 | lower = better |
| lb_pbu_rate | pbu / targets | +0.05 | higher = better |
| lb_tackle_rate | comb_tackles / snaps_defense | +0.13 | higher = better |
| lb_pressure_rate | pressures / snaps_defense | +0.10 | higher = better |
Sum |weights| = 0.78. Normalized dynamically by composite.combine.
Relative shares: run defense ~62% (TFL 26% + tackle 17% + missed tackle penalty 19%; implicit, see math), coverage ~25% (passer rating 19% + PBU 6%), pass rush ~13%.
Passer rating allowed is computed season-long from (completions_allowed, targets, yards_allowed, tds_allowed, ints) using the standard NFL passer rating formula. PBU rate is PBU-only (not PBU+INT) because INTs are already captured inside passer rating allowed (a single INT lowers rating by ~25 points); double-counting would over-reward turnover-heavy LBs.
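For reference, the standard NFL passer rating formula applied to targets instead of attempts looks like the sketch below. The function name and argument order are hypothetical, returning None for zero targets is an assumption about how the pipeline might treat empty rows, and the 20-target stat line is made up for illustration.

```python
def passer_rating_allowed(completions, targets, yards, tds, ints):
    """Standard NFL passer rating, using targets as the attempt count."""
    if targets <= 0:
        return None  # assumption: no rating without at least one target

    def clamp(value):
        # Each of the four sub-components is bounded to [0, 2.375].
        return max(0.0, min(value, 2.375))

    a = clamp((completions / targets - 0.3) * 5)   # completion percentage
    b = clamp((yards / targets - 3) * 0.25)        # yards per target
    c = clamp((tds / targets) * 20)                # touchdown rate
    d = clamp(2.375 - (ints / targets) * 25)       # interception rate
    return (a + b + c + d) / 6 * 100


# Hypothetical coverage line: 12 completions on 20 targets, 130 yards, 1 TD.
without_int = passer_rating_allowed(12, 20, 130, 1, 0)
with_int = passer_rating_allowed(12, 20, 130, 1, 1)
print(f"{without_int:.1f} -> {with_int:.1f}")
```

In this hypothetical line a single INT moves the rating from about 95.8 to 75.0. The size of the swing grows as the target count shrinks, which is why small LB target samples make the metric noisy.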
Qualification#
| Threshold | Value | Notes |
|---|---|---|
| MIN snaps to grade | 200 | |
| QUALIFIED snaps | 600 | Raised from 400 (other positions) |
| Full confidence snaps | 900 | Raised proportionally |
| MIN targets (absolute) | 15 | Off-ball role filter |
| MIN target rate | 3.5% | Off-ball role filter (targets / snaps) |
Why 600-snap qualified threshold (vs 400 for EDGE/iDL/S): LB per-snap rate stats are heavily inflated by limited-snap rotational specialists (sub-package run stuffers, nickel coverage LBs) whose narrow usage produces per-snap rates that every-down LBs can't match. At a 400-snap threshold, the top-10 LB leaderboard was dominated by 400-500 snap role players over 1000-snap workhorses like Bobby Wagner and Demario Davis. Raising to 600 suppresses this artifact.
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| lb_tfl_rate | 300 snaps | Rare event (~1% of snaps for elite); heavy pull |
| lb_passer_rating_allowed | 50 targets | Passer rating swings 25+ points on one TD or INT; heavy shrinkage |
| lb_missed_tackle_rate | 100 tackle attempts | LBs make 80-180 tackles; moderate shrinkage |
| lb_pbu_rate | 40 targets | Rare event rate per target |
| lb_tackle_rate | 200 snaps | Volume signal, moderate stability |
| lb_pressure_rate | 200 snaps | Most LBs near zero; shrink toward LB mean |
OLB Misclassification Filter#
Problem: nflverse classifies 3-4 OLB pass rushers as LB. Without filtering, they dominate the LB leaderboard via pass-rush production.
Filter: A player must satisfy all of:
- position_played = 'LB'
- snaps_defense >= 200 (MIN to grade)
- targets >= 15 (absolute floor)
- targets / snaps_defense >= 0.035 (target rate floor)
Why a target-rate filter, not a raw-target threshold: A raw threshold like "targets >= 20" lets pass-rush OLBs sneak through on incidental zone drops. Example: Andrew Van Ginkel 2024 had 22 targets (just above a 20 threshold) on 922 snaps, a 2.4% target rate. Pure off-ball LBs run 5-9% target rate regardless of total snap count. The 3.5% threshold cleanly separates the two cohorts.
Players failing the LB filter are not graded as LB for that season. They flow to the EDGE grader instead via the OLB-gap closure branch added in ADR-0020 v1.1 (2026-05-14): the EDGE grader UNIONs pfr_def_lb rows where position_played='LB', pressures ≥ 25, and target_rate < 3.5%. The thresholds are mutually exclusive with the LB filter, so no player is graded twice.
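The LB-side filter and the Van Ginkel example can be worked through in a few lines. This is a sketch of the documented conditions, not the grader's own code; the function name is hypothetical, and the second player is an invented off-ball profile.

```python
def passes_lb_filter(position_played, snaps_defense, targets):
    """Apply the four documented LB filter conditions."""
    return (
        position_played == "LB"
        and snaps_defense >= 200                 # MIN to grade
        and targets >= 15                        # absolute floor
        and targets / snaps_defense >= 0.035     # target rate floor
    )


# Andrew Van Ginkel 2024 as quoted above: 22 targets on 922 snaps.
van_ginkel_rate = 22 / 922
van_ginkel_passes = passes_lb_filter("LB", snaps_defense=922, targets=22)
raw_threshold_passes = 22 >= 20  # the rejected raw-target alternative

# Hypothetical every-down off-ball LB: 70 targets on 1000 snaps (7% rate).
off_ball_passes = passes_lb_filter("LB", snaps_defense=1000, targets=70)

print(f"{van_ginkel_rate:.3f}", van_ginkel_passes, raw_threshold_passes, off_ball_passes)
```

The raw 20-target rule admits the pass-rush OLB while the rate-based rule excludes him, which is the separation the 3.5% threshold is meant to provide.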
Design Rationale (v1.2)#
TFL rate primary positive (+0.20): Cleanest LB run-defense signal — actual play-making behind the LOS, harder to inflate via team-context than raw tackle volume. Now the largest positive weight after the v1.2 rebalance lowered passer_rating_allowed.
Passer rating allowed (−0.15, lowered in v1.2): Industry-standard NFL coverage metric. Combines comp%, yards per attempt, TDs, and INTs into one number. v1.2 lowered the weight from −0.27 to −0.15 because the exhaustive audit revealed both weak reliability (YoY +0.146, just above noise threshold) AND weak predictive validity (−0.071) at LB-specific sample sizes. LBs have ~15-25 targets/qualified season vs 50-120 for DBs, so the same metric is structurally noisier here. Still the primary coverage signal, but right-sized.
Pressure rate (+0.10, bumped in v1.2): Most off-ball LBs rarely rush — Fred Warner / Roquan Smith have 5-12 pressures/yr on 1000 snaps (0.5-1.2% rate) vs. 4-6% for EDGE. v1.2 bumped the weight from +0.07 to +0.10 because the audit revealed pressure_rate has the HIGHEST positive validity (+0.149) of any LB component but was the LOWEST-weighted positive component. Same iDL-style mis-order pattern, but smaller magnitude (kept conservative because base rates are low). Rewards blitz-heavy MLBs (Patrick Queen, Kaden Elliss) without overstating position-wide impact.
PBU rate (+0.05) — PBU-only, not PBU+INT: INTs are already captured inside passer rating allowed. Keeping PBU as a separate component still credits the active "broke up the catch" play without double-counting interceptions.
Missed tackle rate penalty (−0.15): LBs make the most tackles of any position; misses cost the most. Same penalty weight as Safety.
Tackle rate (+0.13): Raw tackle volume has team-context contamination (bad defenses see more snaps, more plays). Audit confirmed strong YoY (+0.475 — highest in formula) but weak validity (+0.052 — voters don't reward tackle volume at LB). Meaningful for skill measurement but not dominant.
Known Limitations#
LB grades are noisier YoY than QB/WR/CB grades. Multiple sources of noise:
- Coverage target samples are small (30-90/yr).
- Scheme assignment shapes which LB is on the field for which plays.
- Yards-per-target is partially zone-dependent.
- TFL volume depends on DL play (penetration creates LB cleanup TFLs).
Expected YoY r band: 0.35-0.50. Wider/lower than offensive skill positions. Below 0.35 → formula issue or filter problem. Above 0.55 → suspicious (likely measuring usage rather than skill).
Per-snap rate vs. holistic film grade: PFF-style snap-level film grading captures technique on every rep including snaps where the LB isn't directly involved in a stat. We can't replicate that with publicly available stats. v1 measures per-snap statistical efficiency, which favors highly productive LBs and slightly disadvantages well-positioned LBs whose work is more about preventing plays than making them.
Some recognizable LBs may grade lower than fan/expert consensus. Fred Warner and Roquan Smith both had statistically below-average 2024 seasons relative to their peaks; our formula reflects the stats, not reputation. This is a feature, not a bug, but worth noting for users surprised by individual rankings.
Pass-rush OLB classification gap (carried from ADR-0020): Original v1 limitation — closed 2026-05-14. T.J. Watt, Micah Parsons, Brian Burns, Nik Bonitto, Jared Verse, Josh Sweat, and ~25 others per season are now graded as EDGE via the OLB-gap closure branch in ADR-0020 v1.1.
Alternatives Considered#
400-snap qualified threshold: Initial draft used 400. Rejected after face-check showed top-10 dominated by 400-500 snap role specialists (Leo Chenal, Edgerrin Cooper, Devin Bush) over every-down workhorses. 600-snap threshold restored consensus-style results (Zack Baun #1 in 2024).
Raw-target threshold for OLB filter: Rejected. 20-target threshold let Andrew Van Ginkel (22 targets, 27 pressures, 11.5 sacks) grade as the #1 LB. Target-rate filter (3.5%) handles all snap-count edge cases.
Equal coverage / run weights (50/50): Considered. Rejected because LBs are primarily second-level run defenders by role (50%+ of their snaps are run plays); 50% run / 37% coverage matches positional usage.
Including completion% allowed: Considered. Rejected because LB completion% is heavily zone-affected — passer rating allowed captures completion% as one of its four sub-components alongside yards, TDs, and INTs, in a more skill-isolated way.
Yards per target allowed as primary coverage metric: Used in the initial v1 release; replaced with passer rating allowed (see Revision History). Yards/target ignored TDs allowed (the premium negative outcome) and didn't reward INTs, leaving meaningful coverage skill un-measured.
Revision History#
v1.2 (2026-05-14) — exhaustive audit rebalance#
Two-component rebalance driven by the exhaustive candidate audit (../grading/audits/2026-05-14-exhaustive-lb.md). 19 candidates were scored against four criteria.
(a) lb_passer_rating_allowed: -0.27 → -0.15. Was the heaviest component (31% of formula). Audit revealed:
- YoY +0.146 (just above noise threshold — vs +0.143 at S/CB but with structurally larger samples there)
- Validity -0.071 (sign correct, magnitude tiny — vs -0.178 at S/CB)
- The metric is genuinely noisier at LB sample sizes (15-25 targets/season per qualified LB vs 50-120 for DBs)
- Pro Bowl voters reward LB coverage less than they reward DB coverage
Right-sized to its real signal strength. Still primary coverage signal at 19% of formula share.
(b) lb_pressure_rate: +0.07 → +0.10. Modest bump. Audit revealed:
- Highest positive validity in the formula (+0.149)
- Strong YoY (+0.407)
- Was the LOWEST-weighted positive component despite being most voter-validated
Conservative bump (vs iDL's larger +0.05 swing) because LB base pressure rates are very low and over-weighting could push blitz-specialists too high.
No new components added. The 6-component LB formula was confirmed structurally complete by the audit. All 13 new candidates were rejected:
- PFR passer-rating sub-components (comp_pct, yards/tgt, int_rate, td_rate): subsumed (+0.51-0.63 correlation with PR_allowed)
- Pass-rush sub-components (qb_hits, hurries, sack_rate): subsumed by pressure_rate (+0.70+ correlation)
- sack_per_pressure, hit_per_pressure: small samples or near-zero validity
- forced_fumble_per_snap, int_per_snap: rare-event noise (xsect 0.00)
- adot_allowed, yac_per_target_allowed: noise / subsumed
LB has a structural validity ceiling. Baseline +0.179 is the lowest of any audited position because Pro Bowl voting at LB is driven more by reputation than by box-score stats (the well-known "stats vs reputation" gap). Roquan Smith — universally regarded top-3 LB — grades #19 in 2024 because his box-score numbers don't reflect his consensus standing. No formula change can fix this without encoding reputation directly.
Validity gate: LB composite vs next-year Pro Bowl correlation +0.179 → +0.198 (+0.019). Strongest relative gain (+11%) of any defensive audit. Largest absolute Path A rebalance in the system. Holds; no rollback.
Face-check 2024: Top 2 unchanged in spirit — Zack Baun (DPOY runner-up, 1st-Team All-Pro) #1, Blake Cashman (Pro Bowl) #2. Movers include Roquan Smith (rose from below-cohort-median because he was being penalized by coverage stats and rewarded modestly by pressure). Notable stats-vs-reputation gap persists for Smith (#19).
Weight totals: v1.1 sum |abs| = 0.87 → v1.2 sum |abs| = 0.78.
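The weight arithmetic in this revision can be re-derived with a few lines. This is a minimal sketch: only the two changed weights and the v1.1 total come from this ADR, and the helper names are illustrative rather than part of the pipeline.

```python
# Sketch: recompute the v1.2 weight total and formula shares from the
# numbers quoted in this revision. Helper names are illustrative only.

V1_1_TOTAL = 0.87  # sum of |weights| before the rebalance (from this ADR)

# (old weight, new weight) for the two components that changed in v1.2
CHANGES = {
    "lb_passer_rating_allowed": (-0.27, -0.15),
    "lb_pressure_rate": (+0.07, +0.10),
}


def new_total(old_total: float, changes: dict) -> float:
    """Adjust a sum of absolute weights for a set of weight changes."""
    delta = sum(abs(new) - abs(old) for old, new in changes.values())
    return round(old_total + delta, 2)


def share(weight: float, total: float) -> float:
    """Share of the formula held by one component after normalization."""
    return abs(weight) / total


v1_2_total = new_total(V1_1_TOTAL, CHANGES)
print(v1_2_total)                          # 0.78
print(round(share(-0.15, v1_2_total), 2))  # 0.19
print(round(share(+0.10, v1_2_total), 2))  # 0.13
```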
v1.0 (2026-05-14) — initial release + passer-rating revision#
2026-05-14 (initial release): Used lb_yards_per_target_allowed (−0.20) and lb_pbu_int_rate (+0.13). Face-check on 2024 / 2023 showed elite consensus LBs (Fred Warner in his All-Pro 2023 season, Roquan Smith) graded lower than expected because yards/target is heavily scheme-dependent for LBs and doesn't capture TDs allowed or INT events.
2026-05-14 (passer rating revision): Replaced lb_yards_per_target_allowed with lb_passer_rating_allowed (weight −0.27, increased from −0.20). Split lb_pbu_int_rate → lb_pbu_rate (PBU-only, weight 0.08, decreased from 0.13) since INTs are now captured inside passer rating allowed. Reduced lb_tfl_rate from 0.22 → 0.20 to absorb the redistributed weight. Sanity-checked vs. 2025 CB and Safety cohorts — passer rating allowed produced clean signal at all three positions; flagged as candidate for CB v1.1 and Safety v1.1.
Face-check after revision: 2024 top 10 has Zack Baun #1, T.J. Edwards #2, Bobby Wagner #6 — all consensus picks. 2023 has Fred Warner #5 (All-Pro year), up from outside top 15 in the initial release. 2025 has Devin Lloyd #1 (5 INTs, elite coverage year).
v1.1 (2026-05-14, second revision) — lb_pbu_rate weight lowered (noise)#
Lowered lb_pbu_rate from +0.08 → +0.05. Sum |w| drops 0.90 → 0.87; combiner normalizes so the signal-strong components get marginally more effective weight.
Why: Cross-position YoY audit (2026-05-14) found mean YoY r = 0.085 across 2018-2025 for lb_pbu_rate — same noise pattern as iDL missed_tackle_rate (0.080). Since INTs are already captured inside lb_passer_rating_allowed (the −0.27 component), lb_pbu_rate was already a narrow "broke up the catch" signal; with weak YoY it was barely carrying its weight.
Why not removed entirely: Schema-stable change preferred. Cross-sectional spread is real (active PBU plays show up in the data), so the metric captures something. Light weight (+0.05) bounds the noise contribution without removing a real signal-carrier completely. If a later audit shows persistent weak signal we can remove.
Face-check 2024: Top 4 unchanged (Baun, Edwards, Dean, Hicks). Movers small (±2.4 max) — expected from a small weight delta. No reshuffles in the top half of the cohort.
Audit data: memory/project_cross_position_yoy_audit.md. Shipped via the new preview/regrade workflow.
ADR-0023 — K v1 Grading Formula
- Status: Accepted (v1.1 FGOE correction, 2026-05-14)
- Date: 2026-05-14
Context#
Kickers are the first new graded position added under the "do it right" audit-first methodology (master plan locked 2026-05-14). Every weight in K v1 was decided after running the four-criterion exhaustive candidate audit (../grading/audits/2026-05-14-exhaustive-k.md), not designed first and audited later.
v1 scope: placekicking only (FG + XP). Kickoffs deferred to v2 because the 2024 dynamic-kickoff rule change broke year-over-year continuity of touchback/return rates. A future v2 add can revisit kickoff metrics once 2-3 years of post-rule-change data exists.
Data Sources#
| Source | Columns | Coverage |
|---|---|---|
| nflvs_player_stats → kicker_stats | fg_att, fg_made (overall and by distance bucket), pat_att, pat_made, fg_long, gwfg_att, gwfg_made | 2016+ |
Grain: one row per (player_id, season). Ingest filters to position='K', season_type='REG', sums per-game counts to season totals.
Components (v1.1, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| k_fg_over_expected_per_att | (total_makes − expected_makes) / total_att where expected_makes = Σ attempts_bucket × baseline_bucket (FG buckets + XP folded in) | +1.00 | higher = better |
Single-component formula. Sum |weights| = 1.00. No normalization needed.
League baselines (computed from kicker_stats 2016-2024)#
| Distance bucket | Baseline make rate | n_att in baseline window |
|---|---|---|
| 0-19 yd | 100.0% | 42 |
| 20-29 yd | 98.4% | 2,093 |
| 30-39 yd | 93.6% | 2,587 |
| 40-49 yd | 79.6% | 2,662 |
| 50-59 yd | 69.0% | 1,563 |
| 60+ yd | 40.0% | 65 |
| XP (post-2015 rule) | 94.3% | 10,941 |
Baselines are frozen as constants (K_V1_1_BASELINES in weights.py) so grades reproduce season-to-season without recomputing the baseline (era-fixed yardstick).
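As a minimal sketch of how the component could be computed from the frozen baselines: the rates below are copied from the table above, while the dictionary layout, bucket keys, function name, and the sample season line are assumptions for illustration, not the contents of weights.py.

```python
# Sketch of FG Over Expected per attempt using the frozen baselines above.
# The dict layout and function name are assumptions, not the pipeline's API.

K_V1_1_BASELINES = {
    "0-19": 1.000,
    "20-29": 0.984,
    "30-39": 0.936,
    "40-49": 0.796,
    "50-59": 0.690,
    "60+": 0.400,
    "xp": 0.943,
}


def fg_over_expected_per_att(attempts: dict, makes: dict) -> float:
    """(total_makes - expected_makes) / total_att across all buckets."""
    total_att = sum(attempts.values())
    if total_att == 0:
        return float("nan")  # no attempts: the metric is undefined
    expected = sum(n * K_V1_1_BASELINES[bucket] for bucket, n in attempts.items())
    total_makes = sum(makes.get(bucket, 0) for bucket in attempts)
    return (total_makes - expected) / total_att


# Hypothetical season line, for illustration only.
attempts = {"30-39": 10, "40-49": 10, "50-59": 5, "xp": 40}
makes = {"30-39": 10, "40-49": 9, "50-59": 4, "xp": 39}
print(round(fg_over_expected_per_att(attempts, makes), 4))
```

Because the baselines are constants, the same season line produces the same value no matter which later seasons have been ingested.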
Per-attempt mechanics#
- 60-yard make → +0.60 over expected (large reward)
- 60-yard miss → -0.40 (modest penalty — it was hard)
- 30-yard make → +0.06 (tiny reward — expected)
- 20-yard miss → -0.98 (massive penalty — easy kick)
- XP make → +0.06 (rounding error, basically free)
- XP miss → -0.94 (heavily penalized)
This is risk-asymmetric by construction. A kicker like Brandon Aubrey who attempts 15 FGs from 50+ is penalized less for each miss than he would be on a short kick (the expected baselines are lower) and is heavily rewarded for the makes. A kicker whose coach never lets them try past 45 doesn't get a "safe" path to a high grade: they earn what they kick.
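The asymmetry is easiest to see by scoring two kick profiles side by side. In this sketch the baselines are the frozen rates from this ADR, but both kickers and the helper names are hypothetical.

```python
# Sketch: per-kick over-expected value, and why raw FG% and FGOE can disagree.
# Baselines are the frozen rates from this ADR; both kickers are hypothetical.

BASELINE = {"20-29": 0.984, "30-39": 0.936, "40-49": 0.796, "50-59": 0.690, "60+": 0.400}


def kick_value(bucket: str, made: bool) -> float:
    """Over-expected value of one kick: outcome (1 or 0) minus the baseline."""
    return (1.0 if made else 0.0) - BASELINE[bucket]


def season(kicks: list) -> tuple:
    """Return (raw FG%, FGOE per attempt) for a list of (bucket, made) kicks."""
    fg_pct = sum(made for _, made in kicks) / len(kicks)
    fgoe = sum(kick_value(bucket, made) for bucket, made in kicks) / len(kicks)
    return fg_pct, fgoe


# Kicker A: 20 kicks from 30-39 with one miss.
# Kicker B: 10 makes from 30-39 plus 7-of-10 from 50-59.
kicker_a = [("30-39", True)] * 19 + [("30-39", False)]
kicker_b = [("30-39", True)] * 10 + [("50-59", True)] * 7 + [("50-59", False)] * 3

pct_a, fgoe_a = season(kicker_a)
pct_b, fgoe_b = season(kicker_b)
print(f"A: FG% {pct_a:.3f}, FGOE/att {fgoe_a:+.3f}")
print(f"B: FG% {pct_b:.3f}, FGOE/att {fgoe_b:+.3f}")
```

Kicker A has the higher raw FG%, but kicker B has the higher FGOE per attempt, which is the ordering the formula is designed to produce.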
Qualification (FG-attempt based)#
| Threshold | FG attempts |
|---|---|
| MIN to grade | 10 |
| QUALIFIED (main leaderboard) | 20 |
| Full confidence | 30 |
Why FG-attempt based, not snap-based: Kickers don't have meaningful snap counts (special teams only). FG attempts directly measure the workload that produces our component metrics.
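A minimal sketch of the tiering implied by the table. The thresholds come from this ADR and MIN_FG_ATT_TO_GRADE is named elsewhere in it; the other constant names, the tier labels, and the function are hypothetical.

```python
# Sketch of the FG-attempt qualification tiers. Thresholds are from the table
# above; the tier labels and function name are illustrative.

MIN_FG_ATT_TO_GRADE = 10
QUALIFIED_FG_ATT = 20        # hypothetical constant name
FULL_CONFIDENCE_FG_ATT = 30  # hypothetical constant name


def k_tier(fg_att: int) -> str:
    """Map a season FG-attempt count to its qualification tier."""
    if fg_att < MIN_FG_ATT_TO_GRADE:
        return "ungraded"
    if fg_att < QUALIFIED_FG_ATT:
        return "graded"  # graded, but kept off the main leaderboard
    if fg_att < FULL_CONFIDENCE_FG_ATT:
        return "qualified"
    return "full_confidence"


for n in (6, 14, 25, 34):
    print(n, k_tier(n))
```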
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| k_fg_over_expected_per_att | 15 attempts (FG + XP total) | Low-workload kickers (rookies, injury fill-ins) get pulled toward FGOE = 0 (league mean) |
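This ADR gives the shrinkage constant but not the shrinkage formula. The sketch below assumes the common n / (n + k) blend toward the prior; that form, and the function name, are assumptions rather than the pipeline's confirmed implementation.

```python
# Sketch of shrinkage toward the league mean. ASSUMPTION: the common
# n / (n + k) weighting; the ADR states k but not the formula itself.

def shrink(raw: float, n_att: int, k: float = 15.0, prior: float = 0.0) -> float:
    """Blend a raw per-attempt value with the prior, weighted by sample size."""
    if n_att <= 0:
        return prior
    w = n_att / (n_att + k)
    return w * raw + (1.0 - w) * prior


print(shrink(0.08, n_att=15))  # raw value gets half weight when n == k
print(shrink(0.08, n_att=75))  # a full workload keeps most of the raw value
```

Under this assumed form, a fill-in kicker with 15 total attempts keeps half of his raw FGOE, while a full-season kicker with 75 attempts keeps five sixths of it.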
Design Rationale (v1.1)#
Single principled metric. The v1.1 formula is one number — FG Over Expected per attempt — that comprehensively captures kicker skill:
- Accuracy: automatic. Every kick is scored vs its distance baseline.
- Range: automatic. Making a 55-yarder is worth ~19x more than making a 25-yarder (+0.31 vs +0.016 against the bucket baselines).
- Risk-asymmetry: automatic. A 60-yard miss costs little (it was hard); an XP miss is devastating (it shouldn't have been).
- XPs: folded in as a 7th distance bucket (post-2015 rule, ~94% baseline). Missing XPs hurts the grade as much as it should.
No additional components needed. The v1.1 audit confirmed every other plausible kicker metric is either subsumed by FGOE/att or pure noise:
- k_fg_pct (overall): redundant with FGOE (which uses the same makes but weights by difficulty)
- k_fg_pct_40_plus (the v1 primary): redundant; FGOE handles 40+ explicitly via buckets and is more granular
- k_pat_pct: folded INTO FGOE as the XP bucket
- k_fg_long: validity ≈ 0; conceptual "power" is already expressed through FGOE on 50+ attempts
- k_gwfg_pct: noise (n=49, validity 0.000)
- k_fg_pct_short: anti-skill (negative YoY due to regression to ceiling)
The methodology page surfaces context columns (raw FG%, longest FG, XP%) on the leaderboard for reader recognition, but they're labeled CONTEXT (not in formula) via the two-tier header. The grade itself is FGOE/att alone.
Rejected Candidates (audit log)#
Documented in the audit doc; summarized here for the article-defensibility goal:
- k_fg_pct_short (0-39 yards): YoY r = -0.135. Negative YoY: regression to ceiling, not a skill signal. Excluded.
- k_fg_pct_50_plus: YoY r = +0.004 (essentially zero), small samples (3-8 attempts). Subsumed by k_fg_pct_40_plus.
- k_fg_pct_40_49: Smaller sub-bucket of k_fg_pct_40_plus. Subsumed.
- k_gwfg_pct: Validity r = 0.000, n=49. Pure noise (game-winning FGs are 2-5 per kicker per season).
- k_fg_att_per_game: Usage marker; good teams attempt more FGs because they drive deeper. Not a skill signal.
NaN Handling#
Standard NaN-neutralization (ADR-0015): if a component's z-score is NaN (missing source data), it's replaced with 0.0 before entering the composite.
Known NaN sources (these were v1 components; under v1.1 they are CONTEXT columns and do not enter the grade):
- k_fg_pct_40_plus: NaN if the kicker had zero 40+ attempts (rare; happens for backup kickers in committees).
- k_fg_long: NaN if no FG attempts. Filtered out by the MIN_FG_ATT_TO_GRADE threshold (10).
- k_pat_pct: NaN if zero XP attempts (extremely rare; offensive scheme dependent).
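The neutralization rule itself is small enough to sketch directly. The function name and the dict-of-z-scores shape are assumptions; only the "NaN becomes 0.0" behavior comes from the text above.

```python
# Sketch of NaN-neutralization: a missing z-score contributes nothing to the
# composite because 0.0 is the cohort mean on the z-scale.
import math


def neutralize(z_scores: dict) -> dict:
    """Replace NaN z-scores with 0.0; leave every other value untouched."""
    return {name: 0.0 if math.isnan(z) else z for name, z in z_scores.items()}


# Hypothetical component z-scores for one kicker-season.
z = {"k_fg_over_expected_per_att": 1.25, "k_fg_pct_40_plus": float("nan")}
print(neutralize(z))
```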
Alternatives Considered#
Expected FG% (xFG) model: A distance-adjusted accuracy metric (compare actual makes to expected makes from league baseline rates by distance). More sophisticated than k_fg_pct_40_plus alone, and handles distance distribution differences across kickers. Rejected for v1.0 to keep the formula transparent and bound to raw nflverse columns, then adopted the same day as v1.1's k_fg_over_expected_per_att (see Revision History).
Including k_gwfg_pct at light weight: Tempting because "clutch kicker" is a real concept fans believe in. Rejected because the audit returned validity r = 0.000 (no signal) and n=49 (only kickers with multiple GWFG attempts). Reputation-driven, not stat-driven.
Including kickoff metrics (touchback rate, hangtime): Rejected for v1 because the 2024 dynamic-kickoff rule change made touchback rate non-comparable to pre-2024 data. Will revisit after 2-3 post-change seasons.
Snap-based qualification: Kickers don't have meaningful snap counts. FG-attempt threshold is the right denominator.
Known Limitations#
Lowest validity baseline of any position graded when this ADR was written (+0.165 for v1; +0.153 after the v1.1 revision). Documented honestly in the audit doc. Kicker stats are structurally noisy:
- Small per-season samples (~30 FG attempts)
- Distance distribution varies by team (some kickers get more long opportunities)
- Pro Bowl K voting is reputation-driven (only 2 K Pro Bowls/year out of ~30 qualified)
k_fg_pct carried weak audit signal and was kept in v1 on definitional grounds (reader-recognizable). Under v1.1 it is a CONTEXT column only and carries no weight in the grade.
No wind/weather/dome adjustment. Outdoor kickers in Buffalo / Cleveland / Chicago face harder conditions than indoor kickers in Dallas / Detroit / Indianapolis. An adjusted-environment FG% would be a v2 candidate but requires per-game weather data we don't currently ingest.
Coverage starts 2016. nflvs_player_stats has older data but we cap at 2016 to match coverage with other position grades.
Pre-2015 XP attempts are noisier as a signal because XPs were 19-yard FGs (~99% league-wide). The 2016+ data covers the post-rule-change era exclusively.
Consequences#
- K grades available from 2016 onward (2016-2025 graded as of v1 ship).
- Pipeline requires: kicker_stats table (migration 0016), nflvs_player_stats ingest (already running for other positions).
- To regenerate grades: `nflgrades grade --season <year> --position K` for each season 2016-2025.
- The lowest baseline validity becomes the "K floor" for cross-position comparisons. Documenting this is part of the audit-first article-defensibility goal.
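For a full regrade, the per-season commands can be generated rather than typed. This sketch only builds and prints the argument lists; the CLI name and flags are as quoted in this section, and nothing beyond them is assumed about the tool's behavior.

```python
# Sketch: build the regrade command for every K season, 2016-2025 inclusive.
# The commands are printed, not executed.
import shlex

SEASONS = range(2016, 2026)

commands = [
    ["nflgrades", "grade", "--season", str(season), "--position", "K"]
    for season in SEASONS
]

for argv in commands:
    print(shlex.join(argv))
    # To execute instead: subprocess.run(argv, check=True)
```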
Future Work#
v2 candidates (not for v1.1):
- Kickoff metrics post-2024 rule change (touchback rate, hangtime if data becomes available).
- Wind/weather/dome-adjusted baselines (requires per-game weather ingest). Currently baselines are pooled across all stadiums/conditions, which mildly disadvantages outdoor cold-weather kickers (Buffalo, Cleveland, Chicago) vs indoor kickers (Detroit, Atlanta, Indy).
- Per-attempt model: instead of bucketed baselines, fit a smooth function p_make(distance) from PBP-level FG data. Marginal precision gain; v1.1's bucketed approach is interpretable and matches how kickers are discussed.
Any v2 add will go through the same four-criterion audit before shipping.
Revision History#
v1.1 (2026-05-14) — FGOE design correction (same-day)#
Replaced the v1 formula entirely. v1's four-component design (k_fg_pct_40_plus +0.40, k_fg_pct +0.25, k_pat_pct +0.15, k_fg_long +0.10) actively punished kickers who attempted long FGs. A 60-yard miss hurt v1's k_fg_pct identically to a 35-yard miss, and hurt k_fg_pct_40_plus identically to a 40-yard miss, even though a 60-yarder is a 40% kick at league baseline and a 35-yarder is a near-certain make (93.6%). Brandon Aubrey is the case study: in 2024 he attempted 15 FGs from 50+ (most in the league) and made most of them; v1 graded him #4 because the missed long-range attempts dragged his raw rates down. A kicker whose coach never sent them past 45 looked better.
v1.1 fix: single component, k_fg_over_expected_per_att. Each kick is compared to the league baseline for its distance (computed from 2016-2024 data and frozen as constants). Risk-asymmetric by construction.
Audit support: the v1 audit had already scored FGOE as a candidate. Its YoY r = +0.126 is the highest of any K candidate eligible to stand alone (k_pat_pct at +0.211 was disqualified as standalone since it doesn't capture FG range). Validity r = +0.091 is modest and within the noise floor for K, but the philosophical case carries. See docs/grading/audits/2026-05-14-exhaustive-k.md.
Face-check 2024 (v1 → v1.1 movement):
- Chris Boswell: #1 → #1 (1st-Team All-Pro, consensus #1, formula agrees both ways)
- Brandon Aubrey: #4 → #2 (the headline correction — formula now rewards his 50+ make rate properly)
- Nick Folk: #2 → #3
- Wil Lutz: #5 → #4
- Justin Tucker (historic collapse): #28 → #23 — still well below average, but FGOE penalizes his misses less because some were long
- Jake Moody, Dustin Hopkins: bottom 2 in both versions (lost their jobs)
- Cameron Dicker (NFC Pro Bowl): #8 → #10 — his lower FG attempt count hurts him slightly more under FGOE
Validity gate: v1 composite r = +0.165 → v1.1 r = +0.153 (-0.012). Slight drop in Pro Bowl-prediction strength, well within noise floor for the K validity ceiling. Pro Bowl voting at K is reputation-driven; the drop reflects that voters reward FG% more than FGOE (which is a known voter behavior, not a formula flaw). The philosophical correctness of FGOE is the test, not validity for this position.
Leaderboard UI change: added a two-tier "FORMULA / CONTEXT" header pattern to the K leaderboard (PFF-style grouped header). The single FGOE/att column sits under FORMULA; raw FG%, FG% 40+, XP%, and longest FG are shown under CONTEXT for reader recognition without being scored. Pattern is K-only for now; could generalize to other positions later.
v1.0 (2026-05-14, deprecated same-day)#
Initial release with four raw make-rate components. Replaced within hours by v1.1 after recognizing the risk-aversion flaw. Documented here for the audit log.
ADR-0024 — P v1 Grading Formula
- Status: Accepted (v1.1 blocked_rate removed, 2026-05-14)
- Date: 2026-05-14
Context#
Punters are the eleventh and final graded position in the foundation set. Designed audit-first per the locked methodology. The K v1.1 lesson (FGOE over raw rates) informed the candidate set — over-expected metrics were tested from the start.
Audit finding (critical): Unlike K, where FGOE per attempt cleanly dominated all alternatives, the analogous over-expected metric for P (EPA per punt) did not dominate raw rate metrics. p_net_avg beat p_epa_per_punt on both YoY reliability (+0.355 vs +0.269) and Pro Bowl validity (+0.166 vs +0.163). The K story does not generalize for punters because punt EPA mixes punter skill with opponent quality (returner, coverage), field position, and game state.
We therefore went with Option B: a multi-component formula composing the two strongest individual signals (net average + inside-20 placement rate) plus a small block-rate penalty (removed the same day in v1.1; see Revision History). Option A (single-component EPA per punt) was considered and rejected for lack of audit dominance.
v1 scope: all punting outcomes captured in nflverse pbp. No hangtime data (not available in nflverse).
Data Sources#
| Source | Columns | Coverage |
|---|---|---|
| pbp → punter_stats | punts, gross_yards, return_yards, net_yards, inside_20, touchbacks, blocked, fair_catches, out_of_bounds, downed, epa_total, long_punt | 2016+ |
Grain: one row per (player_id, season). Aggregated from pbp rows where punt_attempt=1, grouped by punter_player_id, REG-season only.
Components (v1.1, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| p_net_avg | net_yards / punts | +0.55 | higher = better |
| p_inside_20_rate | inside_20 / punts | +0.30 | higher = better |
Sum |weights| = 0.85. Normalized dynamically by composite.combine.
Relative shares: net average 65%, inside-20 placement 35%.
(Block% remains visible on the leaderboard as a CONTEXT column — pulled directly from punter_stats raw counts — but is not scored. See v1.1 revision history for rationale.)
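A minimal sketch of the two components and the normalized composite. The column names and weights come from the tables above; the assumption is that composite.combine divides the weighted z-score sum by the sum of absolute weights, which this ADR implies but does not show. The season row and function names are hypothetical.

```python
# Sketch: the two P v1.1 components from a punter_stats row, plus a weighted
# composite normalized by the sum of |weights|. ASSUMPTION: this mirrors what
# composite.combine does; the real combiner is not shown in this ADR.

P_V1_1_WEIGHTS = {"p_net_avg": 0.55, "p_inside_20_rate": 0.30}


def p_components(row: dict) -> dict:
    """Per-punt rates for one (player_id, season) row."""
    punts = row["punts"]
    return {
        "p_net_avg": row["net_yards"] / punts,
        "p_inside_20_rate": row["inside_20"] / punts,
    }


def combine(z_scores: dict, weights: dict) -> float:
    """Weighted sum of z-scores divided by the sum of absolute weights."""
    total = sum(abs(w) for w in weights.values())
    return sum(weights[name] * z_scores[name] for name in weights) / total


row = {"punts": 60, "net_yards": 2580, "inside_20": 24}  # hypothetical season
print(p_components(row))  # {'p_net_avg': 43.0, 'p_inside_20_rate': 0.4}

# One standard deviation above the cohort on net average only:
print(round(combine({"p_net_avg": 1.0, "p_inside_20_rate": 0.0}, P_V1_1_WEIGHTS), 3))  # 0.647
```

The 0.647 in the last line is the 65% relative share of net average quoted above.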
Qualification (punt-count based)#
| Threshold | Punts |
|---|---|
| MIN to grade | 25 |
| QUALIFIED (main leaderboard) | 40 |
| Full confidence | 60 |
Most starting punters have 50-80 punts per season. 40-punt threshold filters mid-season callups and committee splits without being overly restrictive.
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| p_net_avg | 10 punts | Light: starters have 50-80 punts |
| p_inside_20_rate | 15 punts | Moderate: bucketed event, ~30% league-wide |
Design Rationale#
Net average dominant (+0.55): Net yards per punt captures the actual field-position outcome — both punter leg strength (gross distance) and coverage-team / placement performance (return prevention). The audit found this is the most YoY-stable signal in the punter feature set (+0.355) AND the second-highest validity (+0.166). Net is also already implicitly risk-asymmetric: touchbacks cap the net (ball gets placed at the 20 regardless of how far the punt traveled), so a punter who blasts everything past the goal line gets credit only up to the touchback line.
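The touchback cap can be shown on single punts. This is a simplified sketch (no penalties, blocks, or muffs); the function and its field-position convention are illustrative, not how pbp computes net yards.

```python
# Sketch: net yards on one punt, showing how a touchback caps the net.
# Field positions are yards from the punting team's own goal line (0-100).
# ASSUMPTION: simplified rules, no penalties or blocks.

TOUCHBACK_SPOT = 20  # receiving team takes over at its own 20


def net_yards(los: int, gross: int, return_yards: int = 0) -> int:
    """Net yards for a punt from `los` that travels `gross` yards."""
    yards_to_goal = 100 - los
    if gross >= yards_to_goal:
        # Ball reached the end zone: touchback, however far it travelled.
        return yards_to_goal - TOUCHBACK_SPOT
    return gross - return_yards


print(net_yards(los=40, gross=60))                   # 40 (touchback)
print(net_yards(los=40, gross=70))                   # 40 (extra distance earns nothing)
print(net_yards(los=40, gross=55))                   # 55 (downed at the 5)
print(net_yards(los=40, gross=50, return_yards=12))  # 38
```

From the same line of scrimmage, the controlled 55-yard punt nets 15 more yards than the 70-yard blast.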
Inside-20 rate (+0.30): Captures placement skill, meaning the ability to angle a punt so it pins opponents deep without bouncing into the endzone. This is orthogonal to net avg: a 40-yard punt downed at the 5 looks identical to a 40-yard punt downed at the 35 by net average. The audit gave this effectively the highest validity of any candidate (+0.188, a near-tie with the +0.189 of its close variant p_i20_minus_tb_per_punt). Lower YoY (+0.168) reflects that inside-20 attempts are partly opportunity-driven (you only get them when punting from the right field position).
Block-rate (removed in v1.1): v1 had p_blocked_rate at -0.05 weight on "punter conceptually owns the play" grounds. v1.1 removed it. The audit's near-zero YoY (-0.046) and validity (-0.046) showed it carries no skill signal, and most blocks are snap/protection failures. Even at small weight, the penalty was punishing punters for their teammates' mistakes. Block% remains on the leaderboard as context but is not scored.
iDL vs EDGE-style decision rejected (Option A): A single-component p_epa_per_punt formula was considered — analogous to K v1.1's FGOE/att. EPA per punt comprehensively captures distance, placement, return prevention, and blocks. But the audit showed it does NOT dominate net average:
| Metric | YoY r | Validity r |
|---|---|---|
| p_net_avg | +0.355 | +0.166 |
| p_epa_per_punt | +0.269 | +0.163 |
Why the K analogy doesn't carry: FG kicks have well-defined distance baselines (every 40-yard FG faces a similar challenge). Punt EPA depends on the returner, the coverage team, the wind, the field surface — mixing in non-punter-skill variance that dilutes the signal. We documented Option A as an alternative considered and chose B.
Rejected Candidates (audit log)#
- p_gross_avg: YoY +0.334, validity +0.113. Subsumed by net avg (which uses the same gross yards but accounts for return).
- p_touchback_rate: validity −0.030 (near-zero). Touchback avoidance is implicitly handled by net average (touchbacks cap net), so a separate weighted component would double-count.
- p_return_yards_per_punt: validity −0.083 (sign correct, lower = better). Same subsumption logic as touchback: net avg already credits return prevention.
- p_fair_catch_rate: validity +0.013, near-zero. Fair catches are a returner decision driven by hangtime and coverage, not strictly a punter skill we can isolate.
- p_i20_minus_tb_per_punt: validity +0.189, basically ties inside_20_rate. Considered as an alternative to inside_20 but YoY r is nearly identical (+0.163 vs +0.168) and the simpler "inside 20%" framing is more reader-recognizable.
- p_long_punt: YoY +0.077, validity +0.076. Weak both ways. Power signal exists but is dominated by net avg's continuous treatment.
- p_epa_per_punt: YoY +0.269, validity +0.163. The comprehensive over-expected metric. Did not dominate alternatives; would be a viable Option A formula but Option B (multi-component) won on individual-signal strength.
NaN Handling#
Standard NaN-neutralization: if a component's z-score is NaN, it's replaced with 0.0 before entering the composite.
Known NaN sources:
- p_blocked_rate (CONTEXT only as of v1.1): NaN if punts = 0 (filtered out by MIN_PUNTS_TO_GRADE).
Alternatives Considered#
Option A — single-component EPA per punt: See "Design Rationale" above. The over-expected approach didn't dominate, so we chose multi-component.
Including p_gross_avg: Tempting because it's the conventional punter headline. Rejected because it's strictly subsumed by net average (which uses the same gross yards but is more informative).
Hangtime data: Would be the ideal placement-skill metric. Not in nflverse. PFF charts hangtime but is paid. Deferred until a free source emerges.
Negative weight on returns/touchbacks: Both are implicitly in net average. Adding them as separate components would double-count.
Using i20_minus_tb_per_punt instead of inside_20_rate: Functionally equivalent in this audit (validity +0.189 vs +0.188). The simpler "inside 20%" rate is more reader-recognizable.
Known Limitations#
Lowest validity baseline of any graded position (+0.122). Even lower than K (+0.153). Reasons:
- Only 2 P Pro Bowls per year out of ~30 qualified (~7% rate)
- Punter Pro Bowl voting is heavily reputation-weighted
- Net average and inside-20 are noisier YoY than offensive position primary metrics
No hangtime adjustment. A high hang time / short punt that's downed at the 5 is great punting; net avg sees it as a mediocre 35-yard kick. Inside-20 rate partially captures this. PFF-style hang time data would fix the gap but requires paid data.
Block rate is mostly snap/protection failure. Documented above. This is why v1.1 removed the −0.05 penalty; Block% is shown as context only.
Coverage starts 2016. Earlier punter data exists in pbp but we cap at 2016 to match other position coverage windows.
Consequences#
- P grades available from 2016 onward.
- Pipeline requires: punter_stats table (migration 0017), pbp ingest (already running).
- To regenerate grades: `nflgrades grade --season <year> --position P` for each season 2016-2025.
- Leaderboard uses the two-tier FORMULA / CONTEXT header pattern introduced for K: Net avg / Inside 20% sit under FORMULA; Block% / Gross avg / Long / TB% sit under CONTEXT.
Future Work#
v2 candidates:
- Hangtime once accessible.
- EPA per punt as an additional small-weight component if v2 validity testing supports it.
- Field-position-adjusted net average (different baseline by LOS).
- Net-over-expected (NEPA) model — would be the cleanest "over-expected" version if we can build a stable model.
Any v2 add will go through the same four-criterion audit before shipping.
Revision History#
v1.1 (2026-05-14) — p_blocked_rate removed (same-day)#
Removed p_blocked_rate from the formula (was -0.05 in v1). v1 kept it at small weight on "punter conceptually owns the play" grounds, but two issues made it unjustifiable:
- Audit signals were near-zero. YoY r = -0.046 (essentially noise — slightly negative), validity r = -0.046 (sign correct but tiny magnitude). Even at -0.05 weight the metric wasn't measuring skill, just adding noise to the composite.
- Most blocks are not punter responsibility. Snap quality and protection collapses cause the majority of blocks. Including it as a penalty punishes punters for their teammates' failures.
The event is also rare (1-2 per punter per season), so the per-punt rate is dominated by variance even at qualified samples.
Block% remains visible on the leaderboard as a CONTEXT column, pulled directly from punter_stats raw counts (not from stat_components since it's no longer scored). Readers can still see who got blocked; it just doesn't affect the grade.
Validity gate: unchanged at +0.122 (matches expectation — blocked_rate had near-zero contribution to the composite).
Face-check 2024: Top 3 unchanged (Jack Fox #1, Tommy Townsend #2, Logan Cooke #3). Small movements lower in the order: AJ Cole rose from #8 → #6 (had a 2.99% block rate — removing the small penalty helped him).
Weight totals: v1 sum |abs| = 0.90 → v1.1 sum |abs| = 0.85. Combiner normalizes, so net_avg's effective share rose from 61% → 65% and inside_20_rate from 33% → 35%.
v1 (2026-05-14, deprecated same-day for blocked_rate piece)#
Initial release with three components: p_net_avg +0.55, p_inside_20_rate +0.30, p_blocked_rate -0.05. Audit-first design via Option B (multi-component) chosen over Option A (single-component EPA per punt) because the K v1.1 FGOE-dominance pattern didn't generalize for punters. Replaced same-day by v1.1 after recognizing the blocked_rate flaw.
ADR-0025 — OL v1 Grading Formula (TEAM-LEVEL)
- Status: Accepted (v1 audit-first release, 2026-05-14)
- Date: 2026-05-14
Context#
Offensive line is the twelfth and final position in the foundation queue. Designed audit-first per the locked methodology. This ADR is structurally different from every prior position ADR because the grading entity is a team-season, not a player-season.
Why team-level (and not per-player)#
nflverse data does not attribute pressures, sacks, run-blocking lanes, or pulled-block assignments to specific offensive linemen. Without paid PFF film grades, individual OL grading is not computable. The play-by-play feed records "X played QB, Y rushed, Z caught" but not "the LG missed his block."
Three options were considered:
1. Skip OL entirely. Honest, but leaves a gap: every other position is graded.
2. Faked individual grades. Distribute team OL outcomes across the 5 starters by snap share. This would invent attribution from nothing and is methodologically dishonest.
3. Grade the OL as a UNIT per (team_id, season). Honest about what the data supports. Matches how analysts and coaches actually discuss OL ("Eagles OL was elite in 2024", not "Lane Johnson was elite").
We chose option 3. The grading entity is the team-season offensive line.
Data Sources#
| Source | Columns used | Coverage |
|---|---|---|
| pbp | posteam, qb_dropback, sack, qb_hit, rush_attempt, rushing_yards, epa, penalty_type, penalty_team | 2018+ (limited by PFR rush availability for YBC) |
| pfr_advstats_rush | rushing_yards_before_contact summed by team | 2018+ |
Aggregated to one row per (team_id, season) in team_ol_stats (migration 0018).
Schema (new tables, parallel to player-grading tables)#
- team_ol_stats: raw counts per team-season (sacks/dropbacks/rushes/YBC/etc.)
- team_ol_components: per-component values, mirrors stat_components shape
- team_ol_grades: composite grade per team-season, mirrors season_grades shape
Kept entirely separate from season_grades / stat_components so that player-centric queries don't need to learn a team-OL exception.
Components (v1, 2026-05-14)#
| Component | Formula | Weight | Direction |
|---|---|---|---|
| ol_yards_before_contact_per_carry | yards_before_contact / rushes | +0.45 | higher = better |
| ol_pressure_proxy_per_dropback | (sacks_allowed + qb_hits_allowed) / dropbacks | −0.45 | lower = better |
Sum |weights| = 0.90. Symmetric 50/50 run-block / pass-block split. Both metrics had nearly identical YoY reliability in the audit (+0.42 each).
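A minimal sketch of the two team-level components and how the negative weight turns a low pressure rate into a positive contribution. The formulas and weights are from the table above; the row values, function names, and the divide-by-sum-of-absolute-weights normalization are assumptions.

```python
# Sketch: the two OL v1 components from one team_ol_stats row, and a composite
# over z-scores. ASSUMPTION: normalization divides by the sum of |weights|.

OL_V1_WEIGHTS = {
    "ol_yards_before_contact_per_carry": +0.45,
    "ol_pressure_proxy_per_dropback": -0.45,
}


def ol_components(row: dict) -> dict:
    """Per-play rates for one (team_id, season) row."""
    return {
        "ol_yards_before_contact_per_carry": row["yards_before_contact"] / row["rushes"],
        "ol_pressure_proxy_per_dropback": (
            (row["sacks_allowed"] + row["qb_hits_allowed"]) / row["dropbacks"]
        ),
    }


def ol_composite(z_scores: dict) -> float:
    """Weighted z-score sum; the negative weight rewards low pressure rates."""
    total = sum(abs(w) for w in OL_V1_WEIGHTS.values())
    return sum(w * z_scores[name] for name, w in OL_V1_WEIGHTS.items()) / total


# Hypothetical team-season.
row = {"rushes": 500, "yards_before_contact": 1250,
       "sacks_allowed": 40, "qb_hits_allowed": 80, "dropbacks": 600}
print(ol_components(row))

# A line one SD above average in YBC and one SD below average in pressure allowed:
print(ol_composite({"ol_yards_before_contact_per_carry": 1.0,
                    "ol_pressure_proxy_per_dropback": -1.0}))
```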
Qualification#
Every team that played a season is graded (32/season × 8 seasons = 256 team-seasons). No qualification threshold — all teams have full-season volume on both denominators (rushes ≈ 400-550, dropbacks ≈ 500-700).
Confidence is fixed at 1.0 for the same reason.
Shrinkage k Values#
| Component | k | Rationale |
|---|---|---|
| ol_yards_before_contact_per_carry | 30 carries | Light: every team has 400+ carries |
| ol_pressure_proxy_per_dropback | 40 dropbacks | Light: every team has 500+ dropbacks |
Shrinkage is light because the per-team-season sample is large; we only want to bound noise from outlier early-season behavior, not pull strongly toward the mean.
Design Rationale#
Yards Before Contact per carry (run-block, +0.45)#
YBC isolates OL skill from RB skill. After-contact yards belong to the RB (breaking tackles, falling forward, second-effort yardage); before-contact yards belong to the OL (creating lanes, sustaining blocks long enough for the RB to reach the second level).
The audit returned YoY r = +0.424 — best of any candidate tested. The cleanest pure-OL run-block signal in the dataset.
Pressure proxy per dropback (pass-block, −0.45)#
(sacks + qb_hits) / dropbacks — the broadest pass-block damage signal we can compute from nflverse. Captures both extremes:
- Sacks: catastrophic — 7-8 yard losses, sometimes turnovers
- QB hits: meaningful contact even when the QB gets the throw off, often forces a subsequent miss or injury
The audit verified that standalone sacks_allowed_per_dropback and qb_hits_allowed_per_dropback are subsumed by this combined metric (max_r ≈ 0.86–0.96 with pressure_proxy). Using the combined version captures more of the OL's pass-block performance without double-counting.
We don't have full pressures (sacks + hits + hurries) because nflverse pbp doesn't track hurries. PFR has per-defender pressure totals but mapping them back to "pressures allowed by team X" requires a join we deferred for v2.
50/50 split#
Both signals had nearly identical YoY (+0.42). Neither has a clear dominance argument. A 60/40 lean toward pass-block (modern NFL is pass-heavy) was considered but ultimately rejected as arbitrary without external evidence.
Documentation note: PHI #5 in 2024#
Eagles OL is widely considered elite, but our formula puts them at #5 in 2024 with a 22.43% pressure rate (above league average for top-tier OLs). This is QB-dependent: Jalen Hurts holds the ball longer and takes hits while extending plays, inflating the pressure_proxy for what is otherwise a strong OL. This is a known limitation of using pbp pressure data — QB style mixes into the OL signal. Documented; not fixable without per-player blame attribution.
Validity Gate — Intentionally Skipped#
Per the locked plan and user decision: there is no "All-Pro OL unit" award. The closest proxy — counting next-year individual Pro Bowl OL per team — is too noisy to use as a hard gate (some Pro Bowls go to bad-unit veterans on reputation, e.g., Trent Williams on weaker 49ers lines).
We document the team-Pro-Bowl-OL-count as a possible future validity proxy but do not use it for v1 ship decisions. The audit relied on three criteria only: reliability (YoY), cross-sectional discrimination, and independence (max_r vs other candidates).
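Two of the three criteria, YoY reliability and independence (max_r), can be sketched as plain Pearson correlations. That reading of the audit is an assumption, the function names are illustrative, and every number below is made up for the example.

```python
# Sketch of YoY reliability and independence (max_r) as Pearson correlations.
# ASSUMPTION: the audit uses plain Pearson r. All data below is invented.
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def yoy_r(year_n: list, year_n1: list) -> float:
    """Correlation of a metric with itself one season later (same teams, same order)."""
    return pearson(year_n, year_n1)


def max_r(candidate: list, others: dict) -> tuple:
    """Name and r of the metric most correlated (by |r|) with the candidate."""
    name = max(others, key=lambda n: abs(pearson(candidate, others[n])))
    return name, pearson(candidate, others[name])


# Six hypothetical team-seasons.
pressure_proxy = [0.18, 0.22, 0.25, 0.20, 0.28, 0.16]
pressure_next = [0.19, 0.21, 0.23, 0.22, 0.26, 0.18]
sacks_allowed = [0.05, 0.07, 0.08, 0.06, 0.10, 0.04]
ybc_per_carry = [2.9, 2.1, 2.6, 2.2, 2.4, 2.5]

print(round(yoy_r(pressure_proxy, pressure_next), 3))
name, r = max_r(sacks_allowed, {"pressure_proxy": pressure_proxy, "ybc": ybc_per_carry})
print(name, round(r, 3))
```

A candidate whose max_r against an already-chosen metric is very high adds little new information, which is the sense in which the standalone sack and hit rates are treated as subsumed.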
Rejected Candidates (audit log)#
13 candidates tested. Results in docs/grading/audits/2026-05-14-exhaustive-ol.md.
Subsumed by the chosen pair:
- sacks_allowed_per_dropback: max_r +0.863 with pressure_proxy
- qb_hits_allowed_per_dropback: max_r +0.957 with pressure_proxy
- sack_per_contact: max_r +0.620 with sacks_allowed
- rush_yards_per_carry: max_r +0.825 with rush_explosive_rate; mixes OL with RB after-contact
- rush_epa_per_carry: max_r +0.839 with rush_success_rate; mixes OL with RB and scheme
- rush_success_rate: max_r +0.839 with rush_epa
- rush_explosive_rate: max_r +0.825 with rush_yards
- rush_stuff_rate: independent (max_r −0.548) but YoY +0.219 (weak)
Failed YoY (noise):
- false_start_rate: YoY +0.129 (below 0.20 threshold)
- holding_rate: YoY +0.177
- ol_penalty_rate: YoY +0.168
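As a minimal sketch of the reliability gate, the snippet below applies the 0.20 YoY floor to the values quoted in this audit log. The names `YOY_FLOOR` and `candidate_yoy` are illustrative and are not taken from the pipeline; the real audit may apply the gate differently.

```python
# Illustrative only: YoY values copied from the audit log above.
YOY_FLOOR = 0.20  # noise threshold quoted in this ADR

candidate_yoy = {
    "false_start_rate": 0.129,
    "holding_rate": 0.177,
    "ol_penalty_rate": 0.168,
    "rush_stuff_rate": 0.219,
}

# A candidate fails the reliability gate when its YoY is below the floor.
failed = sorted(name for name, yoy in candidate_yoy.items() if yoy < YOY_FLOOR)
passed = sorted(name for name, yoy in candidate_yoy.items() if yoy >= YOY_FLOOR)

print("failed:", failed)
print("passed:", passed)
```

Under this reading, rush_stuff_rate clears the floor by a small margin, and all three penalty rates fall below it.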
The penalty exclusion deserves explicit defense. False starts and holding ARE the OL — those are literally OL players committing penalties, conceptually owned by the unit. We considered including them at small weight on definitional grounds. We rejected this because:
- The audit YoY is decisively below the noise threshold (0.13–0.18 vs 0.20 floor).
- We made the same "include despite weak signal on conceptual grounds" mistake with P v1 blocked_rate and reversed it within hours when the user pointed out that low audit signal means the metric isn't measuring what we think it's measuring.
- Penalty rates likely reflect roster turnover at OL positions year-to-year — not unit-level skill that persists.
If a v2 audit shows penalty signal with a smaller cohort or different bucketing, we can revisit.
Alternatives Considered#
Single component (parallel to K v1.1's FGOE-only formula): Considered using just pressure_proxy or just YBC. Rejected because pass-block and run-block are conceptually distinct skills and both passed the audit at equivalent strength. A 1-component formula would force ignoring a real signal.
Three components with stuff_rate (+0.40 / +0.40 / -0.10): Considered. Stuff rate is independent (audit max_r −0.548) and is a real concept ("got blown up at the line"). Rejected because YoY +0.219 is just barely above the noise floor and adding a third component for marginal signal violates parsimony.
60/40 pass-heavy split: pressure_proxy −0.55, YBC +0.35. Reflects modern NFL where pass blocking matters more. Rejected as arbitrary — both signals had equal audit strength, and we didn't want to bake in an editorial preference without data support.
Per-player OL grading (synthetic blame attribution): See "Why team-level" above. Rejected as methodologically dishonest given nflverse data limitations.
Reusing season_grades with synthetic player_id = team_id: Considered for backwards compatibility. Rejected because every player-centric query (player profile, name resolution, snap-counts join) would need a team-OL exception. Cleaner to have separate tables.
Known Limitations#
No QB pocket-time isolation. Pressure proxy mixes OL skill with QB style (Hurts vs Goff face dramatically different "pressure" rates for the same OL quality).
No hurry data. Full pressure (sacks + hits + hurries) would be richer than our (sacks + hits). PFR has it per-defender; mapping that to team-allowed totals requires a join we deferred.
No scheme adjustment. Wide-zone teams generate more YBC than gap-scheme teams at equal OL talent. Our metric doesn't normalize.
No injury context. A team that lost its starting LT and RG mid-season isn't the same OL it started with. We grade the team-season aggregate as if it were one unit.
No All-Pro OL unit award means no validity gate. This is by design (see above) but means OL is the only graded position without a Pro Bowl validity check.
Consequences#
- OL grades available 2018+ (PFR rush limit).
- Pipeline requires: team_ol_stats table (migration 0018), pbp ingest (already running), pfr_advstats_rush (already running for RB v1.4).
- Web: OL appears as a position tab between TE and CB in the UX. Backend: separate table; frontend: shows up alongside players.
- Player profile pages are unchanged — OL data lives in different tables and players don't have OL grades attached.
Future Work#
v2 candidates:
- True pressure rate (sacks + hits + hurries / dropbacks) by joining PFR per-defender data back to team-opponent.
- Pocket-time-adjusted pressure rate (subtract expected pressure given QB time-to-throw).
- Scheme-adjusted YBC (rate vs expected given personnel and box count).
- Pro Bowl OL count as a sanity-check (not gate).
- Pass-block / run-block grade split if user wants two grades surfaced separately.
Any v2 add will go through the same three-criterion audit before shipping.
ADR-0026 — Team v1 Grading Formula
- Status: Accepted (v1 design, 2026-05-25, pre-implementation)
- Date: 2026-05-25
Context#
With all 12 individual positions graded (ADR-0013 through ADR-0025), the natural next layer is team grades. Fans, analysts, and the methodology article itself benefit from a single number per team per season — and from the Offense / Defense / Special Teams split that breaks the number into its meaningful parts.
This ADR is structurally a sibling to ADR-0025 (OL v1) — both are team-level grades. But where OL grades a team-unit from raw pbp data, team grades aggregate the existing player grades. The aggregation methodology itself is what this ADR locks down.
Why aggregate player grades (not compute fresh from team stats)#
Four approaches were considered:
- Pure player aggregation. Composite of the per-position grades we already produce. Cleanest because every team grade reduces to "the players on this team." Can't capture pure team-level effects (scheme, coaching).
- Fresh team-level audit. Build a new 13th audit from team-aggregated pbp. Captures team-as-system but duplicates the work, and the result doesn't connect to the audited player grades.
- Hybrid (players + team adjusters). Player grades plus a few team-only signals (turnover margin, hidden ST yardage). Best ceiling but hardest to defend in v1.
- PFF-style phase grades on top of player grades. Three sub-grades (Offense, Defense, ST) built from the relevant position grades, plus an Overall composite.
We chose option 4. It uses the audited player grades as the foundation, produces three sub-grades that are themselves the most useful presentation, and matches how fans and analysts already think about teams. The hybrid layer is a documented v2.
Data Sources#
| Source | Used for |
|---|---|
| season_grades (existing) | Per-position grades that feed each phase |
| player_seasons.snaps_offense / snaps_defense / snaps_st | Snap-weighting within a position |
| team_ol_grades (existing, ADR-0025) | OL unit grade; already team-level, no snap-weighting needed |
No new ingest. All inputs come from already-populated tables.
Two-Stage Aggregation#
Stage 1 — within a position#
For each (team, season, position), compute a snap-weighted average of every player who logged snaps at that position on the team:
position_team_grade(p, team) =
Σ(player.composite_grade × player.snaps_at_position)
/ Σ(player.snaps_at_position)
- Below-qualification players still count — their grade exists, their snaps are real, and excluding them would distort the actual team output.
- An injured starter and his replacement average proportionally — which is correct, because that is what the team got from the position.
- A 95%-snap starter dominates; a 20%-snap backup is a rounding error.
OL is exempt — team_ol_grades.composite_grade is already a single
team-season number from ADR-0025; no aggregation needed.
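The Stage 1 formula can be sketched as follows. This is a minimal illustration rather than the pipeline's code: the row shape `(team, position, composite_grade, snaps_at_position)`, the function name, and the team code `AAA` are all assumptions made for the example.

```python
from collections import defaultdict


def position_team_grades(rows):
    """Snap-weighted average grade per (team, position).

    rows: iterable of (team, position, composite_grade, snaps_at_position).
    This tuple layout is hypothetical, not the real table shape.
    """
    weighted_sum = defaultdict(float)
    snap_total = defaultdict(int)
    for team, position, grade, snaps in rows:
        if snaps <= 0:
            continue  # no snaps at the position: stays out of the denominator
        weighted_sum[(team, position)] += grade * snaps
        snap_total[(team, position)] += snaps
    return {key: weighted_sum[key] / snap_total[key] for key in weighted_sum}


rows = [
    ("AAA", "QB", 80.0, 950),  # starter dominates the average
    ("AAA", "QB", 40.0, 50),   # backup barely moves it
    ("AAA", "RB", 60.0, 0),    # zero snaps: ignored entirely
]
grades = position_team_grades(rows)
print(grades[("AAA", "QB")])  # 78.0
```

Because the average is keyed by team, a player traded mid-season would simply appear as one row per team, each with the snaps logged there.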
Stage 2 — across positions in a phase#
Position-weighted sum of the per-position team grades:
phase_grade(team) = Σ position_weight(p) × position_team_grade(p, team)
Position weights below codify "QB matters more than RB."
Overall composite#
overall_grade(team) =
w_off × offense_grade
+ w_def × defense_grade
+ w_st × st_grade
Position Weights (v1.0)#
Weights were derived empirically — see docs/grading/audits/2026-05-25-team-weights.md for the full audit (ridge regression of team success vs. snap-weighted per-position team grades, anchored by salary cap allocation as a market signal). Values below are the reconciliation of the two anchors plus the original priors; reasoning per row.
Offense (sums to 1.00)#
| Position | Weight | Reasoning |
|---|---|---|
| QB | 0.45 | Regression coefficient 0.61, univariate r=0.74 — the single dominant signal. Trimmed from full regression value because some of QB's apparent weight is multicollinear with WR. |
| OL | 0.25 | Regression supports 0.23, cap allocation supports higher; held at prior. 5-player unit affecting every play. |
| WR | 0.13 | Drops to ~0 in multivariate regression (multicollinear with QB) but univariate r=0.52 — WR genuinely matters, the regression just can't separate it from QB. Held meaningful. |
| RB | 0.09 | Devalued in modern offensive analytics; cap allocation agrees (~5%). |
| TE | 0.08 | Variable role; starter matters but ceiling lower than QB/WR. |
Defense (sums to 1.00)#
| Position | Weight | Reasoning |
|---|---|---|
| EDGE | 0.24 | Pass rush is the "QB of defense." Regression + cap both around 0.25. |
| CB | 0.24 | Coverage on the ball. Highest univariate r in defense (0.39). |
| LB | 0.22 | Regression supports a slight bump from prior. Front-7 anchor + nickel coverage. |
| S | 0.20 | Regression suggested raising the 0.15 prior to 0.23; landed at 0.20. Last-line value is real. |
| iDL | 0.10 | Regression coefficient 0.01, univariate r=0.12 — weakest of any position. Reduced from prior 0.15 while keeping non-trivial weight (the iDL formula itself may under-capture interior pressure). |
Special Teams (sums to 1.00)#
| Position | Weight | Reasoning |
|---|---|---|
| K | 0.52 | Slight edge over punter on regression and cap. |
| P | 0.48 | Remainder of the phase; slightly behind kicker on regression and cap. |
Return units intentionally omitted — public data on KR/PR is too noisy to grade. Future v2 may add hidden-yardage adjustments.
Phase weights (sums to 1.00)#
| Phase | Weight |
|---|---|
| Offense | 0.55 |
| Defense | 0.40 |
| ST | 0.05 |
Derived empirically (same audit as the position weights — see audit doc, v1.1 section). Phase-level regression of team success on offense/defense/st phase grades fits at R² = 0.79 (vs point diff) and 0.69 (vs closing spread). Regression said 0.58–0.64 / 0.34–0.36 / 0.02–0.06; reconciled toward 0.55 / 0.40 / 0.05.
Offense is meaningfully heavier than defense — modern NFL is offense-tilted in what moves team outcomes. ST is the small slice, closer to its salary-cap weight (~2%) than to the original 0.10 prior.
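To make Stage 2 and the overall composite concrete, here is a hedged sketch that plugs the v1.0 weights from the tables above into the two formulas. The function names and the sample position grades are made up for illustration, and the sketch assumes the inputs are already on a common scale (this ADR does not say whether the weighted sums are taken on z-scores or on 0-100 grades).

```python
# Weights copied from the v1.0 tables in this ADR.
OFFENSE = {"QB": 0.45, "OL": 0.25, "WR": 0.13, "RB": 0.09, "TE": 0.08}
DEFENSE = {"EDGE": 0.24, "CB": 0.24, "LB": 0.22, "S": 0.20, "iDL": 0.10}
SPECIAL_TEAMS = {"K": 0.52, "P": 0.48}
PHASES = {"offense": 0.55, "defense": 0.40, "st": 0.05}


def phase_grade(weights, position_grades):
    """Position-weighted sum of the per-position team grades."""
    return sum(w * position_grades[pos] for pos, w in weights.items())


def overall_grade(offense, defense, st):
    """Phase-weighted composite of the three phase grades."""
    return (PHASES["offense"] * offense
            + PHASES["defense"] * defense
            + PHASES["st"] * st)


# Hypothetical team: strong QB and OL, average defense, mixed special teams.
offense = phase_grade(OFFENSE, {"QB": 80, "OL": 70, "WR": 60, "RB": 50, "TE": 50})
defense = phase_grade(DEFENSE, {p: 50 for p in DEFENSE})
st = phase_grade(SPECIAL_TEAMS, {"K": 60, "P": 40})
overall = overall_grade(offense, defense, st)
print(round(offense, 2), round(defense, 2), round(st, 2), round(overall, 2))
```

Since every weight table sums to 1.00, a team that is exactly average at every position comes out exactly average in each phase and overall.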
Edge Cases & Rules#
Multi-team players#
Snaps are attributed to the team where they were logged. A player traded mid-season contributes to each team in proportion to that team's snap share — the per-team aggregation naturally handles this via the snap-weighted average.
Players with no snaps at a position#
Excluded from that position's denominator. A player who logged 0 offensive snaps doesn't pull the offense grade.
Teams with a position gap (rare)#
If a team has zero graded players at a position (extreme injury wipeout,
or a team somehow with no qualifying kicker), the missing position's
weight is redistributed proportionally to the other positions in the
phase. Record the skipped position in the row's data_tier_reason field.
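One way to read "redistributed proportionally" is to drop the missing positions and rescale the remaining weights so they sum to 1 again. The sketch below shows that interpretation; the helper name `redistribute` is hypothetical, and how the skipped positions are written into data_tier_reason is left to the caller.

```python
def redistribute(weights, available):
    """Rescale a phase's weights over the positions that have a grade.

    weights: {position: weight} for one phase.
    available: set of positions with at least one graded player.
    Returns (rescaled_weights, skipped_positions).
    """
    present = {pos: w for pos, w in weights.items() if pos in available}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no graded positions in this phase")
    skipped = sorted(set(weights) - set(present))
    return {pos: w / total for pos, w in present.items()}, skipped


# Special teams with no graded kicker: the punter carries the whole phase.
st_weights, st_skipped = redistribute({"K": 0.52, "P": 0.48}, {"P"})

# Offense with no graded TE: the other four positions keep their ratios.
off_weights, off_skipped = redistribute(
    {"QB": 0.45, "OL": 0.25, "WR": 0.13, "RB": 0.09, "TE": 0.08},
    {"QB", "OL", "WR", "RB"},
)
print(st_weights, st_skipped)
print(off_weights, off_skipped)
```

Rescaling preserves the relative importance of the remaining positions, so QB still carries 0.45 / 0.25 times the weight of OL after a TE gap.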
Below-qualification players#
Their grade row exists (qualified=false), their snaps are real, and
they participate in the snap-weighted average. Excluding them would
distort what the team actually got from the position group.
Schema (new tables)#
CREATE TABLE team_grades (
  team_id INT NOT NULL,
  season INT NOT NULL,
  overall_grade REAL NOT NULL,
  offense_grade REAL NOT NULL,
  defense_grade REAL NOT NULL,
  st_grade REAL NOT NULL,
  overall_percentile REAL,
  offense_percentile REAL,
  defense_percentile REAL,
  st_percentile REAL,
  data_tier_reason TEXT,
  created_at TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (team_id, season)
);
CREATE TABLE team_grade_components (
  team_id INT NOT NULL,
  season INT NOT NULL,
  phase TEXT NOT NULL, -- 'offense' | 'defense' | 'st'
  position TEXT NOT NULL, -- 'QB', 'RB', ..., 'OL', 'K', 'P'
  position_grade REAL NOT NULL, -- snap-weighted aggregate
  weight REAL NOT NULL, -- position weight applied
  n_players INT NOT NULL, -- distinct graded players that contributed
  total_snaps INT NOT NULL, -- denominator of the snap-weighted avg
  PRIMARY KEY (team_id, season, phase, position)
);
Kept separate from season_grades and from team_ol_* (which stays
focused on the OL unit specifically). Both consume from the same source
data and write to their own table family.
Qualification#
Every team that played a season is graded. 32 teams × 10 seasons (2016-2025) = 320 team-season rows. No threshold gate.
Composite → 0-100#
Same sigmoid as player grades (ADR-0008): grade = 100 / (1 + exp(-k(z - z0)))
with k=1.15, z0=0. Calibrated against the league-average team-season,
which means a 50 is a perfectly average team and a 90+ is an elite team
the way a 90+ QB is an elite QB.
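The mapping itself is a plain logistic curve. The sketch below implements the formula as written, with k=1.15 and z0=0; the function name `to_grade` is illustrative, and how z is standardized against the league-average team-season is outside this snippet.

```python
import math


def to_grade(z, k=1.15, z0=0.0):
    """Map a standardized composite z onto the 0-100 grade scale."""
    return 100.0 / (1.0 + math.exp(-k * (z - z0)))


print(to_grade(0.0))            # 50.0: a perfectly average team
print(round(to_grade(2.0), 1))  # about 90.9: two SDs above average
print(round(to_grade(-2.0), 1))
```

With z0=0 the curve is symmetric around 50, so a team z standard deviations below average grades exactly as far under 50 as its mirror image grades above it.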
Validation Plan#
Face-check (mandatory before ship)#
For 2024 and 2023 seasons, the top 5 and bottom 5 by overall grade must match the recognized contenders / cellar dwellers of those seasons. If they don't, the position weights are off — adjust and re-check before shipping.
Validity ground truth#
Same framework as positions (validity = correlation with an external truth), adapted for teams:
| Source | What it is |
|---|---|
| Vegas closing line | Closing point spread / total — market's best estimate of team strength. Available via historical odds archives. Recommended primary signal. |
| Point differential | Final regular-season scoring margin. Noisier but principled. |
| Playoff appearance / final W-L | Binary or near-binary; small validity ceiling. |
nflgrades validity --entity team would compute Pearson correlation
between the team overall grade and one of these. Target: r ≥ +0.50
against closing line (much higher than per-player validity ceilings,
since closing lines are more informative than Pro Bowl votes).
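A minimal sketch of the validity check, assuming each team-season has one overall grade and one market-strength number derived from closing lines, oriented so that higher means stronger. The helper `pearson_r` and the sample values are made up for illustration; they are not output from nflgrades.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with 2+ points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers: overall grades vs. a market-implied strength rating.
overall_grades = [90.0, 75.0, 60.0, 45.0, 20.0]
market_strength = [6.5, 3.0, 1.0, -2.0, -7.0]

r = pearson_r(overall_grades, market_strength)
meets_target = r >= 0.50  # validity target stated in this ADR
print(round(r, 3), meets_target)
```

Note that a raw point spread is negative for favorites, so it would need its sign flipped (or be converted to an implied rating) before being correlated this way.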
YoY reliability#
Compute YoY correlation of overall grade. Targets:
- Strong: r ≥ 0.50 (reflects roster + scheme continuity)
- Acceptable: 0.30 ≤ r < 0.50
- Concerning: r < 0.30 (suggests the methodology is too noise-dominated)
v1.0 face-check (2024 season, shipped 2026-05-25)#
| Rank | Team | Overall | Reality check |
|---|---|---|---|
| 1 | BAL | 95.2 | ✅ Top-2 offense, strong defense |
| 2 | DET | 90.7 | ✅ 15-2 #1 NFC seed |
| 3 | PHI | 85.0 | ✅ Super Bowl winners |
| 4 | GB | 73.7 | ✅ 11-6 playoff team |
| 5 | HOU | 72.0 | ✅ 10-7 playoff team, strong defense |
| 6 | BUF | 70.9 | ⚠️ Should arguably be top-3 (13-4, MVP Allen) |
| 7 | ARI | 69.7 | ⚠️ 8-9, missed playoffs (high) |
| 8 | SF | 67.7 | ❌ 6-11 season due to injuries (clearly too high) |
| 9 | TB | 64.8 | ✅ NFC South champs |
| 10 | KC | 64.5 | ⚠️ 15-2 Super Bowl runner-up (low — see below) |
Bottom 5: CAR (13.6), TEN (15.7), LV (18.5), CLE (18.8), NE (19.9) — all clear face-check ✅.
The KC and SF outliers reveal a v1.0 methodology characteristic worth naming: the grade measures per-snap player quality, snap-weighted across each team's available players. It does NOT measure full-season team output.
Implications:
- Teams with significant injury attrition (SF 2024) tend to grade HIGHER than their record because the grade rewards elite-when-healthy players proportionally to the snaps they played, not the games they missed. SF's Kittle/Bosa/Purdy snaps look elite; the system gives them credit for those snaps and is silent about the games those players missed.
- Teams that outperform their efficiency stats (KC consistently) tend to grade LOWER than their record. KC's per-snap efficiency was middling in 2024 (Mahomes had a down year by his standards); their wins came from clutch close-game play that doesn't show in per-play grades.
Documented as v1.0 limitation rather than fixed in v1.0 — fixing either would require introducing "team continuity" or "clutch performance" features, both substantial methodology changes. v2 candidate work.
Otherwise the face-check passes:
- Top 3 matches consensus
- Bottom 5 all clearly bad teams
- Mid-pack ordering is plausible with the two named exceptions
Design Rationale#
- Why snap-weighted within position, not starter-only. Snap-weighting is principled and handles injured-starter cases naturally. Starter-only requires defining "starter" (highest snap count? top of depth chart?) and creates a cliff effect.
- Why position weights at all (not equal weighting). QB matters more than RB in real football. Equal weighting would underrate QB-driven teams and overrate teams whose strength is a deep skill group.
- Why ST is only 0.05. Kicker/punter swings are real but small. Overweighting ST would let an elite kicker drag a roster-bad team into the middle of the league. The original prior was 0.10; the phase-level regression (0.02–0.06) and ST's salary-cap share (~2%) pulled it down to 0.05.
- Why no shrinkage step. Aggregation is over already-shrunk player grades. Adding a second shrinkage layer would double-correct.
- Why Offense (0.55) outweighs Defense (0.40). The original prior was an even 0.45 / 0.45 split, chosen as the safest starting point pending an audit. The phase-level regression put offense at 0.58–0.64 and defense at 0.34–0.36, so v1.0 reconciles toward 0.55 / 0.40: the modern NFL is offense-tilted in what moves team outcomes.
Consequences#
Easier:
- Single number per team that summarizes the audited player work.
- Three sub-grades (Off/Def/ST) that are individually informative.
- Schema is small (2 tables) and decoupled from existing grading tables.
- No new ingest: uses what's already in season_grades and team_ol_grades.
Harder:
- Position weights are choices that have to be defended. Each weight is a methodological commitment; tuning them is a v1.1 audit.
- A team grade is only as good as the position grades feeding it. Bugs in position grading propagate.
- Closing-line data isn't already in the repo. Validation requires pulling it (one-time scrape; small CSV).
Subject to revision:
- Position weights themselves (v1.1 — audit after face-check + validity).
- The ST phase weight (0.05 in v1.0; the pre-audit question was 0.10 vs 0.15, before the regression lowered it).
- Hybrid layer adding turnover margin / hidden ST yardage (v2).
Revision History#
- v1.0 (2026-05-25): Initial design. Snap-weighted within position +
position-weighted across positions in phase + phase-weighted into
Overall. Position weights derived empirically (regression + cap
allocation) per audit doc 2026-05-25-team-weights.md.
Major findings vs. original gut-feel priors:
- QB bumped 0.40 → 0.45 (regression said 0.61 but partly multicollinear with WR)
- iDL trimmed 0.15 → 0.10 (regression strongly supported a reduction)
- S bumped 0.15 → 0.20 (regression supported)
- Phase weights moved from prior 0.45/0.45/0.10 → 0.55/0.40/0.05 based on a second-stage regression of team success on phase grades (R² = 0.79). Offense is meaningfully heavier than defense in modern NFL; ST is closer to its cap weight (~2%) than to the original 0.10.
- Other moves all within ±0.03 of priors