live

NBA Correlation Analysis Pipeline

End-to-end pipeline from web scraping to context-stratified correlation and DVP analysis across 27,990 player-game records

$ apt list --installed

Python, pandas, scipy, BeautifulSoup, ML
metrics.sh
$ metrics --project=nba-correlation-pipeline
> Records Processed: 27,990
> Correlations Calculated: 89,181
> Teams Scraped: All 30 NBA teams
> Methodology: Context-stratified + DVP

End-to-end pipeline that scrapes, cleans, and analyzes NBA player-game data — computing 89,181 context-stratified correlations across 27,990 records and mapping Defense vs. Position matchup advantages to surface insights that flat box-score analysis misses.

What It Does

Ingests raw game logs from Basketball Reference, normalizes inconsistent formatting across seasons and teams, then runs stratified correlation analysis across game conditions like pace, rest days, home/away splits, and opponent defensive rating. A dedicated DVP module maps each team's defensive vulnerability by position, identifying which defenses give up value to which player archetypes. The output is actionable correlation signals grouped by context — not just "Player X's points correlate with rebounds" but "that correlation flips in slow-paced road games against top-10 defenses, and this defense ranks bottom-5 against point guards."

Data Ingestion Layer

  • Built a custom web scraper for Basketball Reference covering all 30 NBA teams and full historical seasons
  • Respects rate limits (20 req/min with randomized delays) and anti-scraping measures by sending proper User-Agent headers and reusing sessions
  • Parses inconsistent HTML table structures across different eras with BeautifulSoup, and retries failed requests gracefully
  • Outputs type-safe data via Python dataclasses (TeamInfo, CoachRecord, GameLog) into clean DataFrames
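
The throttling and typed-output steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Throttle` class, the `parse_game_row` helper, the one-second jitter, and every field and key name on `GameLog` are assumptions chosen for the example. Only the 20 req/min limit and the `GameLog` dataclass name come from the description above.

```python
import random
import time
from dataclasses import dataclass
from datetime import date


class Throttle:
    """Spaces out calls so the request rate stays at or under max_per_minute."""

    def __init__(self, max_per_minute=20, jitter=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 3.0 s at 20 req/min
        self.jitter = jitter  # upper bound on extra random delay (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep until min_interval plus a random jitter has passed since the last call."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            target = self.min_interval + random.uniform(0.0, self.jitter)
            delay = max(0.0, target - elapsed)
            if delay > 0.0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


@dataclass(frozen=True)
class GameLog:
    # Field names are hypothetical; the source only names the dataclass itself.
    player: str
    game_date: date
    team: str
    opponent: str
    is_home: bool
    points: int
    rebounds: int
    assists: int


def parse_game_row(row: dict) -> GameLog:
    """Convert one scraped row of raw strings into a typed GameLog."""
    return GameLog(
        player=row["player"].strip(),
        game_date=date.fromisoformat(row["date"]),
        team=row["team"].upper(),
        opponent=row["opp"].upper(),
        is_home=row.get("location", "") != "@",  # "@" marks a road game here
        points=int(row["pts"] or 0),             # empty cell is treated as 0
        rebounds=int(row["trb"] or 0),
        assists=int(row["ast"] or 0),
    )
```

Calling `throttle.wait()` before each request keeps consecutive requests at least three seconds apart. A list of `GameLog` objects can then be turned into a DataFrame with `pandas.DataFrame([dataclasses.asdict(g) for g in logs])`.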

Correlation Engine

  • Computes Pearson and Spearman correlations at scale using scipy, layered on optimized pandas DataFrames
  • Stratifies by contextual factors: game pace tiers, rest days, matchup quality, home/away — revealing how correlations shift under different conditions
  • Every correlation is reported with its p-value and confidence interval for statistical rigor
  • Pace analysis module segments games into pace tiers, showing how player stat relationships change in fast vs. slow contests
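
One way to picture the stratified computation described above is the sketch below. It is an illustration under assumptions, not the pipeline's implementation: the function names, the column names, the three equal-sized pace tiers, the minimum sample size, and the choice of a Fisher z-transform for the confidence interval are all hypothetical. The source only states that Pearson and Spearman correlations, p-values, and confidence intervals are computed per context.

```python
import numpy as np
import pandas as pd
from scipy import stats


def add_pace_tier(df, pace_col="pace"):
    """Label each game slow / medium / fast by pace tercile (assumed tiering)."""
    out = df.copy()
    out["pace_tier"] = pd.qcut(out[pace_col], q=3,
                               labels=["slow", "medium", "fast"])
    return out


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    if n <= 3 or abs(r) >= 1.0:
        return (np.nan, np.nan)  # interval is undefined in these cases
    z = np.arctanh(r)
    half_width = stats.norm.ppf(0.5 + confidence / 2.0) / np.sqrt(n - 3)
    return (float(np.tanh(z - half_width)), float(np.tanh(z + half_width)))


def stratified_correlations(df, x, y, strata, min_games=10):
    """Pearson and Spearman correlation of x vs. y within each stratum.

    `strata` is a list of column names, e.g. ["pace_tier", "venue"].
    """
    rows = []
    for key, group in df.groupby(strata, observed=True):
        pair = group[[x, y]].dropna()
        n = len(pair)
        if n < min_games:
            continue  # too few games for a stable estimate (assumed cutoff)
        pearson_r, pearson_p = stats.pearsonr(pair[x], pair[y])
        spearman_r, spearman_p = stats.spearmanr(pair[x], pair[y])
        ci_low, ci_high = fisher_ci(pearson_r, n)
        key = key if isinstance(key, tuple) else (key,)
        rows.append({
            **dict(zip(strata, key)),
            "n": n,
            "pearson_r": pearson_r, "pearson_p": pearson_p,
            "spearman_r": spearman_r, "spearman_p": spearman_p,
            "ci_low": ci_low, "ci_high": ci_high,
        })
    return pd.DataFrame(rows)
```

For example, `stratified_correlations(add_pace_tier(games), "pts", "ast", ["pace_tier", "venue"])` would return one row per pace tier and venue combination, which makes it easy to see where a correlation weakens or flips sign. With this many strata, the p-values would normally need a multiple-comparison correction before being read as significant.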

DVP (Defense vs. Position) Module

  • Maps each team's defensive vulnerability by position (PG, SG, SF, PF, C) across scoring, assists, rebounds, and peripheral stats
  • Identifies which defenses give up disproportionate value to specific archetypes: a defense that ranks bottom-5 against centers sends a different signal than its overall defensive rating suggests
  • DVP factors feed directly into the correlation engine, adding a matchup-aware layer on top of context-stratified analysis
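
A DVP table of the kind described above could be built along these lines. The source does not say how DVP is calculated, so the definition used here is an assumption: the average stat a defense allows to a position, divided by the league-wide average for that position. The function name and the `opponent`, `position`, and `points` column names are likewise hypothetical.

```python
import pandas as pd


def dvp_table(logs, stat="points"):
    """Per-defense, per-position allowance relative to the league average.

    dvp_factor > 1 means the defense allows more than average to that position.
    rank 1 is the most generous defense against that position.
    """
    # Average stat each defense allows to each position
    allowed = (
        logs.groupby(["opponent", "position"])[stat]
        .mean()
        .rename("allowed")
        .reset_index()
    )
    # League-wide average for each position across all player-games
    league = logs.groupby("position")[stat].mean().rename("league_avg")
    out = allowed.join(league, on="position")
    out["dvp_factor"] = out["allowed"] / out["league_avg"]
    out["rank"] = (
        out.groupby("position")["dvp_factor"]
        .rank(ascending=False, method="min")
        .astype(int)
    )
    return out
```

Under this definition, a "bottom-5 defense against point guards" is any team with `rank <= 5` in the `PG` rows. The factor or rank could be merged back onto each game log by opponent and position and then binned, so that it serves as one more stratum alongside pace, rest, and venue.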

Key Insight

Most sports analytics treats correlation as a single number. This pipeline treats it as a distribution that shifts with context and matchup. A player's points-assists correlation at home in high-pace games against a defense that's bottom-5 vs. their position is a fundamentally different signal than their season average — and that's where the edge lives.

$ git clone [private repository]