NBA Correlation Analysis Pipeline
End-to-end pipeline from web scraping to context-stratified correlation and DVP analysis across 27,990 player-game records
End-to-end pipeline that scrapes, cleans, and analyzes NBA player-game data — computing 89,181 context-stratified correlations across 27,990 records and mapping Defense vs. Position matchup advantages to surface insights that flat box-score analysis misses.
What It Does
Ingests raw game logs from Basketball Reference, normalizes inconsistent formatting across seasons and teams, then runs stratified correlation analysis across game conditions like pace, rest days, home/away splits, and opponent defensive rating. A dedicated DVP module maps each team's defensive vulnerability by position, identifying which defenses give up value to which player archetypes. The output is actionable correlation signals grouped by context — not just "Player X's points correlate with rebounds" but "that correlation flips in slow-paced road games against top-10 defenses, and this defense ranks bottom-5 against point guards."
Data Ingestion Layer
- Built a custom web scraper for Basketball Reference covering all 30 NBA teams and full historical seasons
- Handles rate limiting (20 req/min with randomized delays), anti-scraping measures, proper User-Agent headers, and session reuse
- Parses inconsistent HTML table structures across different eras using BeautifulSoup with graceful retry logic
- Outputs type-safe data via Python dataclasses (TeamInfo, CoachRecord, GameLog) into clean DataFrames
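The throttling and typed-output ideas above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `Throttle` class, the `parse_row` and `to_records` helpers, the jitter range, and the `GameLog` field names are all assumptions made for the example. Only the 20 req/min budget and the `GameLog` name come from the description above.

```python
import random
import time
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class GameLog:
    """One player-game row. Field names are illustrative, not the project's schema."""
    player: str
    team: str
    game_date: date
    points: int
    assists: int
    rebounds: int


class Throttle:
    """Spaces requests at least 60/max_per_minute seconds apart, plus random jitter."""

    def __init__(self, max_per_minute=20, max_jitter=1.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 20 req/min -> 3.0 s
        self.max_jitter = max_jitter               # assumed jitter ceiling, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last_start = None

    def wait(self):
        """Block until the next request may start, then record its start time."""
        if self._last_start is not None:
            elapsed = self._clock() - self._last_start
            delay = self.min_interval - elapsed + random.uniform(0.0, self.max_jitter)
            if delay > 0:
                self._sleep(delay)
        self._last_start = self._clock()


def parse_row(cells):
    """Convert raw scraped strings into a typed GameLog; raises ValueError on bad input."""
    return GameLog(
        player=cells["player"].strip(),
        team=cells["team"].strip().upper(),
        game_date=date.fromisoformat(cells["date"]),
        points=int(cells["pts"]),
        assists=int(cells["ast"]),
        rebounds=int(cells["trb"]),
    )


def to_records(logs):
    """Turn GameLog objects into plain dicts, a shape pandas.DataFrame accepts directly."""
    return [asdict(log) for log in logs]
```

In a scraper loop, `throttle.wait()` would be called immediately before each HTTP request. Injecting `clock` and `sleep` keeps the pacing logic testable without real delays.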
Correlation Engine
- Computes Pearson and Spearman correlations at scale using scipy, layered on optimized pandas DataFrames
- Stratifies by contextual factors (game pace tier, rest days, matchup quality, home/away) to reveal how correlations shift under different conditions
- Reports every correlation with its p-value and confidence interval for statistical rigor
- Pace analysis module segments games into pace tiers, showing how player stat relationships change in fast vs. slow contests
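A context-stratified correlation of this kind might look like the sketch below. It assumes a DataFrame with one row per player-game. The column names, the three-way quantile split for pace tiers, the `min_games` cutoff, and the use of the Fisher z-transform for the confidence interval are all choices made for this example, not details confirmed by the project.

```python
import math

import pandas as pd
from scipy import stats


def add_pace_tiers(df, pace_col="pace"):
    """Label each game slow/medium/fast by pace tertile (an assumed tiering rule)."""
    out = df.copy()
    out["pace_tier"] = pd.qcut(out[pace_col], q=3, labels=["slow", "medium", "fast"])
    return out


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for Pearson r via the Fisher z-transform."""
    if n <= 3 or abs(r) >= 1.0:
        return (float("nan"), float("nan"))
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))


def stratified_correlations(df, x, y, by, min_games=10):
    """Pearson and Spearman correlation of x vs. y within each context group."""
    rows = []
    for key, group in df.groupby(by, observed=True):
        pair = group[[x, y]].dropna()
        n = len(pair)
        if n < min_games:
            continue  # too few games for a meaningful estimate
        r, p_r = stats.pearsonr(pair[x], pair[y])
        rho, p_rho = stats.spearmanr(pair[x], pair[y])
        ci_low, ci_high = fisher_ci(r, n)
        rows.append({
            "context": key, "n": n,
            "pearson_r": r, "pearson_p": p_r,
            "ci_low": ci_low, "ci_high": ci_high,
            "spearman_rho": rho, "spearman_p": p_rho,
        })
    return pd.DataFrame(rows)
```

Passing a list such as `by=["pace_tier", "is_home"]` would stratify on several factors at once, with each `context` value becoming a tuple. With tens of thousands of correlations, a multiple-comparison correction on the p-values would be a sensible further step.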
DVP (Defense vs. Position) Module
- Maps each team's defensive vulnerability by position (PG, SG, SF, PF, C) across scoring, assists, rebounds, and peripheral stats
- Identifies which defenses give up disproportionate value to specific archetypes: a bottom-5 defense against centers is a different signal than their overall defensive rating suggests
- DVP factors feed directly into the correlation engine, adding a matchup-aware layer on top of context-stratified analysis
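One plausible way to compute a DVP factor is to compare what each defense allows to a position against the league average for that position. The sketch below uses that formulation as an assumption. The project's actual definition is not stated here, and the column names (`def_team`, `pos`, `pts`) and the `dvp_table` helper are hypothetical.

```python
import pandas as pd


def dvp_table(games, stat="pts"):
    """Per-team, per-position average of `stat` allowed, relative to the league average.

    `games` holds one row per opposing player-game, with the defending team in
    `def_team` and the opposing player's position in `pos`.
    """
    allowed = (
        games.groupby(["def_team", "pos"], as_index=False)[stat]
        .mean()
        .rename(columns={stat: "allowed"})
    )
    # League baseline per position: unweighted mean of the team averages.
    league_avg = allowed.groupby("pos")["allowed"].transform("mean")
    allowed["dvp_factor"] = allowed["allowed"] / league_avg
    # Rank 1 = the defense that gives up the most to that position.
    allowed["rank_worst"] = (
        allowed.groupby("pos")["dvp_factor"]
        .rank(ascending=False, method="min")
        .astype(int)
    )
    return allowed
```

A factor above 1.0 means the defense allows more than the league average to that position. Merging `dvp_factor` back onto the player-game rows by opponent and position would make it available as one more stratification column for the correlation step.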
Key Insight
Most sports analytics treats correlation as a single number. This pipeline treats it as a distribution that shifts with context and matchup. A player's points-assists correlation at home in high-pace games against a defense that's bottom-5 vs. their position is a fundamentally different signal than their season average — and that's where the edge lives.