# Predictive Power of 3-Pointer for Team Win% in 21st Century NBA

CMSC320 Final Tutorial
Section 0101 Dickerson

## Motivation & Introduction

Numerous reports and studies describe how the 3-pt shot has transformed the NBA over the past 5 years. The 3-pointer became dominant and popular in the NBA following the success of Stephen Curry and his Golden State Warriors (2015, 2017, & 2018 NBA Champions). Here are some great videos and articles about the "Three Point Revolution":

-How Data Changed the NBA by The Economist (Houston Rockets & Second Spectrum): https://www.youtube.com/watch?v=oUvvfHkXyOA&ab_channel=TheEconomist

-As discussed in the resources above, Data Science & Analytics made coaches, players, and team executives comfortable with adopting the 3-pointer. Now, we will use Data Science to test if those decisions paid off and are responsible for more wins.

-In this study, I will calculate and compare the predictive power of 3-pointer-related Season Average Player Statistics (3PFG%, 3PFGA, etc) in predicting NBA Regular Season team win-pct in different eras of the 21st century.

-The four eras will be 2000-2004, 2005-2009, 2010-2014, and 2015-2019.

-I will only use data from players that are "Starters" because they have an outsized impact on their team's Win% and are allowed to take a wider array of shots. I define a "Starter" as a player who starts >= 50% of the games they play for their team in a season, which mirrors how the NBA decides who counts as a non-"Starter" when awarding the 6th Man of the Year award.
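
The four eras listed above can be attached to each season with a small lookup. The sketch below is only an illustration: the names `ERAS` and `era_for_season` are hypothetical, and it assumes each season is labeled by the year in which it ends (2019 for the 2018-2019 season), which is how the Basketball-Reference URLs used later are numbered.

```python
# Hypothetical helper: maps a season label (the year the season ends,
# e.g. 2019 for 2018-19) to one of the four eras used in this study.
ERAS = [
    ("2000-2004", 2000, 2004),
    ("2005-2009", 2005, 2009),
    ("2010-2014", 2010, 2014),
    ("2015-2019", 2015, 2019),
]

def era_for_season(year):
    """Return the era label containing `year`, or None if it falls outside all eras."""
    for label, start, end in ERAS:
        if start <= year <= end:
            return label
    return None

for season in (2001, 2009, 2010, 2019):
    print(season, "->", era_for_season(season))
```

Returning `None` for out-of-range years makes it easy to spot seasons that were scraped by mistake.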

## My Questions/Hypotheses

-Confirm the narrative: are players actually attempting more 3-pointers as time goes on? (a Simple Linear Regression: 3FieldGoalAttempted ~ Year)

-Are players becoming more efficient 3-point shooters over time as a by-product of the league-wide increase in 3-pointers attempted? (a Simple Linear Regression: 3FieldGoal% ~ Year) My rationale: if more 3-pointers are being attempted overall (my first hypothesis), efficient 3-point shooting has likely become more valuable to teams, so teams will increasingly choose players with higher 3-point efficiency as their Starters.

-I hypothesize that 3-pt player stats (3-pointers attempted, 3-pointer efficiency, etc) will be statistically significant even when considering all of the other box score player stats for ALL 4 eras. (a Multiple Linear Regression: Win%~All Player Stats)

-Primarily, I am interested to see whether the predictive power of 3pt player stats starts to OUTWEIGH and DOMINATE the other player stats traditionally regarded as having the most predictive power (FG%, TREB, AST, etc) as time goes on.

## Data Collection (For Just 2018-2019 Season)

To show my Data Collection step by step with relevant output, I will first do a run-through for just the 2018-2019 NBA Season.

I will also be defining functions along the way that will be used later when repeating this process for all 19 seasons (2000-2001 through 2018-2019) in a for-loop.

In [1]:
import re
import requests
from bs4 import BeautifulSoup, Comment
from os import path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
from scipy import stats


Scrape Regular Season Per-Game Average Stats for all NBA players in 2018-2019 season.

Player Per-Game Season Averages URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_per_game.html

In [2]:
# Get Per-Game season averages for all NBA players from the 2018-2019 season.
player_avgs_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'

def scrape_players(url):
    r = requests.get(url)
    #print(r) # Make sure we get Response 200

    root = BeautifulSoup(r.content, 'lxml')
    #print(type(root)) # Make sure BeautifulSoup object initializes

    # Find just the HTML content under the table tag, which is where the data is!
    player_stats_table = root.find("table").prettify()

    # Use pandas's html reader to convert our prettified HTML table into a dataframe.
    player_df = pd.read_html(player_stats_table)[0]

    # Our df has more rows than the # of players shown on the original basketball-reference website.

    # This is caused by some rows that are just copies of the column headers, since these rows help readers
    # of the webpage remember what each column is as they scroll down the webpage.
    # We can just use some filtering to drop these rows

    # This is also caused by players that switched teams mid-season and have more than 1 row of season avg stats.
    # For the purposes of our study, we won't merge them into one row, since we can't just collapse their contributions
    # into just one team. Our study is about how the player's stats impacted the Win% of their team, so we need to keep
    # the player's contributions to different teams as separate.

    # For the reason above, we will also eventually drop rows with TOT as the team, since this is a row that combines
    # the stats of a player who played for more than one team in one season.

    # We will determine whether the player is a starter using a % cutoff (Games Started / Total Games for that team),
    # and not a strict # of Games Started cutoff so that we don't drop starter-caliber players that just happened
    # to switch teams mid-season.

    return player_df

player_df_2019 = scrape_players(player_avgs_2019_url)
print("player_df shape: ", player_df_2019.shape)
display(player_df_2019)

player_df shape:  (734, 30)

Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines SG 25 OKC 31 2 19.0 1.8 5.1 ... .923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy PF 28 PHO 10 0 12.3 0.4 1.8 ... .700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams PG 22 ATL 34 1 12.6 1.1 3.2 ... .778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams C 25 OKC 80 80 33.4 6.0 10.1 ... .500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo C 21 MIA 82 28 23.3 3.4 5.9 ... .735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
729 528 Tyler Zeller C 29 MEM 4 1 20.5 4.0 7.0 ... .778 2.3 2.3 4.5 0.8 0.3 0.8 1.0 4.0 11.5
730 529 Ante Žižić C 22 CLE 59 25 18.3 3.1 5.6 ... .705 1.8 3.6 5.4 0.9 0.2 0.4 1.0 1.9 7.8
731 530 Ivica Zubac C 21 TOT 59 37 17.6 3.6 6.4 ... .802 1.9 4.2 6.1 1.1 0.2 0.9 1.2 2.3 8.9
732 530 Ivica Zubac C 21 LAL 33 12 15.6 3.4 5.8 ... .864 1.6 3.3 4.9 0.8 0.1 0.8 1.0 2.2 8.5
733 530 Ivica Zubac C 21 LAC 26 25 20.2 3.8 7.2 ... .733 2.3 5.3 7.7 1.5 0.4 0.9 1.4 2.5 9.4

734 rows × 30 columns

Scrape NBA Team Standings at the end of the 2018-2019 season.

Team Final Standings URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_standings.html

For some reason, the well-formatted 'Expanded Standings' table from the URL above is stored inside HTML comments in the HTML source code. I had to track down which comment this was and use the fix found here to scrape the data: https://stackoverflow.com/a/52679343

In [3]:
team_standings_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_standings.html'

def scrape_standings(url):
    r = requests.get(url)
    #print(r) # Make sure we get Response 200

    root = BeautifulSoup(r.content, 'lxml')
    #print(type(root)) # Make sure BeautifulSoup object initializes

    # Scrape all the content within HTML Comments into a list comments
    comments = root.find_all(string=lambda text: isinstance(text, Comment))

    # By analyzing the list of comments above, I observed that the data we want
    # is always contained in index 26 for all 19 NBA seasons we will scrape for.
    all_standings_comment = comments[26]

    # Re-initialize our BeautifulSoup object to only parse and scrape the HTML contents
    # stored in the HTML Comment found above
    comment_root = BeautifulSoup(all_standings_comment, 'lxml')

    # Find all the data under the table tag, and prettify it so that it can be parsed
    all_standings_table = comment_root.find("table").prettify()

    # Convert the prettified HTML table into a dataframe
    team_df = pd.read_html(all_standings_table)[0]

    # The data originally has a MultiIndex column structure to help indicate
    # things like "Division", we don't need that.
    team_df.columns = team_df.columns.droplevel()

    return team_df

team_df_2019 = scrape_standings(team_standings_2019_url)
display(team_df_2019)

Rk Team Overall Home Road E W A C SE ... Post ≤3 ≥10 Oct Nov Dec Jan Feb Mar Apr
0 1 Milwaukee Bucks 60-22 33-8 27-14 40-12 20-10 13-5 14-2 13-5 ... 17-8 5-6 45-5 7-0 8-6 10-4 12-3 10-1 10-6 3-2
1 2 Toronto Raptors 58-24 32-9 26-15 36-16 22-8 12-4 10-8 14-4 ... 15-8 11-7 33-9 7-1 12-3 8-7 10-5 8-1 9-6 4-1
2 3 Golden State Warriors 57-25 30-11 27-14 22-8 35-17 6-4 8-2 8-2 ... 16-9 7-7 34-10 8-1 7-7 10-5 11-2 7-4 9-5 5-1
3 4 Denver Nuggets 54-28 34-7 20-21 20-10 34-18 7-3 6-4 7-3 ... 15-10 13-3 23-11 6-1 9-6 8-4 12-4 7-4 9-6 3-3
4 5 Houston Rockets 53-29 31-10 22-19 21-9 32-20 8-2 6-4 7-3 ... 20-5 5-7 29-12 1-5 9-6 11-4 8-6 8-4 12-3 4-1
5 6 Portland Trail Blazers 53-29 32-9 21-20 24-6 29-23 9-1 8-2 7-3 ... 19-6 4-6 29-8 5-2 8-7 8-7 11-4 6-3 10-5 5-1
6 7 Philadelphia 76ers 51-31 31-10 20-21 31-21 20-10 8-8 12-6 11-7 ... 14-10 10-8 22-16 4-4 12-4 7-6 11-4 6-4 9-5 2-4
7 8 Utah Jazz 50-32 29-12 21-20 20-10 30-22 6-4 7-3 7-3 ... 18-7 0-7 34-12 4-3 7-9 7-7 11-4 6-3 11-4 4-2
8 9 Boston Celtics 49-33 28-13 21-20 35-17 14-16 10-6 13-5 12-6 ... 12-12 5-6 24-12 5-2 7-8 9-5 11-4 5-6 8-7 4-1
9 10 Oklahoma City Thunder 49-33 27-14 22-19 21-9 28-24 6-4 8-2 7-3 ... 12-13 6-7 23-12 2-4 12-3 9-6 9-5 6-5 6-10 5-0
10 11 Indiana Pacers 48-34 29-12 19-22 33-19 15-15 9-9 11-5 13-5 ... 10-14 6-6 23-16 5-3 8-6 12-3 7-7 9-3 4-10 3-2
11 12 Los Angeles Clippers 48-34 26-15 22-19 20-10 28-24 6-4 7-3 7-3 ... 16-7 6-2 23-18 4-3 11-3 6-9 7-9 6-5 13-2 1-3
12 13 San Antonio Spurs 48-34 32-9 16-25 18-12 30-22 6-4 7-3 5-5 ... 15-8 7-4 25-16 5-2 5-10 11-5 10-5 3-7 10-4 4-1
13 14 Brooklyn Nets 42-40 23-18 19-22 29-23 13-17 8-8 10-8 11-7 ... 12-11 12-8 16-22 3-5 5-10 9-6 11-4 4-6 7-7 3-2
14 15 Orlando Magic 42-40 25-16 17-24 30-22 12-18 11-7 9-9 10-6 ... 15-8 5-6 21-21 2-5 9-7 5-8 5-11 8-3 9-5 4-1
15 16 Detroit Pistons 41-41 26-15 15-26 27-25 14-16 10-8 8-8 9-9 ... 15-11 6-9 16-21 4-3 8-4 4-11 6-10 7-3 10-6 2-4
16 17 Charlotte Hornets 39-43 25-16 14-27 29-23 10-20 8-10 11-7 10-6 ... 12-13 6-10 21-20 4-4 7-7 7-7 6-8 4-7 7-8 4-2
17 18 Miami Heat 39-43 19-22 20-21 23-29 16-14 7-11 9-9 7-9 ... 13-13 6-9 16-20 3-4 5-9 9-5 7-7 3-9 11-4 1-5
18 19 Sacramento Kings 39-43 24-17 15-26 18-12 21-31 3-7 7-3 8-2 ... 9-16 6-7 16-18 5-3 5-8 9-6 7-8 5-5 7-9 1-4
19 20 Los Angeles Lakers 37-45 22-19 15-26 12-18 25-27 1-9 5-5 6-4 ... 9-16 5-4 16-25 3-5 10-4 8-7 6-9 3-6 5-11 2-3
20 21 Minnesota Timberwolves 36-46 25-16 11-30 14-16 22-30 4-6 5-5 5-5 ... 9-16 8-5 19-22 4-4 7-7 6-9 8-6 4-7 5-9 2-4
21 22 Dallas Mavericks 33-49 24-17 9-32 15-15 18-34 4-6 6-4 5-5 ... 7-18 7-7 12-22 2-6 8-4 7-9 6-9 4-6 3-12 3-3
22 23 Memphis Grizzlies 33-49 21-20 12-29 9-21 24-28 3-7 3-7 3-7 ... 10-13 6-9 14-19 4-2 9-6 5-10 2-14 4-7 7-7 2-3
23 24 New Orleans Pelicans 33-49 19-22 14-27 10-20 23-29 3-7 5-5 2-8 ... 7-16 4-6 15-20 4-3 7-9 6-9 6-8 4-7 5-10 1-3
24 25 Washington Wizards 32-50 22-19 10-31 19-33 13-17 6-12 6-12 7-9 ... 8-16 5-6 14-23 1-6 7-8 6-9 8-6 3-7 7-10 0-4
25 26 Atlanta Hawks 29-53 17-24 12-29 16-36 13-17 4-14 6-12 6-10 ... 10-14 9-6 7-31 2-5 3-13 6-7 5-9 5-7 7-8 1-4
26 27 Chicago Bulls 22-60 9-32 13-28 16-36 6-24 4-14 3-13 9-9 ... 8-16 8-8 9-31 2-6 3-12 5-9 2-13 5-5 4-11 1-4
27 28 Cleveland Cavaliers 19-63 13-28 6-35 15-37 4-26 6-12 4-12 5-13 ... 7-17 5-4 6-43 1-6 3-11 4-12 3-12 4-6 4-11 0-5
28 29 Phoenix Suns 19-63 12-29 7-34 8-22 11-41 3-7 3-7 2-8 ... 8-15 5-5 5-40 1-6 3-12 5-11 2-13 1-8 5-10 2-3
29 30 New York Knicks 17-65 9-32 8-33 11-41 6-24 2-14 3-15 6-12 ... 6-18 4-7 6-41 2-6 5-10 2-12 1-12 3-9 1-13 3-3

30 rows × 24 columns
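
The comment-extraction step inside scrape_standings can be illustrated in isolation. The sketch below runs on a toy HTML string that I made up (it is not real Basketball-Reference markup) and uses Python's built-in `html.parser` so that it runs without a network connection. Instead of a fixed index, it picks the first comment that contains a `<table`, which is an assumption of this sketch rather than what the function above does.

```python
from bs4 import BeautifulSoup, Comment

# Toy page (not real Basketball-Reference markup): the table of interest
# is wrapped in an HTML comment, so a normal find("table") cannot see it.
html = """
<html><body>
  <div id="visible"><p>Standings</p></div>
  <!-- just a note -->
  <!--
  <table id="expanded_standings">
    <tr><th>Team</th><th>Overall</th></tr>
    <tr><td>Milwaukee Bucks</td><td>60-22</td></tr>
    <tr><td>Toronto Raptors</td><td>58-24</td></tr>
  </table>
  -->
</body></html>
"""

root = BeautifulSoup(html, "html.parser")
print(root.find("table"))  # None: the table is hidden inside a comment

# Collect every comment node, then keep the first one that contains a table.
comments = root.find_all(string=lambda text: isinstance(text, Comment))
table_comment = next(c for c in comments if "<table" in c)

# Parse the comment's text as its own HTML document.
comment_root = BeautifulSoup(table_comment, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in comment_root.find_all("tr")
]
print(rows)
```

Selecting the comment by its content may be less brittle than a hard-coded position if the page layout ever changes, although the fixed index worked for every season scraped in this tutorial.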

Scrape Current-Day NBA Abbreviations for NBA Teams

In [4]:
# NOTE: team_abbrev_url (the page listing current team abbreviations) must be
# defined before calling this function.
def scrape_abbrev():
    r = requests.get(team_abbrev_url)
    #print(r) # Make sure we get Response 200

    root = BeautifulSoup(r.content, 'lxml')
    #print(type(root)) # Make sure BeautifulSoup object initializes

    abbrev_table = root.find("table").prettify()

    # Convert the prettified HTML table into a dataframe
    abbrev_df = pd.read_html(abbrev_table)[0]

    # Rename columns, and remove first row (which is just the header from the original data source)
    abbrev_df.columns = ["Abbrev", "Franchise"]
    abbrev_df = abbrev_df[1:]

    display(abbrev_df)

    return abbrev_df

abbrev_df = scrape_abbrev()

Abbrev Franchise
1 ATL Atlanta Hawks
2 BKN Brooklyn Nets
3 BOS Boston Celtics
4 CHA Charlotte Hornets
5 CHI Chicago Bulls
6 CLE Cleveland Cavaliers
7 DAL Dallas Mavericks
8 DEN Denver Nuggets
9 DET Detroit Pistons
10 GSW Golden State Warriors
11 HOU Houston Rockets
12 IND Indiana Pacers
13 LAC Los Angeles Clippers
14 LAL Los Angeles Lakers
15 MEM Memphis Grizzlies
16 MIA Miami Heat
17 MIL Milwaukee Bucks
18 MIN Minnesota Timberwolves
19 NOP New Orleans Pelicans
20 NYK New York Knicks
21 OKC Oklahoma City Thunder
22 ORL Orlando Magic
23 PHI Philadelphia 76ers
24 PHX Phoenix Suns
25 POR Portland Trail Blazers
26 SAC Sacramento Kings
27 SAS San Antonio Spurs
28 TOR Toronto Raptors
29 UTA Utah Jazz
30 WAS Washington Wizards

## Data Preparation (For Just 2018-2019 Season)

Like Data Collection, I will show this process step by step for the 2018-2019 season dataframes created above, defining functions along the way that will help when tackling all 19 seasons in a for-loop.

player_df:

-Remove column header copy rows. These rows are in the website to help readers remember what each column means as they scroll down the webpage.

-Remove rows where Tm column value = TOT, as these are combined stats for players who played on more than one team in a season (due to trades). We can't have this, as it won't merge properly with our team standings table, since there is no team with abbreviation TOT.

In [5]:
def dataprep_players(player_df):
    # remove rows that are just copies of the column headers
    # they all have their name in the Player column as 'Player', so we can use this to filter them out
    # Drop all rows with Player name = 'Player'
    player_df = player_df[player_df['Player'] != 'Player']

    # Drop rows with Tm=TOT, since these are combined stats for players who played on
    # more than 1 team this season. We can't have this for our final analysis
    player_df = player_df[player_df['Tm'] != 'TOT']

    return player_df

print('Before Dropping Header Copy Rows & Tm=TOT rows: ', player_df_2019.shape)
player_df_2019 = dataprep_players(player_df_2019)
# Check that the overall # of rows has decreased.
print('After Dropping Header Copy Rows & TOT rows: ', player_df_2019.shape)

Before Dropping Header Copy Rows & Tm=TOT rows:  (734, 30)
After Dropping Header Copy Rows & TOT rows:  (622, 30)


team_df:

-Drop all columns except Team and Overall

-Use regex to create W and L columns from the values in the Overall column. W is the # of wins the team had that season, and L is the # of losses.

-Use W and L columns to construct the win_pct column (Formula: Wins/Total Games or W/(W+L))

-Using win_pct instead of Games Won naturally normalizes across seasons. Some seasons have fewer than the standard 82 games (lockout-shortened seasons, the COVID-19 pandemic, etc), so Games Won should not be the value used as our response variable.
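
The parsing and the win_pct formula described above can also be written in vectorized form. This is only a sketch on three records taken from the 2018-2019 standings table above, and it is an alternative to the row-by-row loop used in this tutorial, not a replacement for it.

```python
import pandas as pd

# Toy standings using three records shown in this tutorial's 2018-19 table.
team_df = pd.DataFrame({
    "Team": ["Milwaukee Bucks", "Detroit Pistons", "New York Knicks"],
    "Overall": ["60-22", "41-41", "17-65"],
})

# Capture wins and losses in one pass; expand=True returns one column per group.
record = team_df["Overall"].str.extract(r"^(\d{1,2})-(\d{1,2})$", expand=True).astype(int)
team_df["W"] = record[0]
team_df["L"] = record[1]

# win_pct = W / (W + L)
team_df["win_pct"] = team_df["W"] / (team_df["W"] + team_df["L"])
print(team_df)
```

One difference from the loop-based version is that W and L stay integers here instead of becoming floats.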

In [6]:
def dataprep_teams(team_df, abbrev_df):
    # Merge with abbrev_df to bring over the abbreviations
    team_df = team_df.merge(right=abbrev_df, left_on='Team', right_on='Franchise')

    # Drop all columns except Team, Overall record, and Abbrev
    team_df = team_df[['Abbrev', 'Team', 'Overall']].copy()

    # Create a W and L column out of the Overall column
    for row in team_df.iterrows():
        # Get the current row's Overall column value
        curr_overall = row[1]['Overall']

        # Extract wins and losses from Overall column value, and store the match groups
        w_and_l = re.search(r"^(\d{1,2})-(\d{1,2})$", curr_overall).groups()

        # use the tuple of match groups to retrieve wins and losses
        wins = int(w_and_l[0])
        losses = int(w_and_l[1])

        # Store these values in new columns W and L
        curr_index = row[0]
        team_df.at[curr_index, 'W'] = wins
        team_df.at[curr_index, 'L'] = losses

    # No longer need Overall column
    team_df = team_df[['Abbrev', 'Team', 'W', 'L']].copy()

    # Create win_pct column using formula: W / (W+L)
    team_df['win_pct'] = team_df['W'] / (team_df['W'] + team_df['L'])

    return team_df

team_df_2019 = dataprep_teams(team_df_2019, abbrev_df)
display(team_df_2019)

Abbrev Team W L win_pct
0 MIL Milwaukee Bucks 60.0 22.0 0.731707
1 TOR Toronto Raptors 58.0 24.0 0.707317
2 GSW Golden State Warriors 57.0 25.0 0.695122
3 DEN Denver Nuggets 54.0 28.0 0.658537
4 HOU Houston Rockets 53.0 29.0 0.646341
5 POR Portland Trail Blazers 53.0 29.0 0.646341
6 PHI Philadelphia 76ers 51.0 31.0 0.621951
7 UTA Utah Jazz 50.0 32.0 0.609756
8 BOS Boston Celtics 49.0 33.0 0.597561
9 OKC Oklahoma City Thunder 49.0 33.0 0.597561
10 IND Indiana Pacers 48.0 34.0 0.585366
11 LAC Los Angeles Clippers 48.0 34.0 0.585366
12 SAS San Antonio Spurs 48.0 34.0 0.585366
13 BKN Brooklyn Nets 42.0 40.0 0.512195
14 ORL Orlando Magic 42.0 40.0 0.512195
15 DET Detroit Pistons 41.0 41.0 0.500000
16 CHA Charlotte Hornets 39.0 43.0 0.475610
17 MIA Miami Heat 39.0 43.0 0.475610
18 SAC Sacramento Kings 39.0 43.0 0.475610
19 LAL Los Angeles Lakers 37.0 45.0 0.451220
20 MIN Minnesota Timberwolves 36.0 46.0 0.439024
21 DAL Dallas Mavericks 33.0 49.0 0.402439
22 MEM Memphis Grizzlies 33.0 49.0 0.402439
23 NOP New Orleans Pelicans 33.0 49.0 0.402439
24 WAS Washington Wizards 32.0 50.0 0.390244
25 ATL Atlanta Hawks 29.0 53.0 0.353659
26 CHI Chicago Bulls 22.0 60.0 0.268293
27 CLE Cleveland Cavaliers 19.0 63.0 0.231707
28 PHX Phoenix Suns 19.0 63.0 0.231707
29 NYK New York Knicks 17.0 65.0 0.207317

Merge player_df and team_df into player_winpct_df:

-With player_df as left and team_df as right, inner-join on the team abbreviation (Tm = Abbrev) to create player_winpct_df

-Filter down to only starters from each year. This helps surface patterns in 3-pointers attempted (3PA), as non-starters probably won't show a noticeable increase since they don't get many shots per game to begin with. Use the definition of starter implied by the 6th Man of the Year award (https://en.wikipedia.org/wiki/NBA_Sixth_Man_of_the_Year_Award): to count as a non-"Starter", a player must start in fewer than 50% of their games for their team. Therefore, we define starters as players that start in >= 50% of their games. We create a new column start_pct by dividing the GS column by the G column, and use it to filter rows.

-Assign a new column Year for each of the 19 seasons (just a constant 2019 in this example case). This will help with merges and visualizing our data later on.

-Only keep the following columns (to clean up and only keep statistics for our Linear Regressions that are simple and easy to interpret): Year, Player, Tm, start_pct, MP, PTS, TRB, AST, FGA, FG%, 3PA, 3P%, 2PA, 2P%, FTA, FT%, STL, BLK, TOV, PF, start_pct, win_pct

-Stats like FG (Field Goals made), FT, 3P, and 2P are omitted because we also have the corresponding Attempts and Accuracy (%) stats, so we can always derive "Made" stats like FG from those two stats if needed. Including all three could introduce multicollinearity and noise into our Linear Regression, since the three stats are derived from one another.
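
As a quick check of the claim that "Made" stats can be derived from Attempts and Accuracy, the sketch below uses Steven Adams's 2018-2019 per-game numbers as they appear in the tables in this tutorial. The variable names are my own, and the match is only approximate in general because the published values are rounded.

```python
# Steven Adams, 2018-19 per-game averages as shown in this tutorial's tables.
fga = 10.1       # field goals attempted per game
fg_pct = 0.595   # field goal percentage
fg_listed = 6.0  # field goals made per game, as listed

# Made shots are (up to rounding) attempts times accuracy.
fg_derived = fga * fg_pct
print(round(fg_derived, 1), fg_listed)
```

Because one column is close to the product of the other two, keeping all three would give the regression nearly redundant predictors.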

In [7]:
def dataprep_combined(player_df, team_df, year):
    # Initialize player_winpct_df by inner-joining on Team abbreviation,
    # with player_df as left and team_df as right
    player_winpct_df = player_df.merge(right=team_df[['Abbrev', 'Team', 'win_pct']], left_on='Tm', right_on='Abbrev')

    # Convert G and GS columns to integers so we can create start_pct column
    player_winpct_df['G'] = player_winpct_df['G'].astype(int)
    player_winpct_df['GS'] = player_winpct_df['GS'].astype(int)

    # Create start_pct column (GS / G)
    player_winpct_df['start_pct'] = player_winpct_df['GS'] / player_winpct_df['G']

    # Only keep rows with start_pct >= 0.500
    player_winpct_df = player_winpct_df[player_winpct_df['start_pct'] >= 0.500].copy()

    # Add the Year column
    player_winpct_df['Year'] = year

    # Only keep columns listed above
    player_winpct_df = player_winpct_df[["Year", "Player", "Tm", "start_pct", "MP", "PTS", "TRB", "AST",
                                         "FGA", "FG%", "3PA", "3P%", "2PA", "2P%", "FTA", "FT%",
                                         "STL", "BLK", "TOV", "PF", "start_pct", "win_pct"]]

    return player_winpct_df

player_winpct_df_2019 = dataprep_combined(player_df_2019, team_df_2019, 2019)
display(player_winpct_df_2019)

Year Player Tm start_pct MP PTS TRB AST FGA FG% ... 2PA 2P% FTA FT% STL BLK TOV PF start_pct win_pct
1 2019 Steven Adams OKC 1.000000 33.4 13.9 9.5 1.6 10.1 .595 ... 10.1 .596 3.7 .500 1.5 1.0 1.7 2.6 1.000000 0.597561
7 2019 Terrance Ferguson OKC 1.000000 26.1 6.9 1.9 1.0 5.8 .429 ... 1.9 .560 0.7 .725 0.5 0.2 0.6 3.1 1.000000 0.597561
8 2019 Paul George OKC 1.000000 36.9 28.0 8.2 4.1 21.0 .438 ... 11.1 .484 7.0 .839 2.2 0.4 2.7 2.8 1.000000 0.597561
9 2019 Jerami Grant OKC 0.962500 32.7 13.6 5.2 1.0 10.3 .497 ... 6.6 .555 2.8 .710 0.8 1.3 0.8 2.7 0.962500 0.597561
17 2019 Russell Westbrook OKC 1.000000 36.0 22.9 11.1 10.7 20.2 .428 ... 14.5 .481 6.2 .656 1.9 0.5 4.5 3.4 1.000000 0.597561
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
543 2019 Reggie Bullock DET 1.000000 30.8 12.1 2.8 2.5 10.0 .413 ... 3.3 .463 1.5 .875 0.5 0.1 1.2 1.8 1.000000 0.500000
545 2019 Andre Drummond DET 1.000000 33.5 17.3 15.6 1.4 13.3 .533 ... 12.8 .548 5.2 .590 1.7 1.7 2.2 3.4 1.000000 0.500000
547 2019 Wayne Ellington DET 0.928571 27.3 12.0 2.1 1.5 9.8 .421 ... 2.0 .607 1.2 .758 1.1 0.1 0.9 1.9 0.928571 0.500000
549 2019 Blake Griffin DET 1.000000 35.0 24.5 7.5 5.4 17.9 .462 ... 10.9 .525 7.3 .753 0.7 0.4 3.4 2.7 1.000000 0.500000
550 2019 Reggie Jackson DET 1.000000 27.9 15.4 2.6 4.2 12.8 .421 ... 7.0 .464 2.9 .864 0.7 0.1 1.8 2.5 1.000000 0.500000

180 rows × 22 columns
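
To see why a percentage cutoff is used instead of a raw count of games started, consider a few rows from the 2018-2019 player table shown earlier. Ivica Zubac was traded mid-season, so neither of his team rows reaches a high number of starts. The count-based rule below (41 starts, half of an 82-game season) is a hypothetical alternative that I include only for comparison.

```python
import pandas as pd

# Games played (G) and games started (GS) copied from the 2018-19 player table above.
rows = pd.DataFrame({
    "Player": ["Ivica Zubac", "Ivica Zubac", "Steven Adams", "Bam Adebayo"],
    "Tm": ["LAL", "LAC", "OKC", "MIA"],
    "G": [33, 26, 80, 82],
    "GS": [12, 25, 80, 28],
})

rows["start_pct"] = rows["GS"] / rows["G"]

# Percentage rule used in this tutorial.
starters_by_pct = rows[rows["start_pct"] >= 0.5]

# Hypothetical alternative for comparison only: a fixed count of 41 starts
# (half of an 82-game season).
starters_by_count = rows[rows["GS"] >= 41]

print(starters_by_pct[["Player", "Tm", "start_pct"]])
print(starters_by_count[["Player", "Tm", "GS"]])
```

The percentage rule keeps Zubac's Clippers stint (25 starts in 26 games), while the count-based rule would drop it.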

## Data Collection & Preparation (For All Seasons 2000-2001 to 2018-2019)

Basically, we repeat all of the steps covered in the example above for all 19 seasons. We will union the 19 player_winpct_df's into one large df (which is why we added the Year column to help us differentiate after the union).

The result of this procedure is what we will use for our upcoming EDA & Linear Regression analysis!

abbrev_df

-Re-create our abbrev_df using the scrape_abbrev() function from before. We will just use one abbrev-to-team table for all 19 seasons.

-Manually add the abbreviations and team names of the few teams that changed names/abbreviations from 2000-2019, so that the merges with and between the player_df and team_df tables happen smoothly.
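
One way to find which abbreviations need a manual entry is a left join with pandas's `indicator=True` option, which flags rows that found no partner. The sketch below uses toy subsets of the two tables; the dataframe names are hypothetical.

```python
import pandas as pd

# Abbreviations as they appear in the player table (toy subset).
players = pd.DataFrame({"Tm": ["OKC", "BRK", "PHO", "CHO"]})

# Current-day abbreviations from the abbreviation table (toy subset).
abbrevs = pd.DataFrame({
    "Abbrev": ["OKC", "BKN", "PHX", "CHA"],
    "Franchise": ["Oklahoma City Thunder", "Brooklyn Nets",
                  "Phoenix Suns", "Charlotte Hornets"],
})

# A left join with indicator=True adds a "_merge" column telling us whether
# each player-table abbreviation found a partner in the abbreviation table.
check = players.merge(abbrevs, how="left", left_on="Tm", right_on="Abbrev", indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only", "Tm"])
print(unmatched)
```

Any abbreviation listed in `unmatched` would be silently dropped by an inner join, so it needs a manual mapping.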

In [8]:
# Get the current-day abbrev-->team pairs
abbrev_df = scrape_abbrev()

Abbrev Franchise
1 ATL Atlanta Hawks
2 BKN Brooklyn Nets
3 BOS Boston Celtics
4 CHA Charlotte Hornets
5 CHI Chicago Bulls
6 CLE Cleveland Cavaliers
7 DAL Dallas Mavericks
8 DEN Denver Nuggets
9 DET Detroit Pistons
10 GSW Golden State Warriors
11 HOU Houston Rockets
12 IND Indiana Pacers
13 LAC Los Angeles Clippers
14 LAL Los Angeles Lakers
15 MEM Memphis Grizzlies
16 MIA Miami Heat
17 MIL Milwaukee Bucks
18 MIN Minnesota Timberwolves
19 NOP New Orleans Pelicans
20 NYK New York Knicks
21 OKC Oklahoma City Thunder
22 ORL Orlando Magic
23 PHI Philadelphia 76ers
24 PHX Phoenix Suns
25 POR Portland Trail Blazers
26 SAC Sacramento Kings
27 SAS San Antonio Spurs
28 TOR Toronto Raptors
29 UTA Utah Jazz
30 WAS Washington Wizards
In [9]:
# Add some manual rows of teams with different abbreviations and names in the past
manual_abbrevs = [
    # Current Teams w/ Diff Abbrev in our dataset
    ['BRK', 'Brooklyn Nets'],
    ['PHO', 'Phoenix Suns'],
    ['CHO', 'Charlotte Hornets'],
    # Teams that no longer exist
    ['VAN', 'Vancouver Grizzlies'],
    ['SEA', 'Seattle SuperSonics'],
    ['CHH', 'Charlotte Hornets'],
    ['NJN', 'New Jersey Nets'],
    ['NOH', 'New Orleans Hornets'],
    ['NOK', 'New Orleans/Oklahoma City Hornets'],
    ['CHA', 'Charlotte Bobcats'], # Need to remove existing mapping of CHA!
]

# Remove current mapping for 'CHA', 'BKN', and 'PHX', since those
# abbrevs don't exist or mean something else in the basketballreference.com database
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'CHA']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'BKN']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'PHX']

other_abbrevs = pd.DataFrame(manual_abbrevs, columns=["Abbrev", "Franchise"])

# Stack the manual rows under the scraped ones and renumber the index
abbrev_df = pd.concat([abbrev_df, other_abbrevs], ignore_index=True)
display(abbrev_df)

Abbrev Franchise
0 ATL Atlanta Hawks
1 BOS Boston Celtics
2 CHI Chicago Bulls
3 CLE Cleveland Cavaliers
4 DAL Dallas Mavericks
5 DEN Denver Nuggets
6 DET Detroit Pistons
7 GSW Golden State Warriors
8 HOU Houston Rockets
9 IND Indiana Pacers
10 LAC Los Angeles Clippers
11 LAL Los Angeles Lakers
12 MEM Memphis Grizzlies
13 MIA Miami Heat
14 MIL Milwaukee Bucks
15 MIN Minnesota Timberwolves
16 NOP New Orleans Pelicans
17 NYK New York Knicks
18 OKC Oklahoma City Thunder
19 ORL Orlando Magic
20 PHI Philadelphia 76ers
21 POR Portland Trail Blazers
22 SAC Sacramento Kings
23 SAS San Antonio Spurs
24 TOR Toronto Raptors
25 UTA Utah Jazz
26 WAS Washington Wizards
27 BRK Brooklyn Nets
28 PHO Phoenix Suns
29 CHO Charlotte Hornets
30 VAN Vancouver Grizzlies
31 SEA Seattle SuperSonics
32 CHH Charlotte Hornets
33 NJN New Jersey Nets
34 NOH New Orleans Hornets
35 NOK New Orleans/Oklahoma City Hornets
36 CHA Charlotte Bobcats

player_df, team_df, and player_winpct_df for all 19 seasons:

In [10]:
# Generic URLs that we can format a year into, so we can iterate through the 19 seasons
player_avgs_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'
team_standings_url = 'https://www.basketball-reference.com/leagues/NBA_{}_standings.html'

# List that will collect the results of each year's scraping + data prep
season_dfs = []

# Iterate from year 2001 (inclusive) to 2020 (exclusive) so that we can retrieve the right data
for year in range(2001, 2020):
    curr_player_avg_url = player_avgs_url.format(year)
    curr_team_standing_url = team_standings_url.format(year)

    # Data Collection
    player_df = scrape_players(curr_player_avg_url)
    team_df = scrape_standings(curr_team_standing_url)

    # Data Prep
    player_df = dataprep_players(player_df)
    team_df = dataprep_teams(team_df, abbrev_df)

    # Combine into one df
    player_winpct_df = dataprep_combined(player_df, team_df, year)

    # Collect this season's df
    season_dfs.append(player_winpct_df)

# Union all seasons into our final df
final_player_winpct_df = pd.concat(season_dfs, ignore_index=True)

# Convert all numeric columns from string to float to make sure our plotting and Linear Regressions run smoothly
for col_name in final_player_winpct_df.columns[3:]:
    final_player_winpct_df[col_name] = final_player_winpct_df[col_name].astype(float)

In [11]:
print(final_player_winpct_df.shape)
# Make sure all teams are represented
# Should be 37, since that is how many we had in our abbrev_df
print(len(final_player_winpct_df['Tm'].unique()))
# Another way to check is to make sure the following equality is true
# This makes sure the abbrevs from abbrev_df are the same as the abbrevs
# actually captured from our 19 seasons on BasketballReference.com
print(set(abbrev_df['Abbrev']) == (set(final_player_winpct_df['Tm'].unique())))
display(final_player_winpct_df)

(3510, 22)
37
True

Year Player Tm start_pct MP PTS TRB AST FGA FG% ... 2PA 2P% FTA FT% STL BLK TOV PF start_pct win_pct
0 2001 Shareef Abdur-Rahim VAN 1.000000 40.0 20.5 9.1 3.1 15.8 0.472 ... 15.0 0.487 6.6 0.834 1.1 1.0 2.9 2.9 1.000000 0.280488
1 2001 Mike Bibby VAN 1.000000 38.9 15.9 3.7 8.4 14.1 0.454 ... 10.6 0.478 2.3 0.761 1.3 0.1 3.0 1.8 1.000000 0.280488
2 2001 Michael Dickerson VAN 0.985714 37.4 16.3 3.3 3.3 14.6 0.417 ... 11.3 0.429 3.9 0.763 0.9 0.4 2.3 3.0 0.985714 0.280488
3 2001 Othella Harrington VAN 0.909091 28.8 10.9 6.6 0.8 8.8 0.466 ... 8.7 0.470 3.5 0.779 0.4 0.6 2.4 3.1 0.909091 0.280488
4 2001 Bryant Reeves VAN 0.640000 24.4 8.3 6.0 1.1 7.4 0.460 ... 7.3 0.462 1.9 0.796 0.6 0.7 1.2 3.2 0.640000 0.280488
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3505 2019 Reggie Bullock DET 1.000000 30.8 12.1 2.8 2.5 10.0 0.413 ... 3.3 0.463 1.5 0.875 0.5 0.1 1.2 1.8 1.000000 0.500000
3506 2019 Andre Drummond DET 1.000000 33.5 17.3 15.6 1.4 13.3 0.533 ... 12.8 0.548 5.2 0.590 1.7 1.7 2.2 3.4 1.000000 0.500000
3507 2019 Wayne Ellington DET 0.928571 27.3 12.0 2.1 1.5 9.8 0.421 ... 2.0 0.607 1.2 0.758 1.1 0.1 0.9 1.9 0.928571 0.500000
3508 2019 Blake Griffin DET 1.000000 35.0 24.5 7.5 5.4 17.9 0.462 ... 10.9 0.525 7.3 0.753 0.7 0.4 3.4 2.7 1.000000 0.500000
3509 2019 Reggie Jackson DET 1.000000 27.9 15.4 2.6 4.2 12.8 0.421 ... 7.0 0.464 2.9 0.864 0.7 0.1 1.8 2.5 1.000000 0.500000

3510 rows × 22 columns
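
Beyond the abbreviation check above, two further sanity checks may be useful: confirming that every season was captured, and counting how many starter rows each team-season contributes. The sketch below runs on a small made-up stand-in for final_player_winpct_df, so the values it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for final_player_winpct_df (values are made up for illustration).
df = pd.DataFrame({
    "Year": [2001, 2001, 2001, 2002, 2002, 2019],
    "Tm": ["VAN", "VAN", "SEA", "MEM", "SEA", "DET"],
    "Player": ["A", "B", "C", "D", "E", "F"],
})

# Seasons are labeled by ending year, so 2001 through 2019 inclusive.
expected_years = set(range(2001, 2020))
missing_years = sorted(expected_years - set(df["Year"]))

# Number of starter rows per team-season.
starters_per_team = df.groupby(["Year", "Tm"]).size()

print(len(expected_years))
print(missing_years[:3])
print(starters_per_team)
```

On the real dataframe, `missing_years` should be empty, and an unusually small or large count in `starters_per_team` would point to a merge or filtering problem.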

## Want to try out this dataset yourself?

With the code below, I saved final_player_winpct_df to a CSV file and uploaded it online.

In [12]:
#final_player_winpct_df.to_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")


## EDA (Exploratory Data Analysis)

Plot a Violin Plot of the 3PA (3-Pointers Attempted Per Game) stat against time to see visually whether there is a linear trend (across all 19 seasons).

This helps to answer the initial question of whether 3-point shooting is actually more popular in the 2015-2019 era than in previous eras.

In [13]:
# Create a new column 'Year Short' to be used when plotting so that the plot is cleaner
# Basically just the Year column with only the last 2 digits.
final_player_winpct_df['Year Short'] = final_player_winpct_df['Year'] % 100

In [14]:
sns.violinplot(x='Year Short',y='3PA', data=final_player_winpct_df)
plt.title("3-Pointers Attempted Per Game Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointers Attempted Per Game")

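[Figure: violin plots of 3PA by season. Title: "3-Pointers Attempted Per Game Over Time"; x-axis: "Year (2000-2019)"; y-axis: "3-Pointers Attempted Per Game"]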

The white dots in each violin mark the median 3-Pt attempts per game for that year. Following these white dots, we can see a positive, roughly linear trend. There is a relatively steep increase from 2013 to 2014 and onwards, which is right around the time the "Three Point Revolution" began to be noticed and implemented across the league.

The violins also show that 3-Pt attempts were right-skewed in the early 2000s: the violins are fatter and wider towards the bottom and become skinnier at the larger 3-Pt attempt values on the y-axis. This means that most players were concentrated around a low # of 3-Pt shots attempted per game, and relatively few players took many 3-Pt shots.

This right-skewed pattern continues until 2014, when the violins start to look bimodal (2014, 2015, and 2016). This means there is a large concentration of players taking few 3-Pt shots and a similarly large concentration taking relatively more shots, with few in the middle.

In 2017, 2018, and 2019, the violins become very skinny and don't have any noticeable peaks or skewness, which indicates a more uniform distribution. By 2019, we can start to see a new pattern emerging, with a single peak in the middle of the violin near the median.

(The needle-like tops of the violins are caused by a few outlier specialists, likely star players or very efficient 3-Pt marksmen, who are allowed to take more 3-pointers than most players in the league. These players have existed to some extent in all years, but the extent to which they shoot more 3s differs: the outlier specialists from 2016 onwards take considerably more 3s than the outlier specialists from years like 2009-2012.)

We will have to do a formal Linear Regression and t-test to determine whether this increase, from a median of about 2 3-Pt attempts per game (2001 violin) to a median of about 4 (2019 violin), is statistically significant.
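
The visual impressions above (medians and skewness) can also be put into numbers with a per-year summary. The sketch below uses synthetic 3PA values that I made up to mimic a right-skewed early season and a higher-median later season; they are not values from the scraped dataset.

```python
import pandas as pd

# Synthetic 3PA values, made up to mimic the shapes described above;
# these are not values from the scraped dataset.
toy = pd.DataFrame({
    "Year": [2001] * 6 + [2019] * 6,
    "3PA": [0.1, 0.5, 1.0, 2.0, 3.0, 7.0,   # many low values, one high outlier
            1.0, 3.0, 4.0, 4.5, 5.5, 8.0],
})

# Median, mean, and sample skewness of 3PA for each year.
summary = toy.groupby("Year")["3PA"].agg(["median", "mean", "skew"])
print(summary)
```

A mean well above the median together with a positive skew value corresponds to the bottom-heavy violins described for the early 2000s. Running the same `groupby` on final_player_winpct_df would give the actual figures.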

Plot a Violin Plot of the 3P% (3-Pointer Make % or Efficiency) stat against time to see visually whether there is a linear trend (across all 19 seasons).

This helps to see if NBA Starters have gotten more efficient at shooting 3-pointers in the 2015-2019 era alongside the growing 3-pointers attempted stat.

In [15]:
sns.violinplot(x='Year Short',y='3P%', data=final_player_winpct_df)
plt.title("3-Pointers Make % Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointer Make %")

[Figure: violin plots of 3P% by season. Title: "3-Pointers Make % Over Time"; x-axis: "Year (2000-2019)"; y-axis: "3-Pointer Make %"]

The outliers in these violins make it harder to make out a linear trend. The outliers of players with near-100% and near-0% efficiency is caused by starters who take very few 3-pointers despite being starters. Some examples of SUPERSTAR-calibar players that could still fall under this outlier category are: Shaquille O'Neal, Dwight Howard, and Ben Simmons. These types of players probably took 1-2 3s an entire season and either made them all (100%) or made none (0%).

However, the median white dots show that median 3-Pt efficiency in 2001 was about 32%, and this figure was close to 38% in 2019. From my domain knowledge, I think this change is pretty significant, especially since the peak theoretical 3-Pt efficiency humanly possible is about 50% in my opinion (which is still really high). We will have to do a formal Linear Regression and t-test to confirm that this is statistically significant.

Over time, the violins become fatter towards the median and form a clear single peak. This indicates that more and more players in the league shoot at a similar 3-Pt efficiency around the median. The violins in the early 2000s, by comparison, are much skinnier and more uniform, meaning 3-Pt efficiencies are all over the place.

I am not sure if "linear" is the best way to characterize these trends, however. We see the white dots go up from 2004-2009, then dip down until 2013, and then consistently go up until 2019. In future studies, it would be helpful to get data from more years (say 1980-2000) to see if these ups & downs are not significant in the overall trends.

## Hypothesis Testing & "Machine Learning"¶

-I will be relying heavily on Linear Regression as my "Machine Learning" tool for helping to test my questions & hypotheses.

-We can use t-tests on the coefficients of the Linear Regression results to test my predictions & questions.

Is there a linear relationship between Year and 3-Pointers Attempted?

Ho (Null Hypothesis): There is no relationship between Year and 3-Pointers Attempted.

or

Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3PA~Year)

Ha: B1 ≠ 0

In [16]:
final_player_winpct_df.columns

Out[16]:
Index(['Year', 'Player', 'Tm', 'start_pct', 'MP', 'PTS', 'TRB', 'AST', 'FGA',
'FG%', '3PA', '3P%', '2PA', '2P%', 'FTA', 'FT%', 'STL', 'BLK', 'TOV',
'PF', 'start_pct', 'win_pct', 'Year Short'],
dtype='object')
In [17]:
# Only using the 'Year' column, so we need to reshape to fit scikit's fit() function
regr_X = np.array(final_player_winpct_df['Year']).reshape(-1,1)
# Response is '3PA', or the average # of 3-pointers attempted Per Game
regr_y = final_player_winpct_df['3PA']

# Building a linear regression model using scikit's sklearn
regr = linear_model.LinearRegression()

# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)

# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)

# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)

# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))

Coefficient of year in our model:  [0.09300827]
Intercept in our model:  -184.53683490923714
R^2 Score:  0.056415071807798034


Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value

In [18]:
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)

# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

summary_est = sm.OLS(summary_y, summary_X)

print(summary_est.fit().summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    3PA   R-squared:                       0.056
Method:                 Least Squares   F-statistic:                     209.7
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           3.29e-46
Time:                        19:57:15   Log-Likelihood:                -7546.5
No. Observations:                3510   AIC:                         1.510e+04
Df Residuals:                    3508   BIC:                         1.511e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant    -184.5368     12.909    -14.295      0.000    -209.848    -159.226
Year           0.0930      0.006     14.482      0.000       0.080       0.106
==============================================================================
Omnibus:                      167.273   Durbin-Watson:                   2.112
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              163.733
Skew:                           0.484   Prob(JB):                     2.79e-36
Kurtosis:                       2.574   Cond. No.                     7.40e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.4e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
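The role of the column of 1s mentioned in the code comments can be illustrated with a small NumPy sketch. It uses made-up, noise-free data (not the NBA data) and plain least squares; all variable names here are illustrative assumptions.

```python
import numpy as np

# Toy data that follows y = 2 + 0.5 * x exactly (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 0.5 * x

# Without a column of 1s, the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of 1s, the model can also estimate an intercept
X_with_const = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
intercept, slope = coefs

print("Slope without constant:", slope_only[0])
print("Intercept and slope with constant:", intercept, slope)
```

Only the second fit recovers the true intercept and slope; the first one distorts the slope to compensate for the missing constant term.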


Based on the results of fitting our Linear Regression model above, I reject the null hypothesis at the alpha=0.05 significance level (95% confidence, two-tailed t-test). The t-value for our predictor Year is 14.482, which is much greater than the t-critical value (1.96) for this significance level (t-critical values found here: https://www.stat.colostate.edu/inmem/gumina/st201/pdf/Utts-Heckard_t-Table.pdf). Additionally, the p-value for Year is reported as 0.000 (that is, below 0.0005), and the 95% Confidence Interval of [0.080, 0.106] does not contain 0; both also point to rejecting the null hypothesis. Note, however, that the R-squared of 0.056 means Year alone explains only about 6% of the variation in 3PA between individual starters.
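As a rough cross-check of the 1.96 cutoff, here is a minimal sketch using only the Python standard library. It assumes the normal approximation to the t-distribution, which is reasonable with 3508 residual degrees of freedom but is not the exact calculation statsmodels performs; the variable names are my own.

```python
from math import erfc, sqrt
from statistics import NormalDist

t_year = 14.482  # t-statistic for Year, copied from the OLS summary above
alpha = 0.05     # significance level for the two-tailed test

# Two-tailed critical value under the normal approximation
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-tailed p-value; erfc keeps precision far out in the tail
p_value = erfc(abs(t_year) / sqrt(2))

print(f"critical value: {z_crit:.3f}")
print(f"reject H0: {abs(t_year) > z_crit}")
print(f"approximate p-value: {p_value:.2e}")
```

The approximate p-value is tiny but not literally zero, which is why the summary table shows it as 0.000.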

Therefore, since the coefficient of Year is positive, we can interpret it as: "On average, a starter's 3-point shot attempts per game increase by 0.093 every year in the NBA from 2000-2019."

Therefore, our first hypothesis turned out to be true for our sample (only starters, only 2000-2019) !!!
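To make the size of this trend concrete, the fitted line can be evaluated at the first and last seasons in the data. This is a small sketch that plugs the printed sklearn estimates into the regression equation; the function name is my own, and the results are only as precise as the printed coefficient.

```python
# Estimates copied from the sklearn output above
intercept = -184.53683490923714
coef_year = 0.09300827

def predicted_3pa(year):
    """Fitted 3PA per game for a starter in the given season."""
    return intercept + coef_year * year

pred_2001 = predicted_3pa(2001)
pred_2019 = predicted_3pa(2019)

print(f"Fitted 3PA in 2001: {pred_2001:.2f}")
print(f"Fitted 3PA in 2019: {pred_2019:.2f}")
print(f"Change over 18 seasons: {pred_2019 - pred_2001:.2f}")
```

The fitted values are averages, so they will not match the violin medians exactly, but they describe the same near-doubling of attempts.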

Repeat the same as above to test whether there is a linear relationship between Year and 3-Point Shot Efficiency:

Ho (Null Hypothesis): There is no relationship between Year and 3-Pointer Efficiency.

or

Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3P%~Year)

Ha: B1 ≠ 0

In [19]:
# Get columns for the regression
regr_Xy = final_player_winpct_df[['Year', '3P%']]
# Drop all rows with NaN for players that never took any 3-point shots
regr_Xy = regr_Xy[regr_Xy['3P%'].isnull() == False]

# Only using the 'Year' column, so we need to reshape to fit scikit's fit() function
regr_X = np.array(regr_Xy['Year']).reshape(-1,1)

# Response is '3P%', or the season 3-Pointer Make %
regr_y = regr_Xy['3P%']

# Building a linear regression model using scikit's sklearn
regr = linear_model.LinearRegression()

# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)

# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)

# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)

# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))

Coefficient of year in our model:  [0.00299182]
Intercept in our model:  -5.716974555534372
R^2 Score:  0.012910979284530444


Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value

In [20]:
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)

# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

summary_est = sm.OLS(summary_y, summary_X)

print(summary_est.fit().summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    3P%   R-squared:                       0.013
Method:                 Least Squares   F-statistic:                     41.86
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           1.13e-10
Time:                        19:57:15   Log-Likelihood:                 1677.0
No. Observations:                3202   AIC:                            -3350.
Df Residuals:                    3200   BIC:                            -3338.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant      -5.7170      0.930     -6.150      0.000      -7.540      -3.894
Year           0.0030      0.000      6.470      0.000       0.002       0.004
==============================================================================
Omnibus:                      285.721   Durbin-Watson:                   2.108
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1382.180
Skew:                          -0.289   Prob(JB):                    7.30e-301
Kurtosis:                       6.166   Cond. No.                     7.38e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.


I reject the null hypothesis at a 95% confidence level (alpha=0.05, a two-tailed t-test) since we have a p-value of 0.000 for our predictor Year and 0 is not within our 95% Confidence Interval of [0.002, 0.004].

Therefore, since the coefficient of Year is positive, we can interpret it as: "On average, 3-point shot accuracy increases by 0.003 (or 0.3 percentage points) every year in the NBA from 2000-2019". As mentioned before, this yearly increase of 0.3 percentage points is HUGE in the context of this domain: it adds up to roughly 5 percentage points over the period studied, and even an overall 3-pt shooting efficiency of 40% is considered GREAT by today's standards.

Therefore, our second hypothesis turned out to be true for our sample (only starters, only 2000-2019) !!! Starters in the NBA are starting to get more efficient at shooting 3s alongside attempting more of them.
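Because 3P% is stored as a fraction, it is easy to mix up percentage points with relative percent change. The sketch below plugs the printed sklearn estimates into the fitted line for the 2001 and 2019 seasons and reports both; the variable names are illustrative.

```python
# Estimates copied from the sklearn output above
intercept = -5.716974555534372
coef_year = 0.00299182

fitted_2001 = intercept + coef_year * 2001
fitted_2019 = intercept + coef_year * 2019

# Absolute change, in percentage points
point_change = (fitted_2019 - fitted_2001) * 100

# Relative change, in percent of the 2001 level
relative_change = (fitted_2019 / fitted_2001 - 1) * 100

print(f"Fitted 3P% in 2001: {fitted_2001:.3f}")
print(f"Fitted 3P% in 2019: {fitted_2019:.3f}")
print(f"Change: {point_change:.1f} percentage points")
print(f"Relative change: {relative_change:.0f}%")
```

The fitted means sit below the violin medians, presumably because the near-0% outliers pull the mean down, but the size of the change is similar.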

Now, we will run a Multiple Linear Regression in each of the 4 eras to see if 3PA & 3P% are still significant predictors of Team Win% when considering all of the other player statistics:

Basically, we want to see which player stats lead to a higher win-pct, and whether 3-point efficiency is a significant factor in that equation.

We need to do a bit of data preparation first. We will first need to separate our total dataset into 4 smaller datasets, one for each era. We will also make some key transformations and standardizations to the stats, and trim down which stats we focus on. As before, we will walk through all of the transformations with era1 first as an example, and then do it all at once for all 4 eras in a for-loop.

First, let's filter final_player_winpct_df to just era 1 (2000-2004)

In [21]:
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]
# print the unique years seen in this df to make sure we only use the years we intended
print(era1_end_2004['Year'].unique())

[2001 2002 2003 2004]


We don't care about Year, Year Short, Player (the player's name), Tm, or start_pct, since these don't help us answer which player metrics lead to more wins.

In [22]:
era1_end_2004 = era1_end_2004.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)
era1_end_2004.head()

Out[22]:
MP PTS TRB AST FGA FG% 3PA 3P% 2PA 2P% FTA FT% STL BLK TOV PF win_pct
0 40.0 20.5 9.1 3.1 15.8 0.472 0.8 0.188 15.0 0.487 6.6 0.834 1.1 1.0 2.9 2.9 0.280488
1 38.9 15.9 3.7 8.4 14.1 0.454 3.5 0.379 10.6 0.478 2.3 0.761 1.3 0.1 3.0 1.8 0.280488
2 37.4 16.3 3.3 3.3 14.6 0.417 3.3 0.374 11.3 0.429 3.9 0.763 0.9 0.4 2.3 3.0 0.280488
3 28.8 10.9 6.6 0.8 8.8 0.466 0.1 0.000 8.7 0.470 3.5 0.779 0.4 0.6 2.4 3.1 0.280488
4 24.4 8.3 6.0 1.1 7.4 0.460 0.1 0.250 7.3 0.462 1.9 0.796 0.6 0.7 1.2 3.2 0.280488

We will standardize tally stats like PTS, TRB (total rebounds), AST (assists), STL (steals), BLK (blocks), TOV (turnovers), and PF (personal fouls) by dividing them by the minutes played (or MP). This will help show how efficient each player is in each of these stats.

We don't want our Linear Regression model to just point out obvious facts like "scoring more points leads to more victories", so we will use efficiency metrics as our predictors instead.

Drop the MP column after this standardization, as it no longer serves any purpose.

Rename the columns we just standardized to indicate that these are now efficiency metrics. Ex: PTS --> efPTS.

In [23]:
# Standardize by converting these to efficiency metrics
era1_end_2004['PTS'] = era1_end_2004['PTS'] / era1_end_2004['MP']
era1_end_2004['TRB'] = era1_end_2004['TRB'] / era1_end_2004['MP']
era1_end_2004['AST'] = era1_end_2004['AST'] / era1_end_2004['MP']
era1_end_2004['STL'] = era1_end_2004['STL'] / era1_end_2004['MP']
era1_end_2004['BLK'] = era1_end_2004['BLK'] / era1_end_2004['MP']
era1_end_2004['TOV'] = era1_end_2004['TOV'] / era1_end_2004['MP']
era1_end_2004['PF'] = era1_end_2004['PF'] / era1_end_2004['MP']

# Drop MP column
era1_end_2004 = era1_end_2004.drop(labels=['MP'], axis=1)

# Rename columns
new_names = {'PTS': 'efPTS',
'TRB': 'efTRB',
'AST': 'efAST',
'STL': 'efSTL',
'BLK': 'efBLK',
'TOV': 'efTOV',
'PF': 'efPF'}
era1_end_2004 = era1_end_2004.rename(columns=new_names)
era1_end_2004.head()


Out[23]:
efPTS efTRB efAST FGA FG% 3PA 3P% 2PA 2P% FTA FT% efSTL efBLK efTOV efPF win_pct
0 0.512500 0.227500 0.077500 15.8 0.472 0.8 0.188 15.0 0.487 6.6 0.834 0.027500 0.025000 0.072500 0.072500 0.280488
1 0.408740 0.095116 0.215938 14.1 0.454 3.5 0.379 10.6 0.478 2.3 0.761 0.033419 0.002571 0.077121 0.046272 0.280488
2 0.435829 0.088235 0.088235 14.6 0.417 3.3 0.374 11.3 0.429 3.9 0.763 0.024064 0.010695 0.061497 0.080214 0.280488
3 0.378472 0.229167 0.027778 8.8 0.466 0.1 0.000 8.7 0.470 3.5 0.779 0.013889 0.020833 0.083333 0.107639 0.280488
4 0.340164 0.245902 0.045082 7.4 0.460 0.1 0.250 7.3 0.462 1.9 0.796 0.024590 0.028689 0.049180 0.131148 0.280488
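As a quick sanity check on the per-minute conversion, the efPTS values for the first two rows can be recomputed by hand from the MP and PTS values shown in the earlier table. This is plain Python with the numbers copied from the outputs above; the variable names are my own.

```python
# (MP, PTS, efPTS) for rows 0 and 1, copied from the tables above
rows = [
    (40.0, 20.5, 0.512500),
    (38.9, 15.9, 0.408740),
]

# Points per minute played, recomputed from the raw per-game stats
recomputed = [pts / mp for mp, pts, _ in rows]

for (mp, pts, ef_pts), value in zip(rows, recomputed):
    print(f"PTS={pts}, MP={mp} -> {value:.6f} (table shows {ef_pts:.6f})")
```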

Drop the FGA and FG% columns, as those columns can be derived from 3PA, 3P%, 2PA, and 2P%. FGA and FG% are the attempts and efficiency metrics for "Field Goals", which is just a term for all non-Free Throw shots (3-pointers and 2-pointers combined).

Keeping these columns could cause severe multicollinearity, which adds noise to our final Linear Regression. Read more about multicollinearity here: https://www.statisticshowto.com/multicollinearity/. Because FGA and FG% can be computed from the other predictor variables (for example, FGA = 3PA + 2PA), they are the worst case scenario for multicollinearity: the relationship is exact and direct, not just a statistical correlation.

For the purposes of our study, we would also prefer to isolate 3PA and 3P% as much as possible, and remove predictors like FGA and FG% that incorporate 3-pointer metrics in their derivation.
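The exact relationship can be checked against the five rows printed earlier (the table shown right after dropping the identifier columns). The sketch below rebuilds FGA and FG% from the 3-pointer and 2-pointer columns; the values are copied from that table and the small differences in FG% come from rounding in the per-game averages.

```python
# (3PA, 3P%, 2PA, 2P%, FGA, FG%) for rows 0-4, copied from the table above
rows = [
    (0.8, 0.188, 15.0, 0.487, 15.8, 0.472),
    (3.5, 0.379, 10.6, 0.478, 14.1, 0.454),
    (3.3, 0.374, 11.3, 0.429, 14.6, 0.417),
    (0.1, 0.000, 8.7, 0.470, 8.8, 0.466),
    (0.1, 0.250, 7.3, 0.462, 7.4, 0.460),
]

rebuilt = []
for pa3, pct3, pa2, pct2, fga, fg_pct in rows:
    # Attempts simply add up
    fga_rebuilt = pa3 + pa2
    # Makes = attempts * make %, so FG% is an attempt-weighted average
    fg_pct_rebuilt = (pa3 * pct3 + pa2 * pct2) / fga_rebuilt
    rebuilt.append((fga_rebuilt, fg_pct_rebuilt))
    print(f"FGA {fga} vs {fga_rebuilt:.1f} | FG% {fg_pct} vs {fg_pct_rebuilt:.3f}")
```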

In [24]:
era1_end_2004 = era1_end_2004.drop(labels=['FGA', 'FG%'], axis=1)
era1_end_2004.head()

Out[24]:
efPTS efTRB efAST 3PA 3P% 2PA 2P% FTA FT% efSTL efBLK efTOV efPF win_pct
0 0.512500 0.227500 0.077500 0.8 0.188 15.0 0.487 6.6 0.834 0.027500 0.025000 0.072500 0.072500 0.280488
1 0.408740 0.095116 0.215938 3.5 0.379 10.6 0.478 2.3 0.761 0.033419 0.002571 0.077121 0.046272 0.280488
2 0.435829 0.088235 0.088235 3.3 0.374 11.3 0.429 3.9 0.763 0.024064 0.010695 0.061497 0.080214 0.280488
3 0.378472 0.229167 0.027778 0.1 0.000 8.7 0.470 3.5 0.779 0.013889 0.020833 0.083333 0.107639 0.280488
4 0.340164 0.245902 0.045082 0.1 0.250 7.3 0.462 1.9 0.796 0.024590 0.028689 0.049180 0.131148 0.280488

Drop 3PA, 2PA, and FTA since we really only care about efficiency. The number of 3-pointers, 2-pointers, and Free Throws a player attempts depends heavily on the number of minutes that player is allowed to play, so we want to remove this factor to standardize things.

We still have the % (efficiency) columns for these 3 metrics.

In [25]:
era1_end_2004 = era1_end_2004.drop(labels=['3PA', '2PA', 'FTA'], axis=1)
era1_end_2004.head()

Out[25]:
efPTS efTRB efAST 3P% 2P% FT% efSTL efBLK efTOV efPF win_pct
0 0.512500 0.227500 0.077500 0.188 0.487 0.834 0.027500 0.025000 0.072500 0.072500 0.280488
1 0.408740 0.095116 0.215938 0.379 0.478 0.761 0.033419 0.002571 0.077121 0.046272 0.280488
2 0.435829 0.088235 0.088235 0.374 0.429 0.763 0.024064 0.010695 0.061497 0.080214 0.280488
3 0.378472 0.229167 0.027778 0.000 0.470 0.779 0.013889 0.020833 0.083333 0.107639 0.280488
4 0.340164 0.245902 0.045082 0.250 0.462 0.796 0.024590 0.028689 0.049180 0.131148 0.280488

Last bit of Data Prep:

Drop all rows with NaN for 3P%, 2P%, and FT% (players that never took any 3-pt shots, 2-pt shots, or free throws).

In [26]:
print("Before: ", era1_end_2004.shape)
era1_end_2004 = era1_end_2004[era1_end_2004['3P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['2P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['FT%'].isnull() == False]
print("After: ", era1_end_2004.shape)

Before:  (707, 11)
After:  (627, 11)
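The three chained filters above can also be written as a single dropna call. Here is a minimal sketch on a made-up toy DataFrame (not the NBA data) showing that the two approaches keep the same rows; the toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with missing shooting percentages (illustrative values only)
toy = pd.DataFrame({
    "3P%": [0.35, np.nan, 0.40, 0.33],
    "2P%": [0.48, 0.51, np.nan, 0.45],
    "FT%": [0.80, 0.75, 0.70, 0.85],
})

# Approach used above: one boolean filter per column
chained = toy[toy["3P%"].isnull() == False]
chained = chained[chained["2P%"].isnull() == False]
chained = chained[chained["FT%"].isnull() == False]

# Equivalent single call: drop rows with NaN in any of the listed columns
dropped = toy.dropna(subset=["3P%", "2P%", "FT%"])

print(dropped)
print("Same result:", chained.equals(dropped))
```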


Now, we're ready to run our Multiple Linear Regression with win_pct as the response, and all other columns above as the predictors.

In [27]:
# Select all columns except win_pct, which is our response
regr_X = era1_end_2004.loc[:, era1_end_2004.columns != 'win_pct']

# Response is win_pct
regr_y = era1_end_2004['win_pct']

# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)

# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns =  ['Constant'] + list(era1_end_2004.columns[:-1])
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

summary_est = sm.OLS(summary_y, summary_X)

print(summary_est.fit().summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                win_pct   R-squared:                       0.183
Method:                 Least Squares   F-statistic:                     13.81
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           3.99e-22
Time:                        19:57:15   Log-Likelihood:                 379.33
No. Observations:                 627   AIC:                            -736.7
Df Residuals:                     616   BIC:                            -687.8
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant       0.0599      0.091      0.658      0.511      -0.119       0.239
efPTS          0.1887      0.067      2.815      0.005       0.057       0.320
efTRB          0.1524      0.119      1.281      0.201      -0.081       0.386
efAST          0.7478      0.146      5.111      0.000       0.460       1.035
3P%            0.0096      0.043      0.224      0.823      -0.074       0.093
2P%            0.9283      0.144      6.466      0.000       0.646       1.210
FT%           -0.0370      0.072     -0.514      0.608      -0.178       0.104
efSTL          0.6727      0.488      1.379      0.168      -0.285       1.631
efBLK          0.4282      0.411      1.042      0.298      -0.379       1.236
efTOV         -2.8124      0.450     -6.251      0.000      -3.696      -1.929
efPF          -0.1902      0.283     -0.672      0.502      -0.746       0.366
==============================================================================
Omnibus:                       22.496   Durbin-Watson:                   0.682
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               15.251
Skew:                          -0.259   Prob(JB):                     0.000488
Kurtosis:                       2.438   Cond. No.                         135.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



We will analyze these results later!

Now, we will package the data prep and modeling code we just did in our example into convenient functions that will allow us to run this Linear Regression Model across all eras.

In [28]:
def dataprep_linreg(era):
    # Work on a copy so the caller's DataFrame is left untouched
    era_local = era.copy()

    # Drop useless columns
    era_local = era_local.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)

    # Standardize by converting these to efficiency metrics
    era_local['PTS'] = era_local['PTS'] / era_local['MP']
    era_local['TRB'] = era_local['TRB'] / era_local['MP']
    era_local['AST'] = era_local['AST'] / era_local['MP']
    era_local['STL'] = era_local['STL'] / era_local['MP']
    era_local['BLK'] = era_local['BLK'] / era_local['MP']
    era_local['TOV'] = era_local['TOV'] / era_local['MP']
    era_local['PF'] = era_local['PF'] / era_local['MP']

    # Drop MP column
    era_local = era_local.drop(labels=['MP'], axis=1)

    # Rename columns
    new_names = {'PTS': 'efPTS',
                 'TRB': 'efTRB',
                 'AST': 'efAST',
                 'STL': 'efSTL',
                 'BLK': 'efBLK',
                 'TOV': 'efTOV',
                 'PF': 'efPF'}
    era_local = era_local.rename(columns=new_names)

    # Drop FG-related columns to avoid multicollinearity
    era_local = era_local.drop(labels=['FGA', 'FG%'], axis=1)

    # Drop Attempt columns, as we care mostly about efficiency
    era_local = era_local.drop(labels=['3PA', '2PA', 'FTA'], axis=1)

    # Remove NaN values in columns where players took no attempts
    era_local = era_local[era_local['3P%'].isnull() == False]
    era_local = era_local[era_local['2P%'].isnull() == False]
    era_local = era_local[era_local['FT%'].isnull() == False]

    return pd.DataFrame(era_local)

In [29]:
def runlinreg(era):
    # Select all columns except win_pct, which is our response
    regr_X = era.loc[:, era.columns != 'win_pct']

    # Response is win_pct
    regr_y = era['win_pct']

    # Need to add a column of 1s to create a constant term
    # statsmodels.api does not do it for us like sklearn does
    summary_X = sm.add_constant(regr_X)

    # Make into dataframes to make sure variable names are shown in output
    summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
    summary_X.columns = ['Constant'] + list(era.columns[:-1])
    summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

    summary_est = sm.OLS(summary_y, summary_X)

    print(summary_est.fit().summary())


Finally, let's run the Linear Regression on all eras and see the output!

In [30]:
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]

# The 2005-2009 era
era2_end_2009 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2005,2006,2007,2008, 2009])]

# The 2010-2014 era
era3_end_2014 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2010,2011,2012,2013,2014])]

# The 2015-2019 era
era4_end_2019 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2015,2016,2017,2018,2019])]
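The era boundaries used in the filters above can also be kept in one place, which makes it easier to check that every season lands in exactly one era. This is only a sketch of that idea in plain Python; ERA_BOUNDS and era_for_year are hypothetical names, not part of the notebook.

```python
# Inclusive (first season, last season) per era, matching the filters above.
# Era 1 starts at 2001 because that is the first season in the filter.
ERA_BOUNDS = {
    "Era 1": (2001, 2004),
    "Era 2": (2005, 2009),
    "Era 3": (2010, 2014),
    "Era 4": (2015, 2019),
}

def era_for_year(year):
    """Return the era label for a season, or None if it is out of range."""
    for name, (start, end) in ERA_BOUNDS.items():
        if start <= year <= end:
            return name
    return None

# Every season from 2001 to 2019 should map to an era
labels = {year: era_for_year(year) for year in range(2001, 2020)}
print(labels[2004], labels[2005], labels[2019])
```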

In [31]:
# Put all era's dfs in a list
eras = [era1_end_2004, era2_end_2009, era3_end_2014, era4_end_2019]

for idx, era in enumerate(eras):
    era = dataprep_linreg(era)

    print("ERA", idx+1, "LINEAR REGRESSION RESULTS: ")
    runlinreg(era)
    print()

ERA 1 LINEAR REGRESSION RESULTS:
OLS Regression Results
==============================================================================
Dep. Variable:                win_pct   R-squared:                       0.183
Method:                 Least Squares   F-statistic:                     13.81
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           3.99e-22
Time:                        19:57:15   Log-Likelihood:                 379.33
No. Observations:                 627   AIC:                            -736.7
Df Residuals:                     616   BIC:                            -687.8
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant       0.0599      0.091      0.658      0.511      -0.119       0.239
efPTS          0.1887      0.067      2.815      0.005       0.057       0.320
efTRB          0.1524      0.119      1.281      0.201      -0.081       0.386
efAST          0.7478      0.146      5.111      0.000       0.460       1.035
3P%            0.0096      0.043      0.224      0.823      -0.074       0.093
2P%            0.9283      0.144      6.466      0.000       0.646       1.210
FT%           -0.0370      0.072     -0.514      0.608      -0.178       0.104
efSTL          0.6727      0.488      1.379      0.168      -0.285       1.631
efBLK          0.4282      0.411      1.042      0.298      -0.379       1.236
efTOV         -2.8124      0.450     -6.251      0.000      -3.696      -1.929
efPF          -0.1902      0.283     -0.672      0.502      -0.746       0.366
==============================================================================
Omnibus:                       22.496   Durbin-Watson:                   0.682
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               15.251
Skew:                          -0.259   Prob(JB):                     0.000488
Kurtosis:                       2.438   Cond. No.                         135.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ERA 2 LINEAR REGRESSION RESULTS:
OLS Regression Results
==============================================================================
Dep. Variable:                win_pct   R-squared:                       0.166
Method:                 Least Squares   F-statistic:                     16.33
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           4.32e-27
Time:                        19:57:15   Log-Likelihood:                 465.86
No. Observations:                 833   AIC:                            -909.7
Df Residuals:                     822   BIC:                            -857.7
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant       0.0885      0.080      1.106      0.269      -0.069       0.246
efPTS          0.1833      0.057      3.240      0.001       0.072       0.294
efTRB          0.0704      0.102      0.688      0.491      -0.130       0.271
efAST          0.7331      0.129      5.687      0.000       0.480       0.986
3P%            0.0619      0.037      1.658      0.098      -0.011       0.135
2P%            0.8890      0.112      7.965      0.000       0.670       1.108
FT%           -0.0530      0.066     -0.807      0.420      -0.182       0.076
efSTL          0.3542      0.457      0.775      0.439      -0.543       1.252
efBLK          0.9185      0.351      2.620      0.009       0.230       1.607
efTOV         -3.0317      0.404     -7.502      0.000      -3.825      -2.238
efPF          -0.1680      0.247     -0.681      0.496      -0.652       0.316
==============================================================================
Omnibus:                        8.185   Durbin-Watson:                   0.598
Prob(Omnibus):                  0.017   Jarque-Bera (JB):                5.651
Skew:                          -0.040   Prob(JB):                       0.0593
Kurtosis:                       2.604   Cond. No.                         139.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ERA 3 LINEAR REGRESSION RESULTS:
OLS Regression Results
==============================================================================
Dep. Variable:                win_pct   R-squared:                       0.113
Method:                 Least Squares   F-statistic:                     10.60
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           4.90e-17
Time:                        19:57:15   Log-Likelihood:                 420.87
No. Observations:                 844   AIC:                            -819.7
Df Residuals:                     833   BIC:                            -767.6
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant       0.0739      0.084      0.876      0.381      -0.092       0.239
efPTS          0.1381      0.063      2.203      0.028       0.015       0.261
efTRB          0.0224      0.107      0.210      0.834      -0.187       0.232
efAST          0.3370      0.141      2.395      0.017       0.061       0.613
3P%            0.1328      0.043      3.096      0.002       0.049       0.217
2P%            0.6828      0.121      5.636      0.000       0.445       0.921
FT%            0.0294      0.069      0.428      0.669      -0.105       0.164
efSTL          1.1579      0.483      2.396      0.017       0.209       2.106
efBLK          1.4675      0.422      3.478      0.001       0.639       2.296
efTOV         -1.4354      0.437     -3.281      0.001      -2.294      -0.577
efPF          -0.6758      0.282     -2.396      0.017      -1.229      -0.122
==============================================================================
Omnibus:                       46.129   Durbin-Watson:                   0.568
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.473
Skew:                          -0.058   Prob(JB):                     9.74e-05
Kurtosis:                       2.284   Cond. No.                         143.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ERA 4 LINEAR REGRESSION RESULTS:
OLS Regression Results
==============================================================================
Dep. Variable:                win_pct   R-squared:                       0.162
Method:                 Least Squares   F-statistic:                     17.03
Date:                Tue, 15 Dec 2020   Prob (F-statistic):           1.81e-28
Time:                        19:57:15   Log-Likelihood:                 497.24
No. Observations:                 893   AIC:                            -972.5
Df Residuals:                     882   BIC:                            -919.7
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Constant       0.0202      0.071      0.285      0.776      -0.119       0.159
efPTS          0.1270      0.056      2.275      0.023       0.017       0.237
efTRB          0.1147      0.085      1.350      0.177      -0.052       0.281
efAST          0.7271      0.129      5.630      0.000       0.474       0.981
3P%            0.1296      0.045      2.856      0.004       0.041       0.219
2P%            0.7051      0.091      7.715      0.000       0.526       0.884
FT%            0.0551      0.059      0.931      0.352      -0.061       0.171
efSTL          0.5358      0.402      1.334      0.183      -0.253       1.324
efBLK          0.9468      0.392      2.417      0.016       0.178       1.716
efTOV         -2.1727      0.397     -5.470      0.000      -2.952      -1.393
efPF          -0.3740      0.277     -1.353      0.177      -0.917       0.169
==============================================================================
Omnibus:                       15.168   Durbin-Watson:                   0.595
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                9.989
Skew:                          -0.115   Prob(JB):                      0.00677
Kurtosis:                       2.536   Cond. No.                         137.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



Results

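Using the Hypothesis Testing methodology from before, I identified the predictors for which we reject the null hypothesis (that the predictor's coefficient is zero), meaning they are significant in predicting win_pct when holding all other predictors constant. I did this by selecting the predictors in each era's Linear Regression summary that had a p-value < 0.05. Note that a p-value shown as 0.000 has been rounded to three decimals, so it means p < 0.0005 rather than exactly zero: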

Era 1 (2000-2004):

efPTS (p-val = 0.005)

efAST (p-val = 0.000)

2P% (p-val = 0.000)

efTOV (p-val = 0.000)

Era 2 (2005-2009):

efPTS (p-val = 0.001)

efAST (p-val = 0.000)

2P% (p-val = 0.000)

efBLK (p-val = 0.009)

efTOV (p-val = 0.000)

Era 3 (2010-2014):

efPTS (p-val = 0.028)

efAST (p-val = 0.017)

3P% (p-val = 0.002)

2P% (p-val = 0.000)

efSTL (p-val = 0.017)

efBLK (p-val = 0.001)

efTOV (p-val = 0.001)

efPF (p-val = 0.017)

Era 4 (2015-2019):

efPTS (p-val = 0.023)

efAST (p-val = 0.000)

3P% (p-val = 0.004)

2P% (p-val = 0.000)

efBLK (p-val = 0.016)

efTOV (p-val = 0.000)
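
The selection rule used for the lists above can be sketched in a few lines. This is a minimal illustration, not the code from the original notebook: the p-values are typed in by hand from the Era 4 summary, and the helper name `significant_predictors` is hypothetical. With a fitted statsmodels model, the same values would be available from the results object's `pvalues` attribute.

```python
# p-values copied by hand from the Era 4 (2015-2019) regression summary above
era4_pvalues = {
    "Constant": 0.776, "efPTS": 0.023, "efTRB": 0.177, "efAST": 0.000,
    "3P%": 0.004, "2P%": 0.000, "FT%": 0.352, "efSTL": 0.183,
    "efBLK": 0.016, "efTOV": 0.000, "efPF": 0.177,
}

def significant_predictors(pvalues, alpha=0.05, intercept="Constant"):
    """Return predictor names with p-value below alpha, skipping the intercept."""
    return [name for name, p in pvalues.items()
            if name != intercept and p < alpha]

era4_significant = significant_predictors(era4_pvalues)
print(era4_significant)
# ['efPTS', 'efAST', '3P%', '2P%', 'efBLK', 'efTOV']
```

The result matches the Era 4 list above.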

Conclusion Preamble

Before addressing our questions from the beginning of this study, let's discuss how to interpret and compare the Linear Regression results from the 4 eras. It is hard to use the raw coefficients to make comparisons because, although all of the predictors are efficiency metrics, the 3P%, 2P%, and FT% predictors use shots taken as the standardizer, while the other predictors use minutes played. Additionally, these coefficients are harder to interpret than the one in our Simple Linear Regression with just one predictor. This is because all of these predictors are in the range 0-1 (0%-100%), so our standard tactic of describing the change in the response after a 1-unit increase in a predictor does not make sense: a 1-unit increase is a whole 100 percentage points.
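
To make the "1 unit is a whole 100%" problem concrete, here is a small worked example that rescales a coefficient to a 1-percentage-point change instead. It is only a sketch: it uses the Era 4 coefficient for 3P% from the summary above, assumes the linear model holds with all other predictors fixed, and assumes an 82-game regular season when converting win_pct to wins. The function name is hypothetical.

```python
# Era 4 (2015-2019) coefficient for 3P%, copied from the summary above
coef_3p_pct = 0.1296

# Assumption: a full 82-game regular season
GAMES_PER_SEASON = 82

def effect_of_point_change(coef, percentage_points=1.0):
    """Predicted change in win_pct for a predictor change given in percentage points."""
    return coef * (percentage_points / 100.0)

delta_win_pct = effect_of_point_change(coef_3p_pct)
delta_wins = delta_win_pct * GAMES_PER_SEASON

print(f"+1 percentage point of 3P% -> {delta_win_pct:+.4f} win_pct")
print(f"which is about {delta_wins:.2f} wins over {GAMES_PER_SEASON} games")
```

Under these assumptions, one extra percentage point of 3P% for a Starter corresponds to roughly 0.0013 of win_pct, or about a tenth of a win per season.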

Let's keep things simple and use the sign of each coefficient (positive or negative) and the size of its p-value (which is derived from the t-statistic) to make comparisons and observations. A smaller p-value indicates STRONGER evidence against the null hypothesis, so within a given era, the predictors with smaller p-values are the ones we can be more confident are truly significant. Read more about the exact definition and interpretation of p-values here: https://www.statsdirect.com/help/basics/p_values.htm.
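
As a rough check on how the p-value follows from the t-statistic, the sketch below recomputes two of the Era 4 p-values. It is an approximation, not what statsmodels does internally: statsmodels uses the t distribution with the residual degrees of freedom (882 here), while this example substitutes the standard normal distribution, which is very close at that many degrees of freedom. The function name is hypothetical.

```python
import math

def two_sided_p_from_t(t_stat):
    """Two-sided p-value, using a standard normal approximation to the t distribution."""
    # P(|Z| > |t|) = erfc(|t| / sqrt(2)) for a standard normal Z
    return math.erfc(abs(t_stat) / math.sqrt(2))

# t-statistics copied from the Era 4 summary above
t_3p_pct = 2.856
t_efpts = 2.275

p_3p_pct = two_sided_p_from_t(t_3p_pct)
p_efpts = two_sided_p_from_t(t_efpts)

print(f"3P%:   t = {t_3p_pct}, approx p = {p_3p_pct:.3f}")
print(f"efPTS: t = {t_efpts}, approx p = {p_efpts:.3f}")
```

The larger t-statistic gives the smaller p-value, which is why ranking predictors by p-value is equivalent to ranking them by the absolute value of their t-statistic within one regression.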

Also, it is important to remember that the efPTS predictor is not highly correlated with 3P%, 2P%, and FT%, since the former is based on minutes played and the latter are based on shot attempts. Obviously, minutes played and shot attempts are themselves related, but overall this is not as big of an issue as having predictors that are directly derivable from other predictors.

CONCLUSION

My hypothesis that 3-point efficiency would be a significant predictor of success in ALL ERAS was WRONG! As per the results above, 3-point efficiency only became a significant predictor in eras 3 & 4 (2010-2019). I think this result is indicative of the growing power of the 3-point shot. In eras 1 & 2 (2000-2009), the list of significant predictors was shorter than in later eras, and was largely made up of predictors that relate strictly to putting the ball in the hoop to score more points and win more games (points scoring efficiency, 2-pointer scoring efficiency, assists efficiency). In the more recent eras, however, 3-point efficiency became significant in its own right, even alongside these staple predictors.

My other hypothesis, regarding the extent to which 3-point efficiency DOMINATES other predictors, is hard to evaluate with this approach. In the later eras (3 & 4), 3P% does have a lower p-value than some of the other significant predictors (efPTS in both eras, for example), but not lower than 2P% or efTOV, and I don't feel comfortable using p-values as a definitive measure of dominance anyway. However, I can at least see that 3-point efficiency has become MORE dominant/important as time goes on, since it only became a significant predictor in eras 3 & 4.

Things to Improve On

We need to find a better way to represent our predictors so that interpretation and comparisons are easier across eras.

Alternatively, we could find some other regression method that gives us some output statistics on each predictor that can be used universally to compare with the output of other eras.
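
One possible direction, which was not done in this study, is to report standardized coefficients: each coefficient is rescaled by the ratio of the predictor's standard deviation to the response's standard deviation, so it reads as "standard deviations of win_pct per standard deviation of the predictor" regardless of the predictor's original units. The sketch below is only an illustration. The function name is hypothetical, and the three data points are made-up toy values chosen so the result is easy to verify by hand. They are not NBA data.

```python
from statistics import stdev

def standardized_coef(coef, x_values, y_values):
    """Rescale a raw regression coefficient to standard-deviation units."""
    return coef * stdev(x_values) / stdev(y_values)

# Toy values only (NOT from the dataset): points on the line y = 2.0 * x - 0.2
toy_3p_pct = [0.30, 0.35, 0.40]
toy_win_pct = [0.40, 0.50, 0.60]
toy_coef = 2.0  # raw slope of the toy line

toy_beta = standardized_coef(toy_coef, toy_3p_pct, toy_win_pct)
print(f"standardized coefficient: {toy_beta:.2f}")
```

Because the toy points lie exactly on a line, the standardized coefficient comes out as 1, the same as the correlation between the two toy columns. With real data and several predictors, the values would differ per predictor and could be compared across eras on a common scale.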

It would also help to get more domain knowledge on applying Data Science to Basketball Statistics. The following YouTube video and channel dive into the intersection of basketball and analytics, including some mistakes to avoid when using Per-Game statistics. I tried my best to follow these guidelines for this dataset.