by Adithya Solai
CMSC320 Final Tutorial
Section 0101 Dickerson
There are numerous reports and studies about how the NBA has transformed over the past five years with regard to the 3-point shot. The 3-pointer became dominant and popular in the NBA following the success of Stephen Curry and his Golden State Warriors (2015, 2017, & 2018 NBA Champions). Here are some great videos and articles about the "Three Point Revolution":
-How Data Changed the NBA by The Economist (Houston Rockets & Second Spectrum): https://www.youtube.com/watch?v=oUvvfHkXyOA&ab_channel=TheEconomist
-https://fivethirtyeight.com/features/how-mapping-shots-in-the-nba-changed-it-forever/
-https://fivethirtyeight.com/features/basketballs-other-3-point-revolution/
-https://fivethirtyeight.com/features/stephen-curry-is-the-revolution/
-As discussed in the resources above, Data Science & Analytics made coaches, players, and team executives comfortable with adopting the 3-pointer. Now, we will use Data Science to test if those decisions paid off and are responsible for more wins.
-In this study, I will calculate and compare the predictive power of 3-pointer-related Season Average Player Statistics (3P%, 3PA, etc.) in predicting NBA Regular Season team win-pct in different eras of the 21st century.
-The four eras will be 2000-2004, 2005-2009, 2010-2014, and 2015-2019.
-I will only use data from players that are "Starters" because they have an outsized impact on their team's Win% and are given the freedom to take a wider array of shots. I define a "Starter" as a player that starts >= 50% of games in a season for their team, which mirrors how the NBA separates "Starters" from non-"Starters" when awarding the 6th Man of the Year award.
-Confirm the narrative: are players actually attempting more 3-pointers as time goes on? (a Simple Linear Regression: 3FieldGoalAttempted ~ Year)
-Are players becoming more efficient 3-point shooters as time goes on, as a by-product of the global increase in 3-pointers attempted? (a Simple Linear Regression: 3FieldGoal% ~ Year) My rationale: if my first hypothesis holds and more 3-pointers are being attempted overall, then efficient 3-point shooting has likely become more valuable to teams, meaning teams will increasingly choose players with higher 3-point efficiency as their Starters.
-I hypothesize that 3-pt player stats (3-pointers attempted, 3-pointer efficiency, etc) will be statistically significant even when considering all of the other box score player stats for ALL 4 eras. (a Multiple Linear Regression: Win%~All Player Stats)
-Primarily, I am interested to see whether the predictive power of 3pt player stats starts to OUTWEIGH and DOMINATE the other player stats traditionally regarded as having the most predictive power (FG%, TREB, AST, etc) as time goes on.
To show my Data Collection Step-by-Step with relevant output, I will first do a run-thru for just the 2018-2019 NBA Season.
I will also be defining functions along the way that will be used later when doing this process again for all 19 seasons in a for-loop.
import re
import requests
from bs4 import BeautifulSoup, Comment
from os import path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
from scipy import stats
Scrape Regular Season Per-Game Average Stats for all NBA players in 2018-2019 season.
Player Per-Game Season Averages URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
# Get Per-Game season averages for all NBA players from the 2018-2019 season.
player_avgs_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'
def scrape_players(url):
r = requests.get(url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
# Find just the HTML content under the `table` tag, which is where the data is!
player_stats_table = root.find("table").prettify()
# Use pandas's html reader to convert our prettified HTML table into a dataframe.
player_df = pd.read_html(player_stats_table)[0]
# Our df has more rows than the # of players shown on the original basketball-reference website.
# This is caused by some rows that are just copies of the column headers, since these rows help readers
# of the webpage remember what each column is as they scroll down the webpage.
# We can just use some filtering to drop these rows
# This is also caused by players that switched teams mid-season and have more than 1 row of season avg stats.
# For the purposes of our study, we won't merge them into one row, since we can't just collapse their contributions
# into just one team. Our study is about how the player's stats impacted the Win% of their team, so we need to keep
# the player's contributions to different teams as separate.
# For the reason above, we will also eventually drop rows with `TOT` as the team, since such a row combines
# the stats of a player who played for more than one team in one season.
# We will determine whether the player is a starter using a % cutoff (Games Started / Total Games for that team),
# and not a strict # of Games Started cutoff, so that we don't drop starter-caliber players that just happened
# to switch teams mid-season.
return player_df
player_df_2019 = scrape_players(player_avgs_2019_url)
print("player_df shape: ", player_df_2019.shape)
display(player_df_2019)
Scrape NBA Team Standings at the end of the 2018-2019 season.
Team Final Standings URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_standings.html
For some reason, the well-formatted 'Expanded Standings' table from the URL above is stored inside HTML comments in the HTML source code. I had to track down which comment this was and use the fix found here to scrape the data: https://stackoverflow.com/a/52679343
team_standings_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_standings.html'
def scrape_standings(url):
r = requests.get(url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
# Scrape all the content within HTML Comments into a list `comments`
comments = root.find_all(text=lambda text:isinstance(text, Comment))
# By analyzing the list of comments above, I observed that the data we want
# is always contained in index 26 for all 19 NBA seasons we will scrape.
all_standings_comment=comments[26]
# Re-initialize our BeautifulSoup object to only parse and scrape the HTML contents
# stored in the HTML Comment found above
comment_root = BeautifulSoup(all_standings_comment, 'lxml')
# Find all the data under the `table` tag, and prettify it so that it can be parsed
# by pandas' html reader
all_standings_table=comment_root.find("table").prettify()
team_df = pd.read_html(all_standings_table)[0]
# The data originally has a MultiIndex column structure to indicate
# things like "Division"; we don't need that.
team_df.columns=team_df.columns.droplevel()
return team_df
team_df_2019 = scrape_standings(team_standings_2019_url)
display(team_df_2019)
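As an aside, hardcoding comments[26] is brittle if basketball-reference ever reorders its page. A more robust alternative (a sketch, untested against every season, and assuming the table's HTML id is 'expanded_standings', matching the 'Expanded Standings' caption mentioned above) is to scan the comments for the table we want:
# Hypothetical, more robust alternative to hardcoding comments[26]:
# return the first HTML comment containing the expanded standings table.
def find_standings_comment(comments):
    for c in comments:
        if 'expanded_standings' in c:  # assumed table id
            return c
    return None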
Scrape Current-Day Abbreviations for NBA Teams
def scrape_abbrev():
team_abbrev_url='https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations'
r = requests.get(team_abbrev_url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
abbrev_table = root.find("table").prettify()
abbrev_df = pd.read_html(abbrev_table)[0]
# Rename columns, and remove first row (which is just the header from the original data source)
abbrev_df.columns = ["Abbrev", "Franchise"]
abbrev_df = abbrev_df[1:]
display(abbrev_df)
return abbrev_df
abbrev_df = scrape_abbrev()
Like Data Collection, I will be showing this process Step-by-Step for the 2018-2019 season dfs created before, and defining functions along the way that will help when tackling all 19 seasons in a for-loop.
`player_df`:
-Remove column header copy rows. These rows exist on the website to help readers remember what each column means as they scroll down the webpage.
-Remove rows where the `Tm` column value is `TOT`, as these are combined stats for players who played on more than one team in a season (due to trades). We can't keep these, as they won't merge properly with our team standings table, since there is no team with abbreviation `TOT`.
def dataprep_players(player_df):
# remove rows that are just copies of the column headers
# they all have their name in the Player column as 'Player', so we can use this to filter them out
# Drop all rows with Player name = 'Player'
player_df = player_df[player_df['Player'] != 'Player']
# Drop rows with Tm=TOT, since these are combined stats for players who played on
# more than 1 team this season. We can't have this for our final analysis
player_df = player_df[player_df['Tm'] != 'TOT']
return player_df
print('Before Dropping Header Copy Rows & Tm=TOT rows: ', player_df_2019.shape)
player_df_2019 = dataprep_players(player_df_2019)
# Check that the overall # of rows has decreased.
print('After Dropping Header Copy Rows & TOT rows: ', player_df_2019.shape)
`team_df`:
-Drop all columns except `Team` and `Overall`.
-Use regex to create `W` and `L` columns from the values in the `Overall` column. `W` is the # of wins the team had that season, and `L` is the # of losses.
-Use the `W` and `L` columns to construct the `win_pct` column (Formula: Wins/Total Games, or `W`/(`W`+`L`)).
-Using `win_pct` in a season instead of Games Won is a natural standardizer/normalizer across seasons. Some seasons had fewer than the standard 82 games (lockout years, the COVID-19 pandemic, etc.), so Games Won should not be used as our response variable.
def dataprep_teams(team_df, abbrev_df):
# Merge with `abbrev_df` to bring over the abbreviations
team_df = team_df.merge(right=abbrev_df, left_on='Team', right_on='Franchise')
# Drop all columns except Team, Overall record, and Abbrev
team_df = team_df[['Abbrev', 'Team', 'Overall']]
# Create a `W` and `L` column out of the `Overall` column
for row in team_df.iterrows():
# Get the current row's `Overall` column value
curr_overall = row[1]['Overall']
# Extract wins and losses from `Overall` column value, and store the match groups
w_and_l= re.search(r"^(\d{1,2})-(\d{1,2})$", curr_overall).groups()
# use the list of matches to retrieve wins and losses
wins = int(w_and_l[0])
losses = int(w_and_l[1])
# Store these values in new columns `W` and `L`
curr_index = row[0]
team_df.at[curr_index, 'W'] = wins
team_df.at[curr_index, 'L'] = losses
# No longer need `Overall` column
team_df = team_df[['Abbrev', 'Team', 'W', 'L']]
# Create win_pct column using formula: W / (W+L)
team_df['win_pct'] = team_df['W'] / (team_df['W'] + team_df['L'])
return team_df
team_df_2019 = dataprep_teams(team_df_2019, abbrev_df)
display(team_df_2019)
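As a side note, the row-by-row iterrows() loop in dataprep_teams could be replaced by a vectorized one-liner (an equivalent alternative sketched here as comments, since `Overall` has already been dropped at this point):
# Inside dataprep_teams, before dropping `Overall`:
# wl = team_df['Overall'].str.extract(r'^(?P<W>\d{1,2})-(?P<L>\d{1,2})$').astype(int)
# team_df[['W', 'L']] = wl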
Merge `player_df` and `team_df` into `player_winpct_df`:
-With `player_df` as left and `team_df` as right, inner-join on team abbreviation (each season's tables are merged within that season) to create `player_winpct_df`.
-Filter down to only starters from each year. This is to help see better patterns in 3FGA, as non-starters probably won't see a noticeable increase in 3-pointers attempted since they don't get many shots per game to begin with. Use the definition of starter used to determine the 6th Man of the Year award (https://en.wikipedia.org/wiki/NBA_Sixth_Man_of_the_Year_Award): to not be considered a "Starter", you must start less than 50% of games for your team. Therefore, we will define a starter as a player that starts >= 50% of games. We will create a new column `start_pct` by dividing the `GS` column by the `G` column, and use it to filter out rows.
-Assign a new column `Year` for each of the 19 seasons (just a constant 2019 in this example case). This will help with merges and visualizing our data later on.
-Only keep the following columns (to clean up and only keep statistics for our Linear Regressions that are simple and easy to interpret): `Year`, `Player`, `Tm`, `start_pct`, `MP`, `PTS`, `TRB`, `AST`, `FGA`, `FG%`, `3PA`, `3P%`, `2PA`, `2P%`, `FTA`, `FT%`, `STL`, `BLK`, `TOV`, `PF`, `win_pct`.
-Stats like `FG` (Field Goals made), `FT`, `3P`, and `2P` are omitted because we also have the corresponding Attempts and Accuracy (%) stats, so we can always derive "Made" stats like `FG` from those two stats if needed (see the quick illustration below). Including all three could lead to problems and noise in our Linear Regression, since the three stats would be so closely related to each other (they are derived from each other).
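A quick illustration of that derivation, with toy numbers rather than values from the dataset:
# "Made" stats are recoverable as Attempts x Accuracy, e.g. FG = FGA * FG%.
fga, fg_pct = 8.0, 0.45  # hypothetical per-game attempts and accuracy
print(fga * fg_pct)      # 3.6 field goals made per game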
def dataprep_combined(player_df, team_df, year):
# Initialize player_winpct_df by inner-joining on team abbreviation,
# with player_df as left and team_df as right
player_winpct_df = player_df.merge(right=team_df[['Abbrev', 'Team', 'win_pct']], left_on='Tm', right_on='Abbrev')
# Convert `G` and `GS` columns to integers so we can create `start_pct` column
player_winpct_df['G'] = player_winpct_df['G'].astype(int)
player_winpct_df['GS'] = player_winpct_df['GS'].astype(int)
# Create `start_pct` column (`GS` / `G`)
player_winpct_df['start_pct'] = player_winpct_df['GS'] / player_winpct_df['G']
# Only keep rows with start_pct >= 0.500
player_winpct_df = player_winpct_df[player_winpct_df['start_pct'] >= 0.500]
# Add the `Year` columns
player_winpct_df['Year'] = year
# Only keep columns listed above
player_winpct_df = player_winpct_df[["Year", "Player", "Tm", "start_pct", "MP", "PTS", "TRB", "AST", \
"FGA", "FG%", "3PA", "3P%", "2PA", "2P%", "FTA", "FT%", \
"STL", "BLK", "TOV", "PF", "win_pct"]]
return player_winpct_df
player_winpct_df_2019 = dataprep_combined(player_df_2019, team_df_2019, 2019)
display(player_winpct_df_2019)
Basically, we do all of the steps covered in the example above for all 19 seasons. We will union the 19 player_winpct_df's into one large df (which is why we added the `Year` column, to help us differentiate after the union).
The result of this procedure is what we will use for our upcoming EDA & Linear Regression analysis!
`abbrev_df`:
-Re-create our `abbrev_df` using the `scrape_abbrev()` function from before. We will just use one abbrev-to-team table for all 19 seasons.
-Manually add the abbreviations and team names of the few teams that changed names/abbreviations from 2000-2019, so that the merges with and between the player_df and team_df tables happen smoothly.
# Get the current-day abbrev-->team pairs
abbrev_df = scrape_abbrev()
# Add some manual rows of teams with different abbreviations and names in the past
manual_abbrevs = [
# Current Teams w/ Diff Abbrev in our dataset
['BRK', 'Brooklyn Nets'],
['PHO', 'Phoenix Suns'],
['CHO', 'Charlotte Hornets'],
# Teams that no longer exist
['VAN','Vancouver Grizzlies'],
['SEA', 'Seattle SuperSonics'],
['CHH', 'Charlotte Hornets'],
['NJN', 'New Jersey Nets'],
['NOH', 'New Orleans Hornets'],
['NOK', 'New Orleans/Oklahoma City Hornets'],
['CHA', 'Charlotte Bobcats'], # Need to remove existing mapping of CHA!
]
# Remove the current mappings for 'CHA', 'BKN', and 'PHX', since those
# abbrevs don't exist or mean something else in the basketball-reference.com database
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'CHA']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'BKN']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'PHX']
other_abbrevs=pd.DataFrame(manual_abbrevs, columns=["Abbrev", "Franchise"])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
abbrev_df = pd.concat([abbrev_df, other_abbrevs], ignore_index=True)
display(abbrev_df)
`player_df`, `team_df`, and `player_winpct_df` for all 19 seasons:
# Generic URLs that we can format a year into, so we can iterate through the 19 years
player_avgs_url = '''https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'''
team_standings_url = '''https://www.basketball-reference.com/leagues/NBA_{}_standings.html'''
# Empty Dataframe that we will keep appending with the results of each year's scraping + data prep
final_player_winpct_df = pd.DataFrame()
# Iterate from year 2001 (inclusive) to 2020 (exclusive) so that we can retrieve the right data
for year in range(2001,2020):
curr_player_avg_url = player_avgs_url.format(year)
curr_team_standing_url = team_standings_url.format(year)
# Data Collection
player_df = scrape_players(curr_player_avg_url)
team_df = scrape_standings(curr_team_standing_url)
# Data Prep
player_df = dataprep_players(player_df)
team_df = dataprep_teams(team_df, abbrev_df)
# Combine into one df
player_winpct_df = dataprep_combined(player_df, team_df, year)
# Append to our final df
final_player_winpct_df = pd.concat([final_player_winpct_df, player_winpct_df], ignore_index=True)
# Convert all numeric columns from string to float to make sure our plotting and Linear Regressions run smoothly
for col_name in final_player_winpct_df.columns[3:]:
final_player_winpct_df[col_name] = final_player_winpct_df[col_name].astype(float)
print(final_player_winpct_df.shape)
# Make sure all teams are represented
# Should be 37, since that is how many we had in our abbrev_df
print(len(final_player_winpct_df['Tm'].unique()))
# Another way to check is to make sure the following equality is true
# This makes sure the abbrevs from abbrev_df is the same as the abbrevs
# actually captured from our 20 seasons on BasketballReference.com
print(set(abbrev_df['Abbrev']) == (set(final_player_winpct_df['Tm'].unique())))
display(final_player_winpct_df)
With the code below, I saved `final_player_winpct_df` to a csv file and uploaded it online.
You can download it by visiting this link: https://adithyasolai.com/projects/nba_three_point_revolution/NBA%20Reg%20Season%20Player%20Avgs%20with%20Win%20Pct%202000-2019.csv
#final_player_winpct_df.to_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")
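To skip the scraping entirely, the hosted csv can be read straight back into a dataframe (assuming the hosted file keeps the column layout built above; index_col=0 re-uses the index column written by to_csv):
# Load the pre-scraped dataset directly from the hosted csv.
# csv_url = "https://adithyasolai.com/projects/nba_three_point_revolution/NBA%20Reg%20Season%20Player%20Avgs%20with%20Win%20Pct%202000-2019.csv"
# final_player_winpct_df = pd.read_csv(csv_url, index_col=0)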
Plot a violin plot of the 3PA (3-Pointers Attempted Per Game) stat against time to visually see if there is a linear trend (for all 19 years).
This helps to answer the initial question of whether 3-point shooting is actually more popular in the 2015-2019 era than in previous eras.
# Create a new column `Year Short` to be used when plotting so that the plot is cleaner.
# It is basically just the `Year` column with only the last 2 digits.
final_player_winpct_df['Year Short'] = final_player_winpct_df['Year']
def apply_yr_short(x):
# Just use last 2 digits
return x%100
final_player_winpct_df['Year Short'] = final_player_winpct_df['Year Short'].apply(apply_yr_short)
sns.violinplot(x='Year Short',y='3PA', data=final_player_winpct_df)
plt.title("3-Pointers Attempted Per Game Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointers Attempted Per Game")
Comments:
The white dots in each violin represent the median 3-Pt Attempts Per Game for that year. By following these white dots, we can see there is a positive linear trend, with a relatively steep increase from around 2013-2014 onward, which is right around the time the "Three Point Revolution" began to be noticed and implemented across the league.
The violins also show that 3-Pt attempts were right-skewed in the early 2000s, since the violins were fatter and wider towards the bottom and became skinnier at the larger 3-Pt Attempt numbers on the y-axis. This means that most players were concentrated around a lower # of 3-Pt shots attempted per game, and relatively few players took many 3-Pt shots.
This right-skewed pattern continues until 2014, when the violins start to look bi-modal for 2014, 2015, and 2016. This means there is a large concentration of players taking few 3-Pt shots and an equally large concentration taking relatively more shots, with few in the middle.
In 2017, 2018, and 2019, the violins become very skinny and don't have any noticeable peaks/skewness, which represents a more uniform distribution. By 2019, we can start to see a new trend emerging where there is a single peak/fatness in the middle of the violin near the median.
(The needle-like tops of the violins are caused by a few outlier specialist players, likely star players or very efficient 3-Pt marksmen, who are allowed to take more 3-pointers than most players in the league. We can see these players have existed to some extent in all years. The extent to which these specialists shoot more 3s differs, however: the outlier specialists in 2016 and onwards take considerably more 3s than those from years like 2009-2012.)
We will have to do a formal Linear Regression and t-test to determine if this linear trend from about 2 3-Pt Attempts on average (median from 2001 violin) to about 4 3-Pt Attempts on average (median from 2019 violin) is statistically significant.
Plot a violin plot of the 3P% (3-Pointer Make % or Efficiency) stat against time to visually see if there is a linear trend (for all 19 years).
This helps to see if NBA Starters have become more efficient at shooting 3-pointers in the 2015-2019 era alongside the growing number of 3-pointers attempted.
sns.violinplot(x='Year Short',y='3P%', data=final_player_winpct_df)
plt.title("3-Pointers Make % Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointer Make %")
Comments:
The outliers in these violins make it harder to make out a linear trend. The outliers with near-100% and near-0% efficiency are caused by starters who take very few 3-pointers despite being starters. Some examples of SUPERSTAR-caliber players that could still fall under this outlier category are Shaquille O'Neal, Dwight Howard, and Ben Simmons. These types of players probably took 1-2 3s over an entire season and either made them all (100%) or made none (0%).
However, the median white dots show that median 3-Pt efficiency in 2001 was about 32%, and this figure was close to 38% in 2019. From my domain knowledge, I think this change is pretty significant, especially since the peak theoretical 3-Pt efficiency humanly possible is, in my opinion, only about 50% (which is still really high). We will have to do a formal Linear Regression and t-test to confirm that this is statistically significant.
Over time, the violins become fatter towards the median and form a clear single peak. This indicates more and more players in the league are able to shoot at a similar 3-Pt efficiency around the median over time. The violins in the early 2000s, by comparison, are much skinnier and uniform, meaning 3-Pt efficiencies are all over the place.
I am not sure if "linear" is the best way to characterize these trends, however. We see the white dots go up from 2004-2009, dip down until 2013, and then consistently rise until 2019. In future studies, it would be helpful to get data from more years (say 1980-2000) to see whether these ups & downs are meaningful in the overall trend.
-I will be relying heavily on Linear Regression as my "Machine Learning" tool for testing my questions & hypotheses.
-We can use t-tests on the coefficients of the Linear Regression results to test my predictions & questions.
Is there a linear relationship between Year and 3-Pointers Attempted?
Ho (Null Hypothesis): There is no relationship between Year and 3-Pointers Attempted.
or
Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3PA~Year)
Ha: B1 ≠ 0
# Check the available columns
final_player_winpct_df.columns
# Only using the 'Year' column, so we need to reshape to fit scikit-learn's fit() function
regr_X = np.array(final_player_winpct_df['Year']).reshape(-1,1)
# Response is '3PA', or the average # of 3-pointers attempted Per Game
regr_y = final_player_winpct_df['3PA']
# Build a linear regression model using scikit-learn
regr = linear_model.LinearRegression()
# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)
# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)
# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)
# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))
Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Results/Comments:
Based on the results of fitting our Linear Regression model above, I reject the null hypothesis at a 95% confidence level (alpha=0.05, a two-tailed t-test). The t-value for our predictor `Year` is 14.482, which is much greater than the t-critical value (1.96) for this confidence level (t-critical values found here: https://www.stat.colostate.edu/inmem/gumina/st201/pdf/Utts-Heckard_t-Table.pdf). Additionally, the p-value of 0.000 for our predictor `Year` is another indicator for rejecting the null hypothesis, and so is the fact that 0 is not within our 95% Confidence Interval of [0.080, 0.106].
Since the coefficient of `Year` is positive, we can interpret it as: "On average, 3-point shot attempts increase by 0.093 shots every year in the NBA from 2000-2019."
Therefore, our first hypothesis turned out to be true for our sample (only starters, only 2000-2019)!!!
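Rather than consulting a lookup table, we can also compute the t-critical value directly with scipy (imported at the top but unused until now); a quick sketch:
# Two-tailed t-critical value at alpha = 0.05.
# Degrees of freedom = n - 2 for simple linear regression
# (n observations minus 2 estimated parameters: intercept and slope).
n = len(regr_y)
print(stats.t.ppf(1 - 0.05 / 2, df=n - 2))  # ~1.96 for large n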
Repeat the same as above to test whether there is a linear relationship between Year and 3-Point Shot Efficiency:
Ho (Null Hypothesis): There is no relationship between Year and 3-Pointer Efficiency.
or
Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3P%~Year)
Ha: B1 ≠ 0
# Get columns for the regression
regr_Xy = final_player_winpct_df[['Year', '3P%']]
# Drop all rows with NaN for players that never took any 3-point shots
regr_Xy = regr_Xy[regr_Xy['3P%'].isnull() == False]
# Only using the 'Year' column, so we need to reshape to fit scikit-learn's fit() function
regr_X = np.array(regr_Xy['Year']).reshape(-1,1)
# Response is '3P%', or the 3-Pointer Make % Per Game
regr_y = regr_Xy['3P%']
# Build a linear regression model using scikit-learn
regr = linear_model.LinearRegression()
# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)
# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)
# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)
# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))
Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Results/Comments:
I reject the null hypothesis at a 95% confidence level (alpha=0.05, a two-tailed t-test) since we have a p-value of 0.000 for our predictor `Year` and 0 is not within our 95% Confidence Interval of [0.002, 0.004].
Since the coefficient of `Year` is positive, we can interpret it as: "On average, 3-point shot accuracy increases by 0.003 (or 0.3%) every year in the NBA from 2000-2019." As mentioned before, this yearly increase of 0.3% is HUGE in the context of this domain: accumulated over the roughly 18 year-to-year steps in our sample, it comes to about 5-6 percentage points, which matches the rise in median 3-Pt efficiency from about 32% to about 38% seen in the violin plots. For perspective, even an overall 3-pt shooting efficiency of 40% is considered GREAT by today's standards.
Therefore, our second hypothesis turned out to be true for our sample (only starters, only 2000-2019)!!! Starters in the NBA are getting more efficient at shooting 3s alongside attempting more of them.
Now, we will run a Multiple Linear Regression in each of the 4 eras to see if 3-point shooting stats (ultimately just 3P%, since the attempts columns get dropped during data prep) are still significant predictors of Team Win% when considering all of the other player statistics:
Basically, we want to see which stats in a player lead to a higher win-pct, and whether 3-point shooting plays a significant role in that equation.
We need to do a bit of data preparation first. We will separate our total dataset into 4 smaller datasets, one per era. We will also make some key transformations and standardizations to the stats, and trim down which stats we focus on. As before, we will walk through all of the transformations with era 1 first as an example, and then run everything for all 4 eras in a for-loop.
First, let's filter `final_player_winpct_df` down to just era 1 (2000-2004):
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]
# print the unique years seen in this df to make sure we only use the years we intended
print(era1_end_2004['Year'].unique())
We don't care about `Year`, `Player` (the player's name), `Tm`, `start_pct`, or the plotting helper `Year Short`, since these don't help us answer which player metrics lead to more wins.
era1_end_2004 = era1_end_2004.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)
era1_end_2004.head()
We will standardize tally stats like `PTS`, `TRB` (total rebounds), `AST` (assists), `STL` (steals), `BLK` (blocks), `TOV` (turnovers), and `PF` (personal fouls) by dividing them by minutes played (`MP`). This will help show how efficient each player is in each of these stats.
We don't want our Linear Regression model to just point out obvious facts like "scoring more points leads to more victories", so we will use efficiency metrics as our predictors instead.
Drop the `MP` column after this standardization, as it no longer serves any purpose.
Rename the columns we just standardized to indicate that these are now efficiency metrics. Ex: `PTS` --> `efPTS`.
# Standardize by converting these to efficiency metrics
era1_end_2004['PTS'] = era1_end_2004['PTS'] / era1_end_2004['MP']
era1_end_2004['TRB'] = era1_end_2004['TRB'] / era1_end_2004['MP']
era1_end_2004['AST'] = era1_end_2004['AST'] / era1_end_2004['MP']
era1_end_2004['STL'] = era1_end_2004['STL'] / era1_end_2004['MP']
era1_end_2004['BLK'] = era1_end_2004['BLK'] / era1_end_2004['MP']
era1_end_2004['TOV'] = era1_end_2004['TOV'] / era1_end_2004['MP']
era1_end_2004['PF'] = era1_end_2004['PF'] / era1_end_2004['MP']
# Drop `MP` column
era1_end_2004 = era1_end_2004.drop(labels=['MP'], axis=1)
# Rename columns
new_names = {'PTS': 'efPTS',
'TRB': 'efTRB',
'AST': 'efAST',
'STL': 'efSTL',
'BLK': 'efBLK',
'TOV': 'efTOV',
'PF': 'efPF'}
era1_end_2004 = era1_end_2004.rename(columns=new_names)
era1_end_2004.head()
Drop the `FGA` and `FG%` columns, as those columns can be derived from `3PA`, `3P%`, `2PA`, and `2P%`. `FGA` and `FG%` are the attempts and efficiency metrics for "Field Goals", which is just a term for non-Free-Throw scoring (3-pointers and 2-pointers).
Keeping these columns could cause severe multicollinearity, which adds noise to our final Linear Regression. Read more about multicollinearity here: https://www.statisticshowto.com/multicollinearity/. Being able to transform other predictor variables into `FGA` and `FG%` makes these variables the worst-case scenario for multicollinearity, since the relationship is exact, direct, and CAUSAL.
For the purposes of our study, we would also prefer to isolate `3PA` and `3P%` as much as possible, and remove predictors like `FGA` and `FG%` that incorporate 3-pointer metrics in their derivation.
era1_end_2004 = era1_end_2004.drop(labels=['FGA', 'FG%'], axis=1)
era1_end_2004.head()
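Before these columns disappear, here is a quick sanity check of that exact relationship (a sketch; per-game averages on basketball-reference are rounded to one decimal, so expect small nonzero differences):
# FGA should equal 3PA + 2PA for every player-row (up to rounding).
check = final_player_winpct_df[['FGA', '3PA', '2PA']].dropna()
print((check['FGA'] - (check['3PA'] + check['2PA'])).abs().max())  # expect ~0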
Drop `3PA`, `2PA`, and `FTA`, since we really only care about efficiency. The numbers of 3-pointers, 2-pointers, and Free Throws attempted depend heavily on the number of minutes a player is allowed to play, so we want to remove this factor to standardize things.
We still have the `%` (efficiency) columns for these 3 metrics.
era1_end_2004 = era1_end_2004.drop(labels=['3PA', '2PA', 'FTA'], axis=1)
era1_end_2004.head()
Last bit of Data Prep:
Drop all rows with NaN for `3P%`, `2P%`, or `FT%` (players that never took any 3-pt shots, 2-pt shots, or free throws).
print("Before: ", era1_end_2004.shape)
era1_end_2004 = era1_end_2004[era1_end_2004['3P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['2P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['FT%'].isnull() == False]
print("After: ", era1_end_2004.shape)
Now, we're ready to run our Multiple Linear Regression with `win_pct` as the response, and all other columns above as the predictors.
# Select all columns except `win_pct`, which is our response
regr_X = era1_end_2004.loc[:, era1_end_2004.columns != 'win_pct']
# Response is `win_pct`
regr_y = era1_end_2004['win_pct']
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant'] + list(era1_end_2004.columns[:-1])
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
We will analyze these results later!
Now, we will package the data prep and modeling code we just did in our example into convenient functions that will allow us to run this Linear Regression Model across all eras.
def dataprep_linreg(era):
era_local = era.copy()
# Drop useless columns
era_local = era_local.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)
# Standardize by converting these to efficiency metrics
era_local['PTS'] = era_local['PTS'] / era_local['MP']
era_local['TRB'] = era_local['TRB'] / era_local['MP']
era_local['AST'] = era_local['AST'] / era_local['MP']
era_local['STL'] = era_local['STL'] / era_local['MP']
era_local['BLK'] = era_local['BLK'] / era_local['MP']
era_local['TOV'] = era_local['TOV'] / era_local['MP']
era_local['PF'] = era_local['PF'] / era_local['MP']
# Drop `MP` column
era_local = era_local.drop(labels=['MP'], axis=1)
# Rename columns
new_names = {'PTS': 'efPTS',
'TRB': 'efTRB',
'AST': 'efAST',
'STL': 'efSTL',
'BLK': 'efBLK',
'TOV': 'efTOV',
'PF': 'efPF'}
era_local = era_local.rename(columns=new_names)
# Drop FG-related columns to avoid multicollinearity
era_local = era_local.drop(labels=['FGA', 'FG%'], axis=1)
# Drop Attempt columns, as we care mostly about efficiency
era_local = era_local.drop(labels=['3PA', '2PA', 'FTA'], axis=1)
# Remove NaN values in columns where players took no attempts
era_local = era_local[era_local['3P%'].isnull() == False]
era_local = era_local[era_local['2P%'].isnull() == False]
era_local = era_local[era_local['FT%'].isnull() == False]
return pd.DataFrame(era_local)
def runlinreg(era):
# Select all columns except `win_pct`, which is our response
regr_X = era.loc[:, era.columns != 'win_pct']
# Response is `win_pct`
regr_y = era['win_pct']
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant'] + list(era.columns[:-1])
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Finally, let's run the Linear Regression on all eras and see the output!
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]
# The 2005-2009 era
era2_end_2009 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2005,2006,2007,2008, 2009])]
# The 2010-2014 era
era3_end_2014 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2010,2011,2012,2013,2014])]
# The 2015-2019 era
era4_end_2019 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2015,2016,2017,2018,2019])]
# Put all era's dfs in a list
eras = [era1_end_2004, era2_end_2009, era3_end_2014, era4_end_2019]
for idx, era in enumerate(eras):
era = dataprep_linreg(era)
print("ERA", idx+1, "LINEAR REGRESSION RESULTS: ")
runlinreg(era)
print()
Results
By using our Hypothesis Testing methodology from before, I determined that the following predictors reject the null hypothesis and are significant in predicting `win_pct` when holding all other predictors constant. (I did this by selecting the predictors in each era's Linear Regression summary that had a p-value < 0.05):
Era 1 (2000-2004):
efPTS (p-val = 0.005)
efAST (p-val = 0.000)
2P% (p-val = 0.000)
efTOV (p-val = 0.000)
Era 2 (2005-2009):
efPTS (p-val = 0.001)
efAST (p-val = 0.000)
2P% (p-val = 0.000)
efBLK (p-val = 0.009)
efTOV (p-val = 0.000)
Era 3 (2010-2014):
efPTS (p-val = 0.028)
efAST (p-val = 0.017)
3P% (p-val = 0.002)
2P% (p-val = 0.000)
efSTL (p-val = 0.017)
efBLK (p-val = 0.001)
efTOV (p-val = 0.001)
efPF (p-val = 0.017)
Era 4 (2015-2019):
efPTS (p-val = 0.023)
efAST (p-val = 0.000)
3P% (p-val = 0.004)
2P% (p-val = 0.000)
efBLK (p-val = 0.016)
efTOV (p-val = 0.000)
Conclusion Preamble
Before addressing our questions from the beginning of this study, let's discuss how to interpret and compare the Linear Regression results from the 4 eras. It is hard to use coefficients to compare across eras because, although they are all efficiency metrics, the `3P%`, `2P%`, and `FT%` predictors use shots taken as the standardizer, while the other predictors use minutes played. Additionally, these coefficients are hard to interpret the way we did with our Simple Linear Regression with just one predictor. This is because all of these predictors are in the range 0-1 (0%-100%), and our standard tactic of describing the change in response per 1 unit of increase in a predictor does not make sense here, because 1 unit of increase is a whole 100%. (A more natural unit would be an increase of 0.01, i.e., one percentage point.)
Let's just keep things simple and use the sign of the coefficients (positive or negative) and the magnitude of the p-value (analogous to and derived from the t-statistic) to help make comparisons and observations. A smaller p-value indicates STRONGER evidence against the null hypothesis, so predictors with relatively smaller p-values are more likely to be genuinely significant than other predictors. Read more about the exact definition and interpretation of p-values here: https://www.statsdirect.com/help/basics/p_values.htm.
Also, it is important to remember that the `efPTS` predictor is not highly correlated with `3P%`, `2P%`, and `FT%`, since the former is standardized by minutes played and the latter are standardized by shot attempts. Obviously, minutes played and shot attempts are themselves related, but overall this is not as big of an issue as predictors that are directly derivable from other predictors. This claim can also be spot-checked directly, as sketched below.
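A quick sketch of that spot-check, using the era-4 data and the prep function defined above (a rough correlation reading, not a formal test):
# Correlation between per-minute scoring efficiency and the
# per-attempt shooting accuracies in the most recent era.
era_check = dataprep_linreg(era4_end_2019)
print(era_check[['efPTS', '3P%', '2P%', 'FT%']].corr())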
CONCLUSION
My hypothesis that 3-point efficiency would be a significant predictor of success in ALL ERAS was WRONG! As per the results above, 3-point efficiency only became a significant predictor in eras 3 & 4 (2010-2019). I think this result is indicative of the growing power of the 3-point shot. In eras 1 & 2 (2000-2009), the list of significant predictors was much smaller than in later eras, and was largely made up of predictors strictly related to putting the ball in the hoop to score more points and win more games (point-scoring efficiency, 2-pointer scoring efficiency, assist efficiency). In the more recent eras, however, 3-point efficiency became significant in its own right, even in the context of these staple predictors.
My other hypothesis, regarding the extent to which 3-point efficiency DOMINATES other predictors, is hard to evaluate with this approach. Although `3P%` does have a lower p-value compared to other predictors in the later eras (3 & 4), I still don't feel comfortable using that as a definitive measure. However, I can at least see that 3-point efficiency has become MORE dominant/important as time goes on, based on the observations for our other hypothesis question.
Things to Improve On
We need to find a better way to represent our predictors so that interpretation and comparisons are easier across eras.
Alternatively, we could find some other regression method that gives us output statistics on each predictor that can be used universally to compare against the output of other eras. One simple option along these lines is sketched below.
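For example, a hedged sketch (not something run in this study): z-score each predictor before fitting, so that every coefficient becomes a "beta weight", i.e., the change in win_pct per one standard deviation of that predictor, which is comparable across predictors and across eras.
# Sketch: standardized (beta) coefficients for one prepped era df.
def run_standardized_linreg(era):
    X = era.loc[:, era.columns != 'win_pct']
    y = era['win_pct']
    # z-score each predictor; each coefficient is now the change in
    # win_pct per 1 standard deviation increase in that predictor
    X_std = (X - X.mean()) / X.std()
    model = sm.OLS(y, sm.add_constant(X_std)).fit()
    print(model.summary())
# Example usage: run_standardized_linreg(dataprep_linreg(era4_end_2019))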
It would also help to get more domain knowledge on applying Data Science to Basketball Statistics. The following YouTube video and channel dive into the intersection of basketball and analytics, including some mistakes to avoid when using Per-Game statistics. I tried my best to follow these guidelines for this dataset.
https://www.youtube.com/watch?v=pznoCFs7XZg&ab_channel=ThinkingBasketball