by Adithya Solai
CMSC320 Final Tutorial
Section 0101 Dickerson
There are numerous reports and studies about how the NBA has transformed over the past five years with regard to the 3-point shot. The 3-pointer became dominant and popular in the NBA following the success of Stephen Curry and his Golden State Warriors (2015, 2017, & 2018 NBA Champions). Here are some great videos and articles about the "Three Point Revolution":
-How Data Changed the NBA by The Economist (Houston Rockets & Second Spectrum): https://www.youtube.com/watch?v=oUvvfHkXyOA&ab_channel=TheEconomist
-https://fivethirtyeight.com/features/how-mapping-shots-in-the-nba-changed-it-forever/
-https://fivethirtyeight.com/features/basketballs-other-3-point-revolution/
-https://fivethirtyeight.com/features/stephen-curry-is-the-revolution/
-As discussed in the resources above, Data Science & Analytics made coaches, players, and team executives comfortable with adopting the 3-pointer. Now, we will use Data Science to test if those decisions paid off and are responsible for more wins.
-In this study, I will calculate and compare the predictive power of 3-pointer-related Season Average Player Statistics (3P%, 3PA, etc.) in predicting NBA Regular Season team win-pct in different eras of the 21st century.
-The four eras will be 2000-2004, 2005-2009, 2010-2014, and 2015-2019.
-I will only use data from players that are "Starters" because they have an outsized impact on their team's Win% and are given the freedom to take a wider array of shots. I define a "Starter" as a player that starts >= 50% of games in a season for their team, which mirrors how the NBA separates "Starters" from non-"Starters" when awarding the 6th Man of the Year award.
-Confirm the narrative: are players actually attempting more 3-pointers as time goes on? (a Simple Linear Regression: 3FieldGoalAttempted ~ Year)
-Are players becoming more efficient 3-point shooters as time goes on, as a by-product of the global increase in 3-pointers attempted? (a Simple Linear Regression: 3FieldGoal% ~ Year) My rationale: if my first hypothesis holds and more 3-pointers are being attempted overall, then efficient 3-point shooting has likely become more valuable to teams, meaning teams will increasingly choose players with higher 3-point efficiency as their Starters.
-I hypothesize that 3-pt player stats (3-pointers attempted, 3-pointer efficiency, etc) will be statistically significant even when considering all of the other box score player stats for ALL 4 eras. (a Multiple Linear Regression: Win%~All Player Stats)
-Primarily, I am interested to see whether the predictive power of 3pt player stats starts to OUTWEIGH and DOMINATE the other player stats traditionally regarded as having the most predictive power (FG%, TREB, AST, etc) as time goes on.
To show my Data Collection Step-by-Step with relevant output, I will first do a run-thru for just the 2018-2019 NBA Season.
I will also be defining functions along the way that will be used later when doing this process again for all 19 seasons in a for-loop.
import re
import requests
from bs4 import BeautifulSoup, Comment
from os import path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
from scipy import stats
Scrape Regular Season Per-Game Average Stats for all NBA players in 2018-2019 season.
Player Per-Game Season Averages URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
# Get Per-Game season averages for all NBA players from the 2018-2019 season.
player_avgs_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'
def scrape_players(url):
r = requests.get(url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
# Find just the HTML content under the `table` tag, which is where the data is!
player_stats_table = root.find("table").prettify()
# Use pandas's html reader to convert our prettified HTML table into a dataframe.
player_df = pd.read_html(player_stats_table)[0]
# Our df has more rows than the # of players shown on the original basketball-reference website.
# This is caused by some rows that are just copies of the column headers, since these rows help readers
# of the webpage remember what each column is as they scroll down the webpage.
# We can just use some filtering to drop these rows
# This is also caused by players that switched teams mid-season and have more than 1 row of season avg stats.
# For the purposes of our study, we won't merge them into one row, since we can't just collapse their contributions
# into just one team. Our study is about how the player's stats impacted the Win% of their team, so we need to keep
# the player's contributions to different teams as separate.
# For the reason above, we will also eventually drop rows with `TOT` as the team, since such a row combines
# the stats of a player who played for more than one team in one season.
# We will determine whether the player is a starter using a % cutoff (Games Started / Total Games for that team),
# and not a strict # of Games Started cutoff, so that we don't drop starter-caliber players that just happened
# to switch teams mid-season.
return player_df
player_df_2019 = scrape_players(player_avgs_2019_url)
print("player_df shape: ", player_df_2019.shape)
display(player_df_2019)
Scrape NBA Team Standings at the end of the 2018-2019 season.
Team Final Standings URL Example (2018-2019 season): https://www.basketball-reference.com/leagues/NBA_2019_standings.html
For some reason, the well-formatted 'Expanded Standings' table from the URL above is stored inside HTML comments in the HTML source code. I had to track down which comment this was and use the fix found here to scrape the data: https://stackoverflow.com/a/52679343
team_standings_2019_url = 'https://www.basketball-reference.com/leagues/NBA_2019_standings.html'
def scrape_standings(url):
r = requests.get(url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
# Scrape all the content within HTML Comments into a list `comments`
comments = root.find_all(text=lambda text:isinstance(text, Comment))
# By analyzing the list of comments above, I observed that the data we want
# is always contained in index 26 for all 19 NBA seasons we will scrape.
all_standings_comment=comments[26]
# Re-initialize our BeautifulSoup object to only parse and scrape the HTML contents
# stored in the HTML Comment found above
comment_root = BeautifulSoup(all_standings_comment, 'lxml')
# Find all the data under the `table` tag, and prettify it so that it can be parsed
# by pandas' html reader
all_standings_table=comment_root.find("table").prettify()
team_df = pd.read_html(all_standings_table)[0]
# The data originally has a MultiIndex column structure to indicate
# things like "Division"; we don't need that.
team_df.columns=team_df.columns.droplevel()
return team_df
team_df_2019 = scrape_standings(team_standings_2019_url)
display(team_df_2019)
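As an aside, hardcoding comments[26] is brittle if basketball-reference ever reorders its page. A more robust alternative (a sketch, untested against every season, and assuming the table's HTML id is 'expanded_standings', matching the 'Expanded Standings' caption mentioned above) is to scan the comments for the table we want:
# Hypothetical, more robust alternative to hardcoding comments[26]:
# return the first HTML comment containing the expanded standings table.
def find_standings_comment(comments):
    for c in comments:
        if 'expanded_standings' in c:  # assumed table id
            return c
    return None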
Scrape Current-Day Abbreviations for NBA Teams
def scrape_abbrev():
team_abbrev_url='https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations'
r = requests.get(team_abbrev_url)
#print(r) # Make sure we get Response 200
root = BeautifulSoup(r.content, 'lxml')
#print(type(root)) # Make sure BeautifulSoup object initializes
abbrev_table = root.find("table").prettify()
abbrev_df = pd.read_html(abbrev_table)[0]
# Rename columns, and remove first row (which is just the header from the original data source)
abbrev_df.columns = ["Abbrev", "Franchise"]
abbrev_df = abbrev_df[1:]
display(abbrev_df)
return abbrev_df
abbrev_df = scrape_abbrev()
Like Data Collection, I will be showing this process Step-by-Step for the 2018-2019 season dfs created before, and defining functions along the way that will help when tackling all 19 seasons in a for-loop.
`player_df`:
-Remove column header copy rows. These rows exist on the website to help readers remember what each column means as they scroll down the webpage.
-Remove rows where the `Tm` column value is `TOT`, as these are combined stats for players who played on more than one team in a season (due to trades). We can't keep these, as they won't merge properly with our team standings table, since there is no team with abbreviation `TOT`.
def dataprep_players(player_df):
# remove rows that are just copies of the column headers
# they all have their name in the Player column as 'Player', so we can use this to filter them out
# Drop all rows with Player name = 'Player'
player_df = player_df[player_df['Player'] != 'Player']
# Drop rows with Tm=TOT, since these are combined stats for players who played on
# more than 1 team this season. We can't have this for our final analysis
player_df = player_df[player_df['Tm'] != 'TOT']
return player_df
print('Before Dropping Header Copy Rows & Tm=TOT rows: ', player_df_2019.shape)
player_df_2019 = dataprep_players(player_df_2019)
# Check that the overall # of rows has decreased.
print('After Dropping Header Copy Rows & TOT rows: ', player_df_2019.shape)
`team_df`:
-Drop all columns except `Team` and `Overall`.
-Use regex to create `W` and `L` columns from the values in the `Overall` column. `W` is the # of wins the team had that season, and `L` is the # of losses.
-Use the `W` and `L` columns to construct the `win_pct` column (Formula: Wins/Total Games, or `W`/(`W`+`L`)).
-Using `win_pct` in a season instead of Games Won is a natural standardizer/normalizer across seasons. Some seasons had fewer than the standard 82 games (lockout years, the COVID-19 pandemic, etc.), so Games Won should not be used as our response variable.
def dataprep_teams(team_df, abbrev_df):
# Merge with `abbrev_df` to bring over the abbreviations
team_df = team_df.merge(right=abbrev_df, left_on='Team', right_on='Franchise')
# Drop all columns except Team, Overall record, and Abbrev
team_df = team_df[['Abbrev', 'Team', 'Overall']]
# Create a `W` and `L` column out of the `Overall` column
for row in team_df.iterrows():
# Get the current row's `Overall` column value
curr_overall = row[1]['Overall']
# Extract wins and losses from `Overall` column value, and store the match groups
w_and_l= re.search(r"^(\d{1,2})-(\d{1,2})$", curr_overall).groups()
# use the list of matches to retrieve wins and losses
wins = int(w_and_l[0])
losses = int(w_and_l[1])
# Store these values in new columns `W` and `L`
curr_index = row[0]
team_df.at[curr_index, 'W'] = wins
team_df.at[curr_index, 'L'] = losses
# No longer need `Overall` column
team_df = team_df[['Abbrev', 'Team', 'W', 'L']]
# Create win_pct column using formula: W / (W+L)
team_df['win_pct'] = team_df['W'] / (team_df['W'] + team_df['L'])
return team_df
team_df_2019 = dataprep_teams(team_df_2019, abbrev_df)
display(team_df_2019)
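As a side note, the row-by-row iterrows() loop in dataprep_teams could be replaced by a vectorized one-liner (an equivalent alternative sketched here as comments, since `Overall` has already been dropped at this point):
# Inside dataprep_teams, before dropping `Overall`:
# wl = team_df['Overall'].str.extract(r'^(?P<W>\d{1,2})-(?P<L>\d{1,2})$').astype(int)
# team_df[['W', 'L']] = wl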
Merge `player_df` and `team_df` into `player_winpct_df`:
-With `player_df` as left and `team_df` as right, inner-join on team abbreviation (each season's tables are merged within that season) to create `player_winpct_df`.
-Filter down to only starters from each year. This is to help see better patterns in 3FGA, as non-starters probably won't see a noticeable increase in 3-pointers attempted since they don't get many shots per game to begin with. Use the definition of starter used to determine the 6th Man of the Year award (https://en.wikipedia.org/wiki/NBA_Sixth_Man_of_the_Year_Award): to not be considered a "Starter", you must start less than 50% of games for your team. Therefore, we will define a starter as a player that starts >= 50% of games. We will create a new column `start_pct` by dividing the `GS` column by the `G` column, and use it to filter out rows.
-Assign a new column `Year` for each of the 19 seasons (just a constant 2019 in this example case). This will help with merges and visualizing our data later on.
-Only keep the following columns (to clean up and only keep statistics for our Linear Regressions that are simple and easy to interpret): `Year`, `Player`, `Tm`, `start_pct`, `MP`, `PTS`, `TRB`, `AST`, `FGA`, `FG%`, `3PA`, `3P%`, `2PA`, `2P%`, `FTA`, `FT%`, `STL`, `BLK`, `TOV`, `PF`, `win_pct`.
-Stats like `FG` (Field Goals made), `FT`, `3P`, and `2P` are omitted because we also have the corresponding Attempts and Accuracy (%) stats, so we can always derive "Made" stats like `FG` from those two stats if needed (see the quick illustration below). Including all three could lead to problems and noise in our Linear Regression, since the three stats would be so closely related to each other (they are derived from each other).
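A quick illustration of that derivation, with toy numbers rather than values from the dataset:
# "Made" stats are recoverable as Attempts x Accuracy, e.g. FG = FGA * FG%.
fga, fg_pct = 8.0, 0.45  # hypothetical per-game attempts and accuracy
print(fga * fg_pct)      # 3.6 field goals made per game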
def dataprep_combined(player_df, team_df, year):
# Initialize player_winpct_df by inner-joining on team abbreviation,
# with player_df as left and team_df as right
player_winpct_df = player_df.merge(right=team_df[['Abbrev', 'Team', 'win_pct']], left_on='Tm', right_on='Abbrev')
# Convert `G` and `GS` columns to integers so we can create `start_pct` column
player_winpct_df['G'] = player_winpct_df['G'].astype(int)
player_winpct_df['GS'] = player_winpct_df['GS'].astype(int)
# Create `start_pct` column (`GS` / `G`)
player_winpct_df['start_pct'] = player_winpct_df['GS'] / player_winpct_df['G']
# Only keep rows with start_pct >= 0.500
player_winpct_df = player_winpct_df[player_winpct_df['start_pct'] >= 0.500]
# Add the `Year` columns
player_winpct_df['Year'] = year
# Only keep columns listed above
player_winpct_df = player_winpct_df[["Year", "Player", "Tm", "start_pct", "MP", "PTS", "TRB", "AST", \
"FGA", "FG%", "3PA", "3P%", "2PA", "2P%", "FTA", "FT%", \
"STL", "BLK", "TOV", "PF", "win_pct"]]
return player_winpct_df
player_winpct_df_2019 = dataprep_combined(player_df_2019, team_df_2019, 2019)
display(player_winpct_df_2019)
Basically, we do all of the steps covered in the example above for all 19 seasons. We will union the 19 player_winpct_df's into one large df (which is why we added the `Year` column, to help us differentiate after the union).
The result of this procedure is what we will use for our upcoming EDA & Linear Regression analysis!
`abbrev_df`:
-Re-create our `abbrev_df` using the `scrape_abbrev()` function from before. We will just use one abbrev-to-team table for all 19 seasons.
-Manually add the abbreviations and team names of the few teams that changed names/abbreviations from 2000-2019, so that the merges with and between the player_df and team_df tables happen smoothly.
# Get the current-day abbrev-->team pairs
abbrev_df = scrape_abbrev()
# Add some manual rows of teams with different abbreviations and names in the past
manual_abbrevs = [
# Current Teams w/ Diff Abbrev in our dataset
['BRK', 'Brooklyn Nets'],
['PHO', 'Phoenix Suns'],
['CHO', 'Charlotte Hornets'],
# Teams that no longer exist
['VAN','Vancouver Grizzlies'],
['SEA', 'Seattle SuperSonics'],
['CHH', 'Charlotte Hornets'],
['NJN', 'New Jersey Nets'],
['NOH', 'New Orleans Hornets'],
['NOK', 'New Orleans/Oklahoma City Hornets'],
['CHA', 'Charlotte Bobcats'], # Need to remove existing mapping of CHA!
]
# Remove the current mappings for 'CHA', 'BKN', and 'PHX', since those
# abbrevs don't exist or mean something else in the basketball-reference.com database
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'CHA']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'BKN']
abbrev_df = abbrev_df[abbrev_df['Abbrev'] != 'PHX']
other_abbrevs=pd.DataFrame(manual_abbrevs, columns=["Abbrev", "Franchise"])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
abbrev_df = pd.concat([abbrev_df, other_abbrevs], ignore_index=True)
display(abbrev_df)
`player_df`, `team_df`, and `player_winpct_df` for all 19 seasons:
# Generic URLs that we can format a year into, so we can iterate through the 19 years
player_avgs_url = '''https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'''
team_standings_url = '''https://www.basketball-reference.com/leagues/NBA_{}_standings.html'''
# Empty Dataframe that we will keep appending with the results of each year's scraping + data prep
final_player_winpct_df = pd.DataFrame()
# Iterate from year 2001 (inclusive) to 2020 (exclusive) so that we can retrieve the right data
for year in range(2001,2020):
curr_player_avg_url = player_avgs_url.format(year)
curr_team_standing_url = team_standings_url.format(year)
# Data Collection
player_df = scrape_players(curr_player_avg_url)
team_df = scrape_standings(curr_team_standing_url)
# Data Prep
player_df = dataprep_players(player_df)
team_df = dataprep_teams(team_df, abbrev_df)
# Combine into one df
player_winpct_df = dataprep_combined(player_df, team_df, year)
# Append to our final df
final_player_winpct_df = pd.concat([final_player_winpct_df, player_winpct_df], ignore_index=True)
# Convert all numeric columns from string to float to make sure our plotting and Linear Regressions run smoothly
for col_name in final_player_winpct_df.columns[3:]:
final_player_winpct_df[col_name] = final_player_winpct_df[col_name].astype(float)
print(final_player_winpct_df.shape)
# Make sure all teams are represented
# Should be 37, since that is how many we had in our abbrev_df
print(len(final_player_winpct_df['Tm'].unique()))
# Another way to check is to make sure the following equality is true
# This makes sure the abbrevs from abbrev_df is the same as the abbrevs
# actually captured from our 20 seasons on BasketballReference.com
print(set(abbrev_df['Abbrev']) == (set(final_player_winpct_df['Tm'].unique())))
display(final_player_winpct_df)
With the code below, I saved `final_player_winpct_df` to a csv file and uploaded it online.
You can download it by visiting this link: https://adithyasolai.com/projects/nba_three_point_revolution/NBA%20Reg%20Season%20Player%20Avgs%20with%20Win%20Pct%202000-2019.csv
#final_player_winpct_df.to_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")
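To skip the scraping entirely, the hosted csv can be read straight back into a dataframe (assuming the hosted file keeps the column layout built above; index_col=0 re-uses the index column written by to_csv):
# Load the pre-scraped dataset directly from the hosted csv.
# csv_url = "https://adithyasolai.com/projects/nba_three_point_revolution/NBA%20Reg%20Season%20Player%20Avgs%20with%20Win%20Pct%202000-2019.csv"
# final_player_winpct_df = pd.read_csv(csv_url, index_col=0)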
Plot a violin plot of the 3PA (3-Pointers Attempted Per Game) stat against time to visually see if there is a linear trend (for all 19 years).
This helps to answer the initial question of whether 3-point shooting is actually more popular in the 2015-2019 era than in previous eras.
# Create a new column `Year Short` to be used when plotting so that the plot is cleaner.
# It is basically just the `Year` column with only the last 2 digits.
final_player_winpct_df['Year Short'] = final_player_winpct_df['Year']
def apply_yr_short(x):
# Just use last 2 digits
return x%100
final_player_winpct_df['Year Short'] = final_player_winpct_df['Year Short'].apply(apply_yr_short)
sns.violinplot(x='Year Short',y='3PA', data=final_player_winpct_df)
plt.title("3-Pointers Attempted Per Game Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointers Attempted Per Game")
Comments:
The white dots in each violin represent the median 3-Pt Attempts Per Game for that year. By following these white dots, we can see there is a positive linear trend, with a relatively steep increase from around 2013-2014 onward, which is right around the time the "Three Point Revolution" began to be noticed and implemented across the league.
The violins also show that 3-Pt attempts were right-skewed in the early 2000s, since the violins were fatter and wider towards the bottom and became skinnier at the larger 3-Pt Attempt numbers on the y-axis. This means that most players were concentrated around a lower # of 3-Pt shots attempted per game, and relatively few players took many 3-Pt shots.
This right-skewed pattern continues until 2014, when the violins start to look bi-modal for 2014, 2015, and 2016. This means there is a large concentration of players taking few 3-Pt shots and an equally large concentration taking relatively more shots, with few in the middle.
In 2017, 2018, and 2019, the violins become very skinny and don't have any noticeable peaks/skewness, which represents a more uniform distribution. By 2019, we can start to see a new trend emerging where there is a single peak/fatness in the middle of the violin near the median.
(The needle-like tops of the violins are caused by a few outlier specialist players, likely star players or very efficient 3-Pt marksmen, who are allowed to take more 3-pointers than most players in the league. We can see these players have existed to some extent in all years. The extent to which these specialists shoot more 3s differs, however: the outlier specialists in 2016 and onwards take considerably more 3s than those from years like 2009-2012.)
We will have to do a formal Linear Regression and t-test to determine if this linear trend from about 2 3-Pt Attempts on average (median from 2001 violin) to about 4 3-Pt Attempts on average (median from 2019 violin) is statistically significant.
Plot a violin plot of the 3P% (3-Pointer Make % or Efficiency) stat against time to visually see if there is a linear trend (for all 19 years).
This helps to see if NBA Starters have become more efficient at shooting 3-pointers in the 2015-2019 era alongside the growing number of 3-pointers attempted.
sns.violinplot(x='Year Short',y='3P%', data=final_player_winpct_df)
plt.title("3-Pointers Make % Over Time")
plt.xlabel("Year (2000-2019)")
plt.ylabel("3-Pointer Make %")
Comments:
The outliers in these violins make it harder to make out a linear trend. The outliers with near-100% and near-0% efficiency are caused by starters who take very few 3-pointers despite being starters. Some examples of SUPERSTAR-caliber players that could still fall under this outlier category are Shaquille O'Neal, Dwight Howard, and Ben Simmons. These types of players probably took 1-2 3s over an entire season and either made them all (100%) or made none (0%).
However, the median white dots show that median 3-Pt efficiency in 2001 was about 32%, and this figure was close to 38% in 2019. From my domain knowledge, I think this change is pretty significant, especially since the peak theoretical 3-Pt efficiency humanly possible is, in my opinion, only about 50% (which is still really high). We will have to do a formal Linear Regression and t-test to confirm that this is statistically significant.
Over time, the violins become fatter towards the median and form a clear single peak. This indicates more and more players in the league are able to shoot at a similar 3-Pt efficiency around the median over time. The violins in the early 2000s, by comparison, are much skinnier and uniform, meaning 3-Pt efficiencies are all over the place.
I am not sure if "linear" is the best way to characterize these trends, however. We see the white dots go up from 2004-2009, dip down until 2013, and then consistently rise until 2019. In future studies, it would be helpful to get data from more years (say 1980-2000) to see whether these ups & downs are meaningful in the overall trend.
-I will be relying heavily on Linear Regression as my "Machine Learning" tool for testing my questions & hypotheses.
-We can use t-tests on the coefficients of the Linear Regression results to test my predictions & questions.
Is there a linear relationship between Year and 3-Pointers Attempted?
Ho (Null Hypothesis): There is no relationship between Year and 3-Pointers Attempted.
or
Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3PA~Year)
Ha: B1 ≠ 0
# Check the available columns
final_player_winpct_df.columns
# Only using the 'Year' column, so we need to reshape to fit scikit-learn's fit() function
regr_X = np.array(final_player_winpct_df['Year']).reshape(-1,1)
# Response is '3PA', or the average # of 3-pointers attempted Per Game
regr_y = final_player_winpct_df['3PA']
# Build a linear regression model using scikit-learn
regr = linear_model.LinearRegression()
# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)
# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)
# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)
# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))
Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Results/Comments:
Based on the results of fitting our Linear Regression model above, I reject the null hypothesis at a 95% confidence level (alpha=0.05, a two-tailed t-test). The t-value for our predictor `Year` is 14.482, which is much greater than the t-critical value (1.96) for this confidence level (t-critical values found here: https://www.stat.colostate.edu/inmem/gumina/st201/pdf/Utts-Heckard_t-Table.pdf). Additionally, the p-value of 0.000 for our predictor `Year` is another indicator for rejecting the null hypothesis, and so is the fact that 0 is not within our 95% Confidence Interval of [0.080, 0.106].
Since the coefficient of `Year` is positive, we can interpret it as: "On average, 3-point shot attempts increase by 0.093 shots every year in the NBA from 2000-2019."
Therefore, our first hypothesis turned out to be true for our sample (only starters, only 2000-2019)!!!
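Rather than consulting a lookup table, we can also compute the t-critical value directly with scipy (imported at the top but unused until now); a quick sketch:
# Two-tailed t-critical value at alpha = 0.05.
# Degrees of freedom = n - 2 for simple linear regression
# (n observations minus 2 estimated parameters: intercept and slope).
n = len(regr_y)
print(stats.t.ppf(1 - 0.05 / 2, df=n - 2))  # ~1.96 for large n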
Repeat the same as above to test whether there is a linear relationship between Year and 3-Point Shot Efficiency:
Ho (Null Hypothesis): There is no relationship between Year and 3-Pointer Efficiency.
or
Ho: B1 = 0 (where B1 is the coefficient for Year in the population-level linear regression model for 3P%~Year)
Ha: B1 ≠ 0
# Get columns for the regression
regr_Xy = final_player_winpct_df[['Year', '3P%']]
# Drop all rows with NaN for players that never took any 3-point shots
regr_Xy = regr_Xy[regr_Xy['3P%'].isnull() == False]
# Only using the 'Year' column, so we need to reshape to fit scikit-learn's fit() function
regr_X = np.array(regr_Xy['Year']).reshape(-1,1)
# Response is '3P%', or the 3-Pointer Make % Per Game
regr_y = regr_Xy['3P%']
# Build a linear regression model using scikit-learn
regr = linear_model.LinearRegression()
# Calculating the parameters of our regression model using the fit() method
le_year_lin_model = regr.fit(X=regr_X, y=regr_y)
# Coefficient of year in our model
print("Coefficient of year in our model: ", le_year_lin_model.coef_)
# Intercept Value in our model
print("Intercept in our model: ", le_year_lin_model.intercept_)
# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))
Run again with statsmodels.api OLS to double-check and get hypothesis testing statistics like t-statistic and p-value
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'Year']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Results/Comments:
I reject the null hypothesis at a 95% confidence level (alpha=0.05, a two-tailed t-test) since we have a p-value of 0.000 for our predictor `Year` and 0 is not within our 95% Confidence Interval of [0.002, 0.004].
Since the coefficient of `Year` is positive, we can interpret it as: "On average, 3-point shot accuracy increases by 0.003 (or 0.3%) every year in the NBA from 2000-2019." As mentioned before, this yearly increase of 0.3% is HUGE in the context of this domain: accumulated over the roughly 18 year-to-year steps in our sample, it comes to about 5-6 percentage points, which matches the rise in median 3-Pt efficiency from about 32% to about 38% seen in the violin plots. For perspective, even an overall 3-pt shooting efficiency of 40% is considered GREAT by today's standards.
Therefore, our second hypothesis turned out to be true for our sample (only starters, only 2000-2019)!!! Starters in the NBA are getting more efficient at shooting 3s alongside attempting more of them.
Now, we will run a Multiple Linear Regression in each of the 4 eras to see if 3-point shooting stats (ultimately just 3P%, since the attempts columns get dropped during data prep) are still significant predictors of Team Win% when considering all of the other player statistics:
Basically, we want to see which stats in a player lead to a higher win-pct, and whether 3-point shooting plays a significant role in that equation.
We need to do a bit of data preparation first. We will separate our total dataset into 4 smaller datasets, one per era. We will also make some key transformations and standardizations to the stats, and trim down which stats we focus on. As before, we will walk through all of the transformations with era 1 first as an example, and then run everything for all 4 eras in a for-loop.
First, let's filter `final_player_winpct_df` down to just era 1 (2000-2004):
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]
# print the unique years seen in this df to make sure we only use the years we intended
print(era1_end_2004['Year'].unique())
We don't care about `Year`, `Player` (the player's name), `Tm`, `start_pct`, or the plotting helper `Year Short`, since these don't help us answer which player metrics lead to more wins.
era1_end_2004 = era1_end_2004.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)
era1_end_2004.head()
We will standardize tally stats like `PTS`, `TRB` (total rebounds), `AST` (assists), `STL` (steals), `BLK` (blocks), `TOV` (turnovers), and `PF` (personal fouls) by dividing them by minutes played (`MP`). This will help show how efficient each player is in each of these stats.
We don't want our Linear Regression model to just point out obvious facts like "scoring more points leads to more victories", so we will use efficiency metrics as our predictors instead.
Drop the `MP` column after this standardization, as it no longer serves any purpose.
Rename the columns we just standardized to indicate that these are now efficiency metrics. Ex: `PTS` --> `efPTS`.
# Standardize by converting these to efficiency metrics
era1_end_2004['PTS'] = era1_end_2004['PTS'] / era1_end_2004['MP']
era1_end_2004['TRB'] = era1_end_2004['TRB'] / era1_end_2004['MP']
era1_end_2004['AST'] = era1_end_2004['AST'] / era1_end_2004['MP']
era1_end_2004['STL'] = era1_end_2004['STL'] / era1_end_2004['MP']
era1_end_2004['BLK'] = era1_end_2004['BLK'] / era1_end_2004['MP']
era1_end_2004['TOV'] = era1_end_2004['TOV'] / era1_end_2004['MP']
era1_end_2004['PF'] = era1_end_2004['PF'] / era1_end_2004['MP']
# Drop `MP` column
era1_end_2004 = era1_end_2004.drop(labels=['MP'], axis=1)
# Rename columns
new_names = {'PTS': 'efPTS',
'TRB': 'efTRB',
'AST': 'efAST',
'STL': 'efSTL',
'BLK': 'efBLK',
'TOV': 'efTOV',
'PF': 'efPF'}
era1_end_2004 = era1_end_2004.rename(columns=new_names)
era1_end_2004.head()
Drop the `FGA` and `FG%` columns, as those columns can be derived from `3PA`, `3P%`, `2PA`, and `2P%`. `FGA` and `FG%` are the attempts and efficiency metrics for "Field Goals", which is just a term for non-Free-Throw scoring (3-pointers and 2-pointers).
Keeping these columns could cause severe multicollinearity, which adds noise to our final Linear Regression. Read more about multicollinearity here: https://www.statisticshowto.com/multicollinearity/. Being able to transform other predictor variables into `FGA` and `FG%` makes these variables the worst-case scenario for multicollinearity, since the relationship is exact, direct, and CAUSAL.
For the purposes of our study, we would also prefer to isolate `3PA` and `3P%` as much as possible, and remove predictors like `FGA` and `FG%` that incorporate 3-pointer metrics in their derivation.
era1_end_2004 = era1_end_2004.drop(labels=['FGA', 'FG%'], axis=1)
era1_end_2004.head()
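Before these columns disappear, here is a quick sanity check of that exact relationship (a sketch; per-game averages on basketball-reference are rounded to one decimal, so expect small nonzero differences):
# FGA should equal 3PA + 2PA for every player-row (up to rounding).
check = final_player_winpct_df[['FGA', '3PA', '2PA']].dropna()
print((check['FGA'] - (check['3PA'] + check['2PA'])).abs().max())  # expect ~0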
Drop `3PA`, `2PA`, and `FTA`, since we really only care about efficiency. The numbers of 3-pointers, 2-pointers, and Free Throws attempted depend heavily on the number of minutes a player is allowed to play, so we want to remove this factor to standardize things.
We still have the `%` (efficiency) columns for these 3 metrics.
era1_end_2004 = era1_end_2004.drop(labels=['3PA', '2PA', 'FTA'], axis=1)
era1_end_2004.head()
Last bit of Data Prep:
Drop all rows with NaN for `3P%`, `2P%`, or `FT%` (players that never took any 3-pt shots, 2-pt shots, or free throws).
print("Before: ", era1_end_2004.shape)
era1_end_2004 = era1_end_2004[era1_end_2004['3P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['2P%'].isnull() == False]
era1_end_2004 = era1_end_2004[era1_end_2004['FT%'].isnull() == False]
print("After: ", era1_end_2004.shape)
Now, we're ready to run our Multiple Linear Regression with `win_pct` as the response, and all other columns above as the predictors.
# Select all columns except `win_pct`, which is our response
regr_X = era1_end_2004.loc[:, era1_end_2004.columns != 'win_pct']
# Response is `win_pct`
regr_y = era1_end_2004['win_pct']
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant'] + list(era1_end_2004.columns[:-1])
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
We will analyze these results later!
Now, we will package the data prep and modeling code we just did in our example into convenient functions that will allow us to run this Linear Regression Model across all eras.
def dataprep_linreg(era):
era_local = era.copy()
# Drop useless columns
era_local = era_local.drop(labels=['Year', 'Player', 'Tm', 'start_pct', 'Year Short'], axis=1)
# Standardize by converting these to efficiency metrics
era_local['PTS'] = era_local['PTS'] / era_local['MP']
era_local['TRB'] = era_local['TRB'] / era_local['MP']
era_local['AST'] = era_local['AST'] / era_local['MP']
era_local['STL'] = era_local['STL'] / era_local['MP']
era_local['BLK'] = era_local['BLK'] / era_local['MP']
era_local['TOV'] = era_local['TOV'] / era_local['MP']
era_local['PF'] = era_local['PF'] / era_local['MP']
# Drop `MP` column
era_local = era_local.drop(labels=['MP'], axis=1)
# Rename columns
new_names = {'PTS': 'efPTS',
'TRB': 'efTRB',
'AST': 'efAST',
'STL': 'efSTL',
'BLK': 'efBLK',
'TOV': 'efTOV',
'PF': 'efPF'}
era_local = era_local.rename(columns=new_names)
# Drop FG-related columns to avoid multicollinearity
era_local = era_local.drop(labels=['FGA', 'FG%'], axis=1)
# Drop Attempt columns, as we care mostly about efficiency
era_local = era_local.drop(labels=['3PA', '2PA', 'FTA'], axis=1)
# Remove NaN values in columns where players took no attempts
era_local = era_local[era_local['3P%'].isnull() == False]
era_local = era_local[era_local['2P%'].isnull() == False]
era_local = era_local[era_local['FT%'].isnull() == False]
return pd.DataFrame(era_local)
def runlinreg(era):
# Select all columns except `win_pct`, which is our response
regr_X = era.loc[:, era.columns != 'win_pct']
# Response is `win_pct`
regr_y = era['win_pct']
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)
# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant'] + list(era.columns[:-1])
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)
summary_est = sm.OLS(summary_y, summary_X)
print(summary_est.fit().summary())
Finally, let's run the Linear Regression on all eras and see the output!
# The 2000-2004 era
era1_end_2004 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2001,2002,2003,2004])]
# The 2005-2009 era
era2_end_2009 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2005,2006,2007,2008, 2009])]
# The 2010-2014 era
era3_end_2014 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2010,2011,2012,2013,2014])]
# The 2015-2019 era
era4_end_2019 = final_player_winpct_df[final_player_winpct_df['Year'].isin([2015,2016,2017,2018,2019])]
# Put all era's dfs in a list
eras = [era1_end_2004, era2_end_2009, era3_end_2014, era4_end_2019]
for idx, era in enumerate(eras):
era = dataprep_linreg(era)
print("ERA", idx+1, "LINEAR REGRESSION RESULTS: ")
runlinreg(era)
print()
Results
By using our Hypothesis Testing methodology from before, I determined that the following predictors reject the null hypothesis and are significant in predicting `win_pct` when holding all other predictors constant. (I did this by selecting the predictors in each era's Linear Regression summary that had a p-value < 0.05):
Era 1 (2000-2004):
efPTS (p-val = 0.005)
efAST (p-val = 0.000)
2P% (p-val = 0.000)
efTOV (p-val = 0.000)
Era 2 (2005-2009):
efPTS (p-val = 0.001)
efAST (p-val = 0.000)
2P% (p-val = 0.000)
efBLK (p-val = 0.009)
efTOV (p-val = 0.000)
Era 3 (2010-2014):
efPTS (p-val = 0.028)
efAST (p-val = 0.017)
3P% (p-val = 0.002)
2P% (p-val = 0.000)
efSTL (p-val = 0.017)
efBLK (p-val = 0.001)
efTOV (p-val = 0.001)
efPF (p-val = 0.017)
Era 4 (2015-2019):
efPTS (p-val = 0.023)
efAST (p-val = 0.000)
3P% (p-val = 0.004)
2P% (p-val = 0.000)
efBLK (p-val = 0.016)
efTOV (p-val = 0.000)
Conclusion Preamble
Before addressing our questions from the beginning of this study, let's discuss how to interpret and compare the Linear Regression results from the 4 eras. It is hard to use coefficients to compare across eras because, although they are all efficiency metrics, the `3P%`, `2P%`, and `FT%` predictors use shots taken as the standardizer, while the other predictors use minutes played. Additionally, these coefficients are hard to interpret the way we did with our Simple Linear Regression with just one predictor. This is because all of these predictors are in the range 0-1 (0%-100%), and our standard tactic of describing the change in response per 1 unit of increase in a predictor does not make sense here, because 1 unit of increase is a whole 100%. (A more natural unit would be an increase of 0.01, i.e., one percentage point.)
Let's just keep things simple and use the sign of the coefficients (positive or negative) and the magnitude of the p-value (analogous to and derived from the t-statistic) to help make comparisons and observations. A smaller p-value indicates STRONGER evidence against the null hypothesis, so predictors with relatively smaller p-values are more likely to be genuinely significant than other predictors. Read more about the exact definition and interpretation of p-values here: https://www.statsdirect.com/help/basics/p_values.htm.
Also, it is important to remember that the `efPTS` predictor is not highly correlated with `3P%`, `2P%`, and `FT%`, since the former is standardized by minutes played and the latter are standardized by shot attempts. Obviously, minutes played and shot attempts are themselves related, but overall this is not as big of an issue as predictors that are directly derivable from other predictors. This claim can also be spot-checked directly, as sketched below.
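A quick sketch of that spot-check, using the era-4 data and the prep function defined above (a rough correlation reading, not a formal test):
# Correlation between per-minute scoring efficiency and the
# per-attempt shooting accuracies in the most recent era.
era_check = dataprep_linreg(era4_end_2019)
print(era_check[['efPTS', '3P%', '2P%', 'FT%']].corr())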
CONCLUSION
My hypothesis that 3-point efficiency would be a significant predictor of success in ALL ERAS was WRONG! As per the results above, 3-point efficiency only became a significant predictor in eras 3 & 4 (2010-2019). I think this result is indicative of the growing power of the 3-point shot. In eras 1 & 2 (2000-2009), the list of significant predictors was much smaller than in later eras, and was largely made up of predictors strictly related to putting the ball in the hoop to score more points and win more games (point-scoring efficiency, 2-pointer scoring efficiency, assist efficiency). In the more recent eras, however, 3-point efficiency became significant in its own right, even in the context of these staple predictors.
My other hypothesis, regarding the extent to which 3-point efficiency DOMINATES other predictors, is hard to evaluate with this approach. Although `3P%` does have a lower p-value compared to other predictors in the later eras (3 & 4), I still don't feel comfortable using that as a definitive measure. However, I can at least see that 3-point efficiency has become MORE dominant/important as time goes on, based on the observations for our other hypothesis question.
Things to Improve On
We need to find a better way to represent our predictors so that interpretation and comparisons are easier across eras.
Alternatively, we could find some other regression method that gives us output statistics on each predictor that can be used universally to compare against the output of other eras. One simple option along these lines is sketched below.
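For example, a hedged sketch (not something run in this study): z-score each predictor before fitting, so that every coefficient becomes a "beta weight", i.e., the change in win_pct per one standard deviation of that predictor, which is comparable across predictors and across eras.
# Sketch: standardized (beta) coefficients for one prepped era df.
def run_standardized_linreg(era):
    X = era.loc[:, era.columns != 'win_pct']
    y = era['win_pct']
    # z-score each predictor; each coefficient is now the change in
    # win_pct per 1 standard deviation increase in that predictor
    X_std = (X - X.mean()) / X.std()
    model = sm.OLS(y, sm.add_constant(X_std)).fit()
    print(model.summary())
# Example usage: run_standardized_linreg(dataprep_linreg(era4_end_2019))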
It would also help to get more domain knowledge on applying Data Science to Basketball Statistics. The following YouTube video and channel dive into the intersection of basketball and analytics, including some mistakes to avoid when using Per-Game statistics. I tried my best to follow these guidelines for this dataset.
https://www.youtube.com/watch?v=pznoCFs7XZg&ab_channel=ThinkingBasketball