Project Description¶

Major League Baseball (MLB) uses a technology called Statcast to collect detailed data about the exact location and movements of baseballs and players during games. In this project, we will use Statcast data to compare the home run performance of two of the biggest stars who played for the New York Yankees: Aaron Judge and Giancarlo Stanton. The data for this project comes from Baseball Savant.

Aaron Judge is one of the largest players in MLB. He stands 6 feet 7 inches (2.01 meters) tall and weighs 282 pounds (128 kilograms). He also hit one of the hardest home runs ever recorded, and we know this because of the precise measurements from Statcast.

Statcast is a cutting-edge tracking system introduced in 2015 at all 30 major league ballparks. It uses high-resolution cameras and radar to measure the exact position and movement of baseballs and players. This data has changed how baseball teams analyze the game, creating a new focus on data-driven decisions. Teams compete by hiring analysts to gain advantages over their opponents.

In this project, we will clean, analyze, and visualize Statcast data to compare Aaron Judge and his teammate Giancarlo Stanton. Both players are known for hitting many home runs. In 2017, Stanton hit 59 home runs and Judge hit 52, the two highest totals in the majors by a large margin (the third-place player had 45 home runs).

Although Stanton and Judge share some similarities, they are also different in many ways. This project will explore how their home run hitting styles and performance compare.

The Data¶

We will work with two CSV files: judge.csv and stanton.csv. These files contain Statcast data from 2015 to 2017. Each row in the files represents one pitch thrown to the batter.

Custom Functions¶

We have created two custom functions to help visualize home run zones:

  • assign_x_coord: This function assigns an x-coordinate based on Statcast’s strike zone numbering system.
  • assign_y_coord: This function assigns a y-coordinate based on Statcast’s strike zone numbering system.

These functions will help us map pitches to specific locations in the strike zone, making it easier to compare how Judge and Stanton hit home runs.
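
The zone-to-coordinate mapping can also be written as arithmetic on the zone number instead of row-by-row lookups. The sketch below is an illustrative alternative, not the notebook's own method: the helper name zone_to_xy and the demo values are hypothetical, and it assumes zones 1-9 are numbered left to right, top to bottom, as in the two custom functions.

```python
import pandas as pd

def zone_to_xy(zone: pd.Series) -> pd.DataFrame:
    """Map zones 1-9 to (zone_x, zone_y) grid coordinates.

    Hypothetical helper. Zones outside 1-9 (and missing values) become NaN,
    mirroring how the row-wise functions return nothing for them.
    """
    z = pd.to_numeric(zone, errors="coerce")
    in_zone = z.between(1, 9)
    # Column: 1 = left third, 2 = middle third, 3 = right third
    x = ((z - 1) % 3 + 1).where(in_zone)
    # Row: 3 = upper third, 2 = middle third, 1 = lower third
    y = (3 - (z - 1) // 3).where(in_zone)
    return pd.DataFrame({"zone_x": x, "zone_y": y})

# Made-up zone values for illustration only
demo = pd.Series([1.0, 5.0, 9.0, 14.0], name="zone")
coords = zone_to_xy(demo)
print(coords)
```

Because it works on the whole column at once, this form avoids apply(..., axis=1), which may matter on larger pitch-level tables.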

In [1]:
# Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Aaron Judge's Statcast data
judge = pd.read_csv('judge.csv')

# Load Giancarlo Stanton's Statcast data
stanton = pd.read_csv('stanton.csv')

# Display all columns (pandas will collapse some columns if we don't set this option)
pd.set_option('display.max_columns', None)

# Custom Functions
def assign_x_coord(row):
    """
    Assigns an x-coordinate to Statcast's strike zone numbers. Zones 11, 12, 13,
    and 14 are ignored for plotting simplicity.
    """
    # Left third of strike zone
    if row.zone in [1, 4, 7]:
        return 1
    # Middle third of strike zone
    if row.zone in [2, 5, 8]:
        return 2
    # Right third of strike zone
    if row.zone in [3, 6, 9]:
        return 3
    
def assign_y_coord(row):
    """
    Assigns a y-coordinate to Statcast's strike zone numbers. Zones 11, 12, 13,
    and 14 are ignored for plotting simplicity.
    """
    # Upper third of strike zone
    if row.zone in [1, 2, 3]:
        return 3
    # Middle third of strike zone
    if row.zone in [4, 5, 6]:
        return 2
    # Lower third of strike zone
    if row.zone in [7, 8, 9]:
        return 1
    
# Display the last five rows of the Aaron Judge file
judge.tail()
Out[1]:
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events description spin_dir spin_rate_deprecated break_angle_deprecated break_length_deprecated zone des game_type stand p_throws home_team away_team type hit_location bb_type balls strikes game_year pfx_x pfx_z plate_x plate_z on_3b on_2b on_1b outs_when_up inning inning_topbot hc_x hc_y tfs_deprecated tfs_zulu_deprecated pos2_person_id umpire sv_id vx0 vy0 vz0 ax ay az sz_top sz_bot hit_distance_sc launch_speed launch_angle effective_speed release_spin_rate release_extension game_pk pos1_person_id pos2_person_id.1 pos3_person_id pos4_person_id pos5_person_id pos6_person_id pos7_person_id pos8_person_id pos9_person_id release_pos_y estimated_ba_using_speedangle estimated_woba_using_speedangle woba_value woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number
3431 CH 2016-08-13 85.6 -1.9659 5.9113 Aaron Judge 592450 542882 NaN ball NaN NaN NaN NaN 14.0 NaN R R R NYY TB B NaN NaN 0 0 2016 -0.379108 0.370567 0.739 1.442 NaN NaN NaN 0 5 Bot NaN NaN NaN NaN 571912.0 NaN 160813_144259 6.960 -124.371 -4.756 -2.821 23.634 -30.220 3.93 1.82 NaN NaN NaN 84.459 1552.0 5.683 448611 542882.0 571912.0 543543.0 523253.0 446334.0 622110.0 545338.0 595281.0 543484.0 54.8144 0.00 0.000 NaN NaN NaN NaN NaN 36 1
3432 CH 2016-08-13 87.6 -1.9318 5.9349 Aaron Judge 592450 542882 home_run hit_into_play_score NaN NaN NaN NaN 4.0 Aaron Judge homers (1) on a fly ball to center... R R R NYY TB X NaN fly_ball 1 2 2016 -0.295608 0.320400 -0.419 3.273 NaN NaN NaN 2 2 Bot 130.45 14.58 NaN NaN 571912.0 NaN 160813_135833 4.287 -127.452 -0.882 -1.972 24.694 -30.705 4.01 1.82 446.0 108.8 27.410 86.412 1947.0 5.691 448611 542882.0 571912.0 543543.0 523253.0 446334.0 622110.0 545338.0 595281.0 543484.0 54.8064 0.98 1.937 2.0 1.0 0.0 3.0 6.0 14 4
3433 CH 2016-08-13 87.2 -2.0285 5.8656 Aaron Judge 592450 542882 NaN ball NaN NaN NaN NaN 14.0 NaN R R R NYY TB B NaN NaN 0 2 2016 -0.668575 0.198567 0.561 0.960 NaN NaN NaN 2 2 Bot NaN NaN NaN NaN 571912.0 NaN 160813_135815 7.491 -126.665 -5.862 -6.393 21.952 -32.121 4.01 1.82 NaN NaN NaN 86.368 1761.0 5.721 448611 542882.0 571912.0 543543.0 523253.0 446334.0 622110.0 545338.0 595281.0 543484.0 54.7770 0.00 0.000 NaN NaN NaN NaN NaN 14 3
3434 CU 2016-08-13 79.7 -1.7108 6.1926 Aaron Judge 592450 542882 NaN foul NaN NaN NaN NaN 4.0 NaN R R R NYY TB S NaN NaN 0 1 2016 0.397442 -0.614133 -0.803 2.742 NaN NaN NaN 2 2 Bot NaN NaN NaN NaN 571912.0 NaN 160813_135752 1.254 -116.062 0.439 5.184 21.328 -39.866 4.01 1.82 9.0 55.8 -24.973 77.723 2640.0 5.022 448611 542882.0 571912.0 543543.0 523253.0 446334.0 622110.0 545338.0 595281.0 543484.0 55.4756 0.00 0.000 NaN NaN NaN NaN 1.0 14 2
3435 FF 2016-08-13 93.2 -1.8476 6.0063 Aaron Judge 592450 542882 NaN called_strike NaN NaN NaN NaN 8.0 NaN R R R NYY TB S NaN NaN 0 0 2016 -0.823050 1.623300 -0.273 2.471 NaN NaN NaN 2 2 Bot NaN NaN NaN NaN 571912.0 NaN 160813_135736 5.994 -135.497 -6.736 -9.360 26.782 -13.446 4.01 1.82 NaN NaN NaN 92.696 2271.0 6.068 448611 542882.0 571912.0 543543.0 523253.0 446334.0 622110.0 545338.0 595281.0 543484.0 54.4299 0.00 0.000 NaN NaN NaN NaN NaN 14 1

How many of each event did Judge and Stanton have in 2017?¶

In [2]:
# Count each 2017 event type for Aaron Judge (strikeouts, walks, and other outcomes, not only batted balls)
judge_events_2017 = judge.loc[judge['game_year'] == 2017].events.value_counts()
print("Aaron Judge event totals, 2017:")
print(judge_events_2017)

# Count each 2017 event type for Giancarlo Stanton
stanton_events_2017 = stanton.loc[stanton['game_year'] == 2017].events.value_counts()
print("\nGiancarlo Stanton event totals, 2017:")
print(stanton_events_2017)
Aaron Judge event totals, 2017:
events
strikeout                    207
field_out                    146
walk                         116
single                        75
home_run                      52
double                        24
grounded_into_double_play     15
force_out                     11
intent_walk                   11
hit_by_pitch                   5
sac_fly                        4
fielders_choice_out            4
field_error                    4
triple                         3
strikeout_double_play          1
Name: count, dtype: int64

Giancarlo Stanton event totals, 2017:
events
field_out                    239
strikeout                    161
single                        77
walk                          72
home_run                      59
double                        32
intent_walk                   13
grounded_into_double_play     13
force_out                      7
hit_by_pitch                   7
field_error                    5
sac_fly                        3
fielders_choice_out            2
strikeout_double_play          2
pickoff_1b                     1
Name: count, dtype: int64
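
Raw totals are hard to compare because the two players recorded different numbers of events. As a rough sketch, the counts printed above can be turned into a home run share. The numbers below are copied from that output; the helper name event_share is hypothetical, and the denominator is all recorded events (which for Stanton includes one pickoff), not official plate appearances.

```python
import pandas as pd

# Counts copied from the 2017 value_counts() output above
judge_2017 = pd.Series({
    "strikeout": 207, "field_out": 146, "walk": 116, "single": 75,
    "home_run": 52, "double": 24, "grounded_into_double_play": 15,
    "force_out": 11, "intent_walk": 11, "hit_by_pitch": 5, "sac_fly": 4,
    "fielders_choice_out": 4, "field_error": 4, "triple": 3,
    "strikeout_double_play": 1,
})
stanton_2017 = pd.Series({
    "field_out": 239, "strikeout": 161, "single": 77, "walk": 72,
    "home_run": 59, "double": 32, "intent_walk": 13,
    "grounded_into_double_play": 13, "force_out": 7, "hit_by_pitch": 7,
    "field_error": 5, "sac_fly": 3, "fielders_choice_out": 2,
    "strikeout_double_play": 2, "pickoff_1b": 1,
})

def event_share(counts: pd.Series, event: str) -> float:
    """Fraction of all recorded events that are of the given type (hypothetical helper)."""
    return counts[event] / counts.sum()

judge_hr_share = event_share(judge_2017, "home_run")
stanton_hr_share = event_share(stanton_2017, "home_run")
print(f"Judge:   {judge_hr_share:.1%} of events were home runs")
print(f"Stanton: {stanton_hr_share:.1%} of events were home runs")
```

On these counts the shares come out to roughly 7.7% for Judge and 8.5% for Stanton. With the notebook's own objects, the same helper could be called as event_share(judge_events_2017, "home_run").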

Which player hits home runs slightly lower and slightly harder?¶

In [5]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
In [6]:
# Filter to include home runs only
judge_hr = judge.loc[judge['events'] == 'home_run']
stanton_hr = stanton.loc[stanton['events'] == 'home_run']

# Create a figure with two KDE plots
fig1, ax1 = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(14, 6))

sns.kdeplot(
    x=judge_hr.launch_angle,
    y=judge_hr.launch_speed,
    cmap="Blues",
    fill=True,
    thresh=0.05,
    ax=ax1[0]
).set_title('Aaron Judge\nHome Runs, 2015-2017')

sns.kdeplot(
    x=stanton_hr.launch_angle,
    y=stanton_hr.launch_speed,
    cmap="Blues",
    fill=True,
    thresh=0.05,
    ax=ax1[1]
).set_title('Giancarlo Stanton\nHome Runs, 2015-2017')

plt.tight_layout()
plt.show()
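[Figure: side-by-side KDE plots of launch_angle (x-axis) versus launch_speed (y-axis), titled "Aaron Judge, Home Runs, 2015-2017" and "Giancarlo Stanton, Home Runs, 2015-2017"]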
In [7]:
player_hr = "Stanton"
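
Reading "slightly lower and slightly harder" off two density plots is a judgment call, so a numeric check can help. The sketch below computes per-player medians with a groupby. The helper name median_by_player is hypothetical, and the rows are made-up stand-ins (labeled "Player A" and "Player B"), not values from judge.csv or stanton.csv.

```python
import pandas as pd

# Made-up stand-in rows for illustration only; not real Statcast values
toy_hr = pd.DataFrame({
    "player_name": ["Player A"] * 3 + ["Player B"] * 3,
    "launch_angle": [24.0, 28.0, 32.0, 22.0, 25.0, 27.0],
    "launch_speed": [105.0, 108.0, 111.0, 107.0, 110.0, 113.0],
})

def median_by_player(hr: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Median of the given numeric columns for each player (hypothetical helper)."""
    return hr.groupby("player_name")[columns].median()

summary = median_by_player(toy_hr, ["launch_angle", "launch_speed"])
print(summary)
```

With the notebook's data, the call would be median_by_player(pd.concat([judge_hr, stanton_hr]), ["launch_angle", "launch_speed"]). The same helper applies to any other numeric column, such as release_speed.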

Compare the pitch velocity (release_speed) of the pitches each player hit for home runs. Which player has the higher median?¶

In [8]:
# Combine the Judge and Stanton home run DataFrames
judge_stanton_hr = pd.concat([judge_hr, stanton_hr])

# Create a boxplot of pitch velocity on home runs, one box per player
sns.boxplot(x='player_name', y='release_speed', color='tab:blue', data=judge_stanton_hr).set_title('Home Runs, 2015-2017')
Out[8]:
Text(0.5, 1.0, 'Home Runs, 2015-2017')
[Figure: boxplot of release_speed by player_name, titled "Home Runs, 2015-2017"]
In [9]:
player_fast = "Judge"

Create a 2D histogram for each player that visualizes the home run strike zones, ignoring zones 11, 12, 13, and 14 for simplicity.¶

In [11]:
# Keep only pitches in zones 1-9 (zones 11, 12, 13, and 14 are ignored for plotting simplicity);
# copy the filtered rows so new columns can be added without touching judge_hr
judge_strike_hr = judge_hr.loc[judge_hr.zone <= 9].copy()

# Assign Cartesian coordinates to pitches in the strike zone for Judge home runs
judge_strike_hr['zone_x'] = judge_strike_hr.apply(assign_x_coord, axis=1)
judge_strike_hr['zone_y'] = judge_strike_hr.apply(assign_y_coord, axis=1)

# Plot Judge's home run zone as a 2D histogram with a colorbar
plt.hist2d(judge_strike_hr.zone_x, judge_strike_hr.zone_y, bins=3, cmap='Blues')
plt.title('Aaron Judge Home Runs on\n Pitches in the Strike Zone, 2015-2017')
plt.gca().get_xaxis().set_visible(False)
plt.gca().get_yaxis().set_visible(False)
cb = plt.colorbar()
cb.set_label('Counts in Bin')
[Figure: 3x3 2D histogram titled "Aaron Judge Home Runs on Pitches in the Strike Zone, 2015-2017", with a colorbar labeled "Counts in Bin"]
In [12]:
# Keep only pitches in zones 1-9 (zones 11, 12, 13, and 14 are ignored for plotting simplicity);
# copy the filtered rows so new columns can be added without touching stanton_hr
stanton_strike_hr = stanton_hr.loc[stanton_hr.zone <= 9].copy()

# Assign Cartesian coordinates to pitches in the strike zone for Stanton home runs
stanton_strike_hr['zone_x'] = stanton_strike_hr.apply(assign_x_coord, axis=1)
stanton_strike_hr['zone_y'] = stanton_strike_hr.apply(assign_y_coord, axis=1)

# Plot Stanton's home run zone as a 2D histogram with a colorbar
plt.hist2d(stanton_strike_hr.zone_x, stanton_strike_hr.zone_y, bins=3, cmap='Blues')
plt.title('Giancarlo Stanton Home Runs on\n Pitches in the Strike Zone, 2015-2017')
plt.gca().get_xaxis().set_visible(False)
plt.gca().get_yaxis().set_visible(False)
cb = plt.colorbar()
cb.set_label('Counts in Bin')
[Figure: 3x3 2D histogram titled "Giancarlo Stanton Home Runs on Pitches in the Strike Zone, 2015-2017", with a colorbar labeled "Counts in Bin"]
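
The colors in a 2D histogram show relative density, but the exact counts behind each cell can be useful for comparison. One way to get them, sketched below, is to count pitches per zone and reshape the nine counts into a 3x3 array. The helper name zone_count_grid is hypothetical and the zone values are made up; it assumes zones 1-9 run left to right, top to bottom, so row 0 of the array corresponds to the top of the strike zone.

```python
import numpy as np
import pandas as pd

def zone_count_grid(zone: pd.Series) -> np.ndarray:
    """Return a 3x3 array of counts for zones 1-9 (hypothetical helper).

    Row 0 holds zones 1-3 (upper third), row 2 holds zones 7-9 (lower third).
    Zones outside 1-9 and missing values are dropped.
    """
    counts = (
        zone.dropna()
        .astype(int)
        .value_counts()
        .reindex(range(1, 10), fill_value=0)  # fixed order 1..9, zeros for empty zones
    )
    return counts.to_numpy().reshape(3, 3)

# Made-up zone values for illustration only
toy_zones = pd.Series([5.0, 5.0, 4.0, 8.0, 2.0, 5.0, 14.0])
grid = zone_count_grid(toy_zones)
print(grid)
```

With the notebook's data, zone_count_grid(judge_hr.zone) and zone_count_grid(stanton_hr.zone) would give the two grids to set side by side.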