Project Description
In this project, we analyze data on the biggest event in American football: the Super Bowl. We apply data manipulation and visualization techniques to real-world data to explore game outcomes, TV viewership, advertising costs, and halftime performances, and to uncover patterns and trends in each.
The data was collected from Wikipedia and cleaned. It consists of three CSV files covering the 52 Super Bowls played through 2018. By analyzing these datasets, we aim to understand how viewership has changed across games, how lopsided the final scores have been, and who has performed at the halftime shows.
The Data
We are provided with three datasets. Below is a summary of each:
1. halftime_musicians.csv
This dataset contains information about the artists who performed at the halftime shows during various Super Bowl games.
Column | Description |
---|---|
super_bowl | The Super Bowl number (for example, 52 stands for Super Bowl LII). |
musician | Name of the musician or music group that performed during the halftime show. |
num_songs | Number of songs performed during the halftime show. |
2. super_bowls.csv
This dataset includes detailed information about each Super Bowl game, such as:
- The date and location of the game
- The teams that played
- The final scores
- The difference in points between the winning and losing team (difference_pts)
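Because difference_pts is derived from the two final scores, it can be checked against them before it is used. The following is a minimal sketch of such a check, assuming the column names winning_pts, losing_pts, combined_pts, and difference_pts that appear in the super_bowls.csv preview later in this notebook. The five rows are copied from that preview, and the helper name check_point_columns is hypothetical.

```python
import pandas as pd

# Super Bowls 52 through 48, copied from the super_bowls.csv preview in this notebook.
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "winning_pts": [41, 34, 24, 28, 43],
    "losing_pts": [33, 28, 10, 24, 8],
    "combined_pts": [74, 62, 34, 52, 51],
    "difference_pts": [8, 6, 14, 4, 35],
})

def check_point_columns(df):
    """Return the rows whose derived point columns disagree with the raw scores."""
    diff_ok = df["difference_pts"] == df["winning_pts"] - df["losing_pts"]
    sum_ok = df["combined_pts"] == df["winning_pts"] + df["losing_pts"]
    return df[~(diff_ok & sum_ok)]

bad_rows = check_point_columns(sample)
print(len(bad_rows))  # 0: all five sample rows are consistent
```

Running the same function on the full super_bowls DataFrame would flag any row where the cleaned columns drifted from the scores.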
3. tv.csv
This dataset provides television-related information for each Super Bowl, including:
- Viewership numbers
- Household ratings
- Cost of advertisements
We will now begin our analysis to discover what makes the Super Bowl such a major event from both a sports and media perspective.
# Import required libs
import pandas as pd
from matplotlib import pyplot as plt
# Load the CSV data
tv = pd.read_csv("tv.csv")
# Display the data
tv.head()
| | super_bowl | network | avg_us_viewers | total_us_viewers | rating_household | share_household | rating_18_49 | share_18_49 | ad_cost |
---|---|---|---|---|---|---|---|---|---|
0 | 52 | NBC | 103390000 | NaN | 43.1 | 68 | 33.4 | 78.0 | 5000000 |
1 | 51 | Fox | 111319000 | 172000000.0 | 45.3 | 73 | 37.1 | 79.0 | 5000000 |
2 | 50 | CBS | 111864000 | 167000000.0 | 46.6 | 72 | 37.7 | 79.0 | 5000000 |
3 | 49 | NBC | 114442000 | 168000000.0 | 47.5 | 71 | 39.1 | 79.0 | 4500000 |
4 | 48 | Fox | 112191000 | 167000000.0 | 46.7 | 69 | 39.3 | 77.0 | 4000000 |
Has TV viewership increased over time?
# Plot average US viewership against Super Bowl number
plt.plot(tv.super_bowl, tv.avg_us_viewers)
plt.title('Average Number of US Viewers')
plt.xlabel('Super Bowl')
plt.ylabel('Average US Viewers')
plt.show()
# Conclusion read off the plot above
viewership_increased = True
print("Super Bowl viewership increased over time.")
Super Bowl viewership increased over time.
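The viewership_increased flag above is set by eye from the plot. One way to back it with a number is to fit a straight line through avg_us_viewers and look at the sign of the slope. This is a minimal sketch, not the notebook's original method: the helper name viewership_slope is hypothetical, and the five rows are the ones shown in the tv.head() preview, so they cover only Super Bowls 48-52.

```python
import numpy as np
import pandas as pd

# Super Bowls 52 through 48, copied from the tv.head() preview above.
recent = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "avg_us_viewers": [103390000, 111319000, 111864000, 114442000, 112191000],
})

def viewership_slope(df):
    """Least-squares slope of avg_us_viewers against Super Bowl number."""
    slope, _intercept = np.polyfit(df["super_bowl"], df["avg_us_viewers"], deg=1)
    return float(slope)

# A positive slope means viewership rose over the games in df.
increased_recently = viewership_slope(recent) > 0
print(increased_recently)  # False: these five games alone trend downward
```

Over this short window the slope is negative, so the answer depends on the period considered. Calling viewership_slope(tv) on the full dataset is what would support the long-run conclusion.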
How many games finished with a point difference greater than 40?
# Load the data
super_bowls = pd.read_csv("super_bowls.csv")
# Display the Super Bowls data
super_bowls.head()
| | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-02-04 | 52 | U.S. Bank Stadium | Minneapolis | Minnesota | 67612 | Philadelphia Eagles | 41 | Nick Foles | NaN | Doug Pederson | New England Patriots | 33 | Tom Brady | NaN | Bill Belichick | 74 | 8 |
1 | 2017-02-05 | 51 | NRG Stadium | Houston | Texas | 70807 | New England Patriots | 34 | Tom Brady | NaN | Bill Belichick | Atlanta Falcons | 28 | Matt Ryan | NaN | Dan Quinn | 62 | 6 |
2 | 2016-02-07 | 50 | Levi's Stadium | Santa Clara | California | 71088 | Denver Broncos | 24 | Peyton Manning | NaN | Gary Kubiak | Carolina Panthers | 10 | Cam Newton | NaN | Ron Rivera | 34 | 14 |
3 | 2015-02-01 | 49 | University of Phoenix Stadium | Glendale | Arizona | 70288 | New England Patriots | 28 | Tom Brady | NaN | Bill Belichick | Seattle Seahawks | 24 | Russell Wilson | NaN | Pete Carroll | 52 | 4 |
4 | 2014-02-02 | 48 | MetLife Stadium | East Rutherford | New Jersey | 82529 | Seattle Seahawks | 43 | Russell Wilson | NaN | Pete Carroll | Denver Broncos | 8 | Peyton Manning | NaN | John Fox | 51 | 35 |
# Count the games with a point difference greater than 40
difference = len(super_bowls[super_bowls["difference_pts"] > 40])
print(f"Number of games with a point difference over 40: {difference}")
Number of games with a point difference over 40: 1
# We can also plot a histogram of point differences to visualize the result
plt.hist(super_bowls.difference_pts)
plt.xlabel('Point Difference')
plt.ylabel('Number of Super Bowls')
plt.show()
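The same count can be written as a small reusable function that sums a boolean mask, which makes it easy to try other thresholds. This is a sketch under the assumption that the column is named difference_pts as in the preview above. The helper name count_margin_over is hypothetical, and the sample rows are the five games shown in super_bowls.head().

```python
import pandas as pd

# difference_pts for Super Bowls 52 through 48, copied from the preview above.
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "difference_pts": [8, 6, 14, 4, 35],
})

def count_margin_over(df, threshold):
    """Count games whose point difference is strictly greater than threshold."""
    # Each True in the mask counts as 1 when summed.
    return int((df["difference_pts"] > threshold).sum())

print(count_margin_over(sample, 40))  # 0: none of these five games
print(count_margin_over(sample, 10))  # 2: Super Bowls 50 and 48
```

Applied to the full super_bowls DataFrame with a threshold of 40, this should agree with the count printed earlier.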
Who performed the most songs in Super Bowl halftime shows?
# Load the CSV data
halftime_musicians = pd.read_csv("halftime_musicians.csv")
# Display the data
halftime_musicians.head()
| | super_bowl | musician | num_songs |
---|---|---|---|
0 | 52 | Justin Timberlake | 11.0 |
1 | 52 | University of Minnesota Marching Band | 1.0 |
2 | 51 | Lady Gaga | 7.0 |
3 | 50 | Coldplay | 6.0 |
4 | 50 | Beyoncé | 3.0 |
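# Sum the songs performed by each musician across all halftime shows.
# Note: the super_bowl column in the result is a sum of Super Bowl numbers
# and is not meaningful; only num_songs matters here.
halftime_appearances = halftime_musicians.groupby('musician').sum(numeric_only=True)
halftime_appearances = halftime_appearances.sort_values('num_songs', ascending=False)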
halftime_appearances
| | super_bowl | num_songs |
---|---|---|
musician | | |
Justin Timberlake | 90 | 12.0 |
Beyoncé | 97 | 10.0 |
Diana Ross | 30 | 10.0 |
Grambling State University Tiger Marching Band | 79 | 9.0 |
Bruno Mars | 98 | 9.0 |
... | ... | ... |
Doc Severinsen | 4 | 0.0 |
Southeast Missouri State Marching Band | 5 | 0.0 |
Ella Fitzgerald | 6 | 0.0 |
San Diego State University Marching Aztecs | 22 | 0.0 |
Judy Mallett | 8 | 0.0 |
111 rows × 2 columns
most_songs = "Justin Timberlake"
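The answer above is typed in by hand from the sorted table. A minimal sketch of deriving it from the data instead is shown below, using idxmax on the per-musician totals. The sample rows are the five shown in halftime_musicians.head(), so the total for Justin Timberlake here is 11.0 rather than the 12.0 found in the full dataset.

```python
import pandas as pd

# First five rows of halftime_musicians.csv, copied from the preview above.
sample = pd.DataFrame({
    "super_bowl": [52, 52, 51, 50, 50],
    "musician": [
        "Justin Timberlake",
        "University of Minnesota Marching Band",
        "Lady Gaga",
        "Coldplay",
        "Beyoncé",
    ],
    "num_songs": [11.0, 1.0, 7.0, 6.0, 3.0],
})

# Total songs per musician, then the index label of the largest total.
songs_per_musician = sample.groupby("musician")["num_songs"].sum()
most_songs = songs_per_musician.idxmax()
print(most_songs)  # Justin Timberlake
```

Selecting the num_songs column before summing also avoids adding up Super Bowl numbers. If two musicians tie for the maximum, idxmax returns only the first one it encounters.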