Project Description¶

This project explores the causes of flight delays and cancellations through data analysis. We will examine aviation data to identify the key factors and airlines most affected by disruptions in the Pacific Northwest by applying data wrangling, data visualization, and exploratory data analysis techniques.

A prominent airline company in the Pacific Northwest has collected extensive data on flights and weather conditions. They seek insights into what influences departure delays and cancellations, which will help improve both operational efficiency and passenger experience. As data analysts, we will carry out a comprehensive analysis to support their goals.

The aviation industry is complex, with various factors impacting flight operations. To ensure our findings are relevant and actionable, we will focus on data from the pnwflights2022 dataset provided by the ModernDive team. This dataset includes flight records from the first half of 2022 for two major airports in the region: Seattle-Tacoma International Airport (SEA) and Portland International Airport (PDX).

The data comes in two CSV files:

flights2022.csv¶

This file contains information about each flight, including:

Variable Description
dep_time Departure time (in hhmm format). A missing value indicates a cancellation
dep_delay Departure delay in minutes (negative values indicate early departures)
origin Origin airport code (IATA format)
airline Name of the airline or carrier
dest Destination airport code (IATA format)

flights_weather2022.csv¶

This file includes the same flight data as above, along with weather conditions at the time of departure:

Variable Description
visib Visibility in miles
wind_gust Wind gust speed in miles per hour

Objective¶

Our goal is to uncover patterns and insights related to flight delays and cancellations in the Pacific Northwest. We will analyze which variables, including weather conditions and airline-specific patterns, most significantly affect flight performance.

Which airlines and routes are most affected by flight delays, and what impact does wind have on departure delays?¶

In [1]:
# Load required libraries
import pandas as pd
import matplotlib.pyplot as plt

# load the dataset
flights2022 = flights2022 = pd.read_csv("flights2022.csv")
flights_weather2022 = pd.read_csv("flights_weather2022.csv")

# Create route column
flights2022["route"] = flights2022["origin"] + "-" + flights2022["dest"]

flights2022.head()
Out[1]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... tailnum origin dest air_time distance hour minute time_hour airline route
0 2022 1 1 1.0 2359 2.0 604.0 618 -14.0 UA ... N405UA SEA IAH 221.0 1874 23 59 2022-01-01T23:00:00Z United Air Lines Inc. SEA-IAH
1 2022 1 1 1.0 2250 71.0 242.0 142 60.0 AS ... N265AK SEA FAI 193.0 1533 22 50 2022-01-01T22:00:00Z Alaska Airlines Inc. SEA-FAI
2 2022 1 1 10.0 2355 15.0 759.0 730 29.0 AS ... N274AK SEA ATL 261.0 2182 23 55 2022-01-01T23:00:00Z Alaska Airlines Inc. SEA-ATL
3 2022 1 1 25.0 2350 35.0 606.0 550 16.0 AS ... N281AK SEA ORD 193.0 1721 23 50 2022-01-01T23:00:00Z Alaska Airlines Inc. SEA-ORD
4 2022 1 1 35.0 2349 46.0 616.0 545 31.0 UA ... N426UA PDX ORD 196.0 1739 23 49 2022-01-01T23:00:00Z United Air Lines Inc. PDX-ORD

5 rows × 21 columns

In [2]:
flights_weather2022.head()
Out[2]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... route temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
0 2022 1 1 1.0 2359 2.0 604.0 618 -14.0 UA ... SEA-IAH 33.0 23.0 66.06 160.0 8.05546 9.270062 0.0 1022.9 10.0
1 2022 1 1 1.0 2250 71.0 242.0 142 60.0 AS ... SEA-FAI 32.0 23.0 69.04 170.0 9.20624 10.594357 0.0 1023.4 10.0
2 2022 1 1 10.0 2355 15.0 759.0 730 29.0 AS ... SEA-ATL 33.0 23.0 66.06 160.0 8.05546 9.270062 0.0 1022.9 10.0
3 2022 1 1 25.0 2350 35.0 606.0 550 16.0 AS ... SEA-ORD 33.0 23.0 66.06 160.0 8.05546 9.270062 0.0 1022.9 10.0
4 2022 1 1 35.0 2349 46.0 616.0 545 31.0 UA ... PDX-ORD 33.0 19.0 55.75 120.0 6.90468 7.945768 0.0 1025.1 10.0

5 rows × 29 columns

Visualize to show the top 9 highest number of cancellations by route and the top 9 highest average departure delays by airline.¶

Analysis by route¶
In [18]:
# Calculate mean departure delay and number of canceled flights for each unique flight route
routes_delays_cancels = flights2022.groupby("route").agg(
    mean_dep_delay=("dep_delay", "mean"),
    total_cancellations=("dep_time", lambda x: x.isna().sum())
).reset_index()


# Identify routes with the highest mean departure delays
top_routes_by_delay = routes_delays_cancels.sort_values("mean_dep_delay", ascending=False).head(9)
top_routes_by_delay.head()
Out[18]:
route mean_dep_delay total_cancellations
13 PDX-DSM 35.782609 0
19 PDX-GRR 35.739130 2
17 PDX-FLL 30.893617 3
109 SEA-MIA 29.916667 4
71 SEA-CLT 27.313199 14
In [19]:
# Identify routes with the highest number of cancellations
top_routes_by_cancellations = routes_delays_cancels.sort_values("total_cancellations", ascending=False).head(9)
top_routes_by_cancellations.head()
Out[19]:
route mean_dep_delay total_cancellations
103 SEA-LAX 7.195969 93
58 SEA-ANC 9.316991 78
137 SEA-SFO 7.054403 73
102 SEA-LAS 8.943503 73
121 SEA-PDX 7.730982 71
In [20]:
# Create a bar graph for the highest number of cancellations by route
top9_route_cancels_bar, ax = plt.subplots()
ax.bar(top_routes_by_cancellations["route"], top_routes_by_cancellations["total_cancellations"])
ax.set_xlabel("Route")
ax.set_ylabel("Total Cancellations")
ax.set_title("Routes with Highest Number of Cancellations")
ax.set_xticklabels(top_routes_by_cancellations["route"], rotation=90)
plt.show()
plt.close()
C:\Users\newbe\AppData\Local\Temp\ipykernel_2412\3812851333.py:7: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(top_routes_by_cancellations["route"], rotation=90)
No description has been provided for this image
Analysis by airline¶
In [25]:
# Find mean departure delays by airline
airlines_delays_cancels = flights2022.groupby("airline").agg(
    mean_dep_delay=("dep_delay", "mean"),
    total_cancellations=("dep_time", lambda x: x.isna().sum())
).reset_index()

# Identify airlines with the highest mean departure delay
top_airlines_by_delay = airlines_delays_cancels.sort_values("mean_dep_delay", ascending=False).head(9)
top_airlines_by_delay.head()
Out[25]:
airline mean_dep_delay total_cancellations
7 JetBlue Airways 42.141524 33
1 Allegiant Air 26.608696 6
2 American Airlines Inc. 20.129903 85
4 Frontier Airlines Inc. 14.355856 12
9 Southwest Airlines Co. 11.407163 100
In [26]:
# Identify airlines with the highest number of cancellations by airline
top_airlines_by_cancellations = airlines_delays_cancels.sort_values("total_cancellations", ascending=False).head(9)
top_airlines_by_cancellations.head()
Out[26]:
airline mean_dep_delay total_cancellations
0 Alaska Airlines Inc. 6.739282 1300
3 Delta Air Lines Inc. 10.648634 383
6 Horizon Air 6.270965 238
8 SkyWest Airlines Inc. 4.596561 165
9 Southwest Airlines Co. 11.407163 100
In [23]:
# Create a bar graph for highest mean departure delay by airline
top9_airline_delays_bar, ax = plt.subplots()
ax.bar(top_airlines_by_delay["airline"], top_airlines_by_delay["mean_dep_delay"])
ax.set_xlabel("Airline")
ax.set_ylabel("Mean Departure Delay")
ax.set_title("Airlines with Highest Mean Departure Delays")
ax.set_xticklabels(top_airlines_by_delay["airline"], rotation=90)
plt.show()
C:\Users\newbe\AppData\Local\Temp\ipykernel_2412\1878166829.py:7: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(top_airlines_by_delay["airline"], rotation=90)
No description has been provided for this image

Determine if 10-mile-per-hour wind gusts or more have a larger average departure delay for both of SEA and PDX.¶

In [24]:
# Are departure delays impacted by 10+ mph winds from each airport
flights_weather2022["group"] = flights_weather2022["wind_gust"].apply(lambda x: ">= 10mph" if x >= 10 else "< 10 mph")
wind_grouped_data = flights_weather2022.groupby(["group", "origin"]).agg(
    mean_dep_delay=("dep_delay", "mean")
)

print(wind_grouped_data)
                 mean_dep_delay
group    origin                
< 10 mph PDX           6.686966
         SEA           7.857595
>= 10mph PDX           9.147024
         SEA           9.231060

Findings:¶

We have found that route PDX-DSM has the higest mean departure delay and route SEA-LAX has the higest number of flights cancellation.¶

We also found that airline JetBlue Airways has the highest mean departure dealay and airline Alaska Airlines Inc. has the highest number of flights cancellation.¶

We also found that wind effects the flight delays if it is more than 10mph.¶

Thank you!¶