Project Description¶

Time series data lets us analyze how things change over time. By studying trends, patterns, and seasonal effects, we can make forecasts, detect unusual events, and make better-informed decisions.

In this project, we will use our pandas data manipulation skills to analyze time series data that tracks the tide levels of the Thames River over many years.

Time series data is common in many areas, such as tracking stock prices, monitoring climate change, or following the spread of disease during a pandemic. Here, we focus on tide data from the Thames. We will compute summary statistics on the data and then break the time series down into its main components for deeper analysis.

The original dataset comes from the British Oceanographic Data Centre.

For this project, we will work with the file 10-11_London_Bridge.txt. This file contains comma-separated values showing water levels in the Thames River at London Bridge. Once we finish this project, we can apply the same methods to analyze similar data from other locations in the UK where tidal measurements are collected.

The TXT file contains three important variables, described below:

Variable Name | Description | Format
Date and time | The date and time of the measurement, in GMT. The tide gauge records to one-minute accuracy. | dd/mm/yyyy hh:mm:ss
Water level | The high- or low-water level measured by the tide gauge, accurate to one centimetre. | metres (Admiralty Chart Datum, Ordnance Datum Newlyn, or Trinity High Water)
Flag | Indicates whether the reading is high water (1) or low water (0). | Categorical (0 or 1)

Through this project, we will practice cleaning, summarizing, and analyzing time series data, which is a valuable skill in many fields including environmental science, finance, and health.

In [5]:
# Package imports
import pandas as pd                
In [22]:
def IQR(column):
    """Calculate the interquartile range (IQR) of a Series using the quantile method."""
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25
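A quick sanity check of the helper on a tiny Series (a standalone sketch that re-states the function so the snippet runs on its own): the quartiles of 1..5 are 2.0 and 4.0, so the IQR should be 2.0.

```python
import pandas as pd

def IQR(column):
    """Interquartile range (IQR) of a Series via the quantile method."""
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

# Quartiles of [1, 2, 3, 4, 5] are 2.0 and 4.0
s = pd.Series([1, 2, 3, 4, 5])
print(IQR(s))  # → 2.0
```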
In [7]:
# Load the data from London Bridge
lb = pd.read_csv('10-11_London_Bridge.txt')
lb.head(3)
Out[7]:
Date and time water level (m ODN) flag HW=1 or LW=0
0 01/05/1911 15:40:00 3.7130 1 NaN
1 02/05/1911 11:25:00 -2.9415 0 NaN
2 02/05/1911 16:05:00 3.3828 1 NaN

Find the mean, median, and interquartile range for high- and low-tide data.¶

In [12]:
# Take only the first three columns (copy so later edits don't modify lb)
df = lb.iloc[:, :3].copy()

# Rename columns
df.columns = ['datetime', 'water_level', 'is_high_tide']

df.head(3)
Out[12]:
datetime water_level is_high_tide
0 01/05/1911 15:40:00 3.7130 1
1 02/05/1911 11:25:00 -2.9415 0
2 02/05/1911 16:05:00 3.3828 1
In [14]:
# Convert to datetime
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)

# Convert to float
df['water_level'] = df['water_level'].astype(float)

# Create extra month and year columns for easy access
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year

# Filter df for high and low tide
tide_high = df[df['is_high_tide'] == 1]
tide_low = df[df['is_high_tide'] == 0]

# Create summary statistics
high_statistics = tide_high['water_level'].agg(['mean', 'median', IQR])
low_statistics = tide_low['water_level'].agg(['mean', 'median', IQR])
In [15]:
high_statistics
Out[15]:
mean      3.318373
median    3.352600
IQR       0.743600
Name: water_level, dtype: float64
In [16]:
low_statistics
Out[16]:
mean     -2.383737
median   -2.412900
IQR       0.538200
Name: water_level, dtype: float64

Calculate the annual percentage of days with very high tide levels (above the 90th percentile of high-tide readings) and very low tide levels (below the 10th percentile of low-tide readings).¶

In [17]:
# Calculate the yearly ratio of very high tide days
all_high_days = tide_high.groupby('year')['water_level'].count()
p90 = tide_high['water_level'].quantile(0.90)
very_high_days = tide_high[tide_high['water_level'] > p90].groupby('year')['water_level'].count()
very_high_ratio = (very_high_days / all_high_days).reset_index()

very_high_ratio
Out[17]:
year water_level
0 1911 0.004098
1 1912 0.032316
2 1913 0.082212
3 1914 0.055313
4 1915 0.045045
... ... ...
80 1991 0.096317
81 1992 0.103253
82 1993 0.145923
83 1994 0.150355
84 1995 0.170213

85 rows × 2 columns
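Note that the cutoff above is the 90th percentile of all high-tide readings pooled across every year, not a per-year percentile. A toy, self-contained illustration of the same pattern on synthetic data (not the Thames file):

```python
import pandas as pd

# Synthetic stand-in for the tide data: two years of readings
toy = pd.DataFrame({
    'year': [2000] * 5 + [2001] * 5,
    'water_level': [1.0, 2.0, 3.0, 4.0, 10.0, 1.0, 1.5, 2.0, 2.5, 3.0],
})

# The cutoff is computed once, over ALL years pooled together
threshold = toy['water_level'].quantile(0.90)

# Per-year count of readings above the cutoff, divided by per-year totals;
# years with no extreme readings come out as NaN, hence the fillna(0)
above = toy[toy['water_level'] > threshold].groupby('year')['water_level'].count()
total = toy.groupby('year')['water_level'].count()
ratio = (above / total).fillna(0)
print(ratio)  # 2000 → 0.2, 2001 → 0.0
```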

In [18]:
# Calculate the yearly ratio of very low tide days
all_low_days = tide_low.groupby('year')['water_level'].count()
p10 = tide_low['water_level'].quantile(0.10)
very_low_days = tide_low[tide_low['water_level'] < p10].groupby('year')['water_level'].count()
very_low_ratio = (very_low_days / all_low_days).reset_index()

very_low_ratio
Out[18]:
year water_level
0 1911 0.060606
1 1912 0.066667
2 1913 0.022388
3 1914 0.039017
4 1915 0.033435
... ... ...
80 1991 0.150355
81 1992 0.107496
82 1993 0.112696
83 1994 0.106383
84 1995 0.107801

85 rows × 2 columns

Create a dictionary named solution summarizing the results of the analysis.¶

In [21]:
solution = {
    'high_statistics': high_statistics,
    'low_statistics': low_statistics,
    'very_high_ratio': very_high_ratio,
    'very_low_ratio': very_low_ratio,
}
print(solution)
{'high_statistics': mean      3.318373
median    3.352600
IQR       0.743600
Name: water_level, dtype: float64, 'low_statistics': mean     -2.383737
median   -2.412900
IQR       0.538200
Name: water_level, dtype: float64, 'very_high_ratio':     year  water_level
0   1911     0.004098
1   1912     0.032316
2   1913     0.082212
3   1914     0.055313
4   1915     0.045045
..   ...          ...
80  1991     0.096317
81  1992     0.103253
82  1993     0.145923
83  1994     0.150355
84  1995     0.170213

[85 rows x 2 columns], 'very_low_ratio':     year  water_level
0   1911     0.060606
1   1912     0.066667
2   1913     0.022388
3   1914     0.039017
4   1915     0.033435
..   ...          ...
80  1991     0.150355
81  1992     0.107496
82  1993     0.112696
83  1994     0.106383
84  1995     0.107801

[85 rows x 2 columns]}
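The project description mentioned breaking the time series down into its main components. The notebook stops at summary ratios, but a minimal decomposition sketch, using pandas only on synthetic monthly data (not the Thames file), could look like this: trend as a centered 12-month rolling mean, seasonality as the mean deviation from trend per calendar month, and the residual as whatever remains.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a linear trend plus a repeating pattern
idx = pd.date_range('2000-01-01', periods=48, freq='MS')
series = pd.Series(np.arange(48) * 0.1 + np.tile([1, 0, -1, 0] * 3, 4), index=idx)

# Trend: centered 12-month rolling mean (NaN at the edges of the series)
trend = series.rolling(window=12, center=True).mean()

# Seasonality: mean deviation from the trend for each calendar month
detrended = series - trend
seasonal = detrended.groupby(detrended.index.month).transform('mean')

# Residual: what the trend and seasonal components don't explain
residual = series - trend - seasonal
```

For real work, statsmodels' seasonal decomposition tools offer the same idea with fewer sharp edges; the sketch above just makes the arithmetic explicit.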