Project Description¶
Sports clothing is a fast-growing industry. In this project, we will use our exploratory data analysis skills to study product data for an online sports retail company.
The sports clothing and athleisure market is very large, valued at around $193 billion in 2021, with strong growth expected in the coming years. (Source: Statista Sports Apparel Market)
We will take on the role of product analysts working for an online sports clothing company. The company wants to understand how it can increase its revenue. To do this, we will explore product data including prices, reviews, descriptions, ratings, revenue, and website traffic. Our goal is to provide useful recommendations to the marketing and sales teams.
We will work with different types of data to answer important questions that help the company improve its revenue.
The data:¶
We have four datasets to explore:
brands.csv
Columns | Description |
---|---|
product_id |
Unique product identifier |
brand |
Brand name of the product |
finance.csv
Columns | Description |
---|---|
product_id |
Unique product identifier |
listing_price |
Original price of the product |
sale_price |
Discounted price of the product |
discount |
Discount off the listing price (as a decimal) |
revenue |
Revenue generated by the product |
info.csv
Columns | Description |
---|---|
product_name |
Name of the product |
product_id |
Unique product identifier |
description |
Description of the product |
reviews.csv
Columns | Description |
---|---|
product_id |
Unique product identifier |
rating |
Average rating of the product |
reviews |
Number of reviews for the product |
Using these datasets, we will analyze the relationships between product features, pricing, customer feedback, and sales performance. Our analysis will help the company find ways to increase sales and improve customer satisfaction.
# Importing libraries
import pandas as pd
# Loading the data
brands = pd.read_csv("brands.csv")
finance = pd.read_csv("finance.csv")
info = pd.read_csv("info.csv")
reviews = pd.read_csv("reviews.csv")
# Merge the data and drop null values
merged_df = info.merge(finance, on="product_id")
merged_df = merged_df.merge(reviews, on="product_id")
merged_df = merged_df.merge(brands, on="product_id")
merged_df.dropna(inplace=True)
# final merged data
merged_df.head()
product_name | product_id | description | listing_price | sale_price | discount | revenue | rating | reviews | brand | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Women's adidas Originals Sleek Shoes | G27341 | A modern take on adidas sport heritage, tailor... | 75.99 | 37.99 | 0.5 | 1641.17 | 3.3 | 24.0 | Adidas |
2 | Women's adidas Swim Puka Slippers | CM0081 | These adidas Puka slippers for women's come wi... | 9.99 | 5.99 | 0.4 | 398.93 | 2.6 | 37.0 | Adidas |
3 | Women's adidas Sport Inspired Questar Ride Shoes | B44832 | Inspired by modern tech runners, these women's... | 69.99 | 34.99 | 0.5 | 2204.37 | 4.1 | 35.0 | Adidas |
4 | Women's adidas Originals Taekwondo Shoes | D98205 | This design is inspired by vintage Taekwondo s... | 79.99 | 39.99 | 0.5 | 5182.70 | 3.5 | 72.0 | Adidas |
5 | Women's adidas Sport Inspired Duramo Lite 2.0 ... | B75586 | Refine your interval training in these women's... | 47.99 | 19.20 | 0.6 | 1555.20 | 1.0 | 45.0 | Adidas |
What is the volume of products and average revenue for Adidas and Nike products based on listing price quartiles?¶
# Add price labels based on listing_price quartiles
twenty_fifth = merged_df["listing_price"].quantile(0.25)
median = merged_df["listing_price"].quantile(0.5)
seventy_fifth = merged_df["listing_price"].quantile(0.75)
maximum = merged_df["listing_price"].max()
merged_df["price_label"] = pd.cut(merged_df["listing_price"], bins=[0, twenty_fifth, median, seventy_fifth, maximum], labels=["Budget", "Average", "Expensive", "Elite"], include_lowest=True)
# Group by brand and price_label to get volume and mean revenue
adidas_vs_nike = merged_df.groupby(["brand", "price_label"], as_index=False).agg(
num_products=("price_label", "count"),
mean_revenue=("revenue", "mean")
).round(2)
adidas_vs_nike
C:\Users\newbe\AppData\Local\Temp\ipykernel_13972\2545225764.py:9: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. adidas_vs_nike = merged_df.groupby(["brand", "price_label"], as_index=False).agg(
brand | price_label | num_products | mean_revenue | |
---|---|---|---|---|
0 | Adidas | Budget | 574 | 2015.68 |
1 | Adidas | Average | 655 | 3035.30 |
2 | Adidas | Expensive | 759 | 4621.56 |
3 | Adidas | Elite | 587 | 8302.78 |
4 | Nike | Budget | 357 | 1596.33 |
5 | Nike | Average | 8 | 675.59 |
6 | Nike | Expensive | 47 | 500.56 |
7 | Nike | Elite | 130 | 1367.45 |
Do any differences exist between the word count of a product's description and its mean rating?¶
# Store the length of each description
merged_df["description_length"] = merged_df["description"].str.len()
# Upper description length limits
lengthes = [0, 100, 200, 300, 400, 500, 600, 700]
# Description length labels
labels = ["100", "200", "300", "400", "500", "600", "700"]
# Cut into bins
merged_df["description_length"] = pd.cut(merged_df["description_length"], bins=lengthes, labels=labels)
# Group by the bins
description_lengths = merged_df.groupby("description_length", as_index=False).agg(
mean_rating=("rating", "mean"),
total_reviews=("reviews", "sum")
).round(2)
description_lengths
C:\Users\newbe\AppData\Local\Temp\ipykernel_13972\2042162190.py:16: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. description_lengths = merged_df.groupby("description_length", as_index=False).agg(
description_length | mean_rating | total_reviews | |
---|---|---|---|
0 | 100 | 2.26 | 36.0 |
1 | 200 | 3.19 | 17719.0 |
2 | 300 | 3.28 | 76115.0 |
3 | 400 | 3.29 | 28994.0 |
4 | 500 | 3.35 | 4984.0 |
5 | 600 | 3.12 | 852.0 |
6 | 700 | 3.65 | 818.0 |