Project Description¶

In this project, we used our data analysis skills to understand where the ingredients in an avocado toast come from. We used data from the Open Food Facts database, which has detailed information about many food items, including their countries of origin.

We focused on three main ingredients:

  • Avocados
  • Olive oil
  • Sourdough bread (represented by lemons in the data as an example)

Each ingredient came with:

  • A CSV file (like avocado.csv) that has information about the product and where it comes from.
  • A TXT file (like relevant_avocado_categories.txt) that lists the important food categories we need to focus on.

What We Did¶

  • We filtered the data to keep only the rows that matched the relevant food categories (like fruits, oils, etc.) using the categories_tags column.
  • Each product often had more than one category, so we had to check all the tags in each row.
  • We then looked at the origins_tags column to find out the country of origin of each item.

Goal¶

By the end of this project, we created a list showing which countries the ingredients in avocado toast most commonly come from. This helps us better understand the global supply chain behind just one simple dish.

Our final task was to find the most common country of origin for:

  • Avocados
  • Olive oil
  • Sourdough bread
In [6]:
# import library and read tab-delimited data
import pandas as pd
avocado = pd.read_csv('avocado.csv', sep='\t')
avocado.head()
Out[6]:
code lc product_name_de product_name_el product_name_en product_name_es product_name_fi product_name_fr product_name_id product_name_it ... off:ecoscore_data.adjustments.packaging.non_recyclable_and_non_biodegradable_materials off:ecoscore_data.adjustments.production_system.value off:ecoscore_data.adjustments.threatened_species.value sources_fields:org-database-usda:available_date sources_fields:org-database-usda:fdc_category sources_fields:org-database-usda:fdc_data_source sources_fields:org-database-usda:fdc_id sources_fields:org-database-usda:modified_date sources_fields:org-database-usda:publication_date data_sources
0 0059749979702 fr NaN NaN NaN NaN NaN Naturalia Avocado Oil NaN NaN ... 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN App - yuka, Apps
1 7610095131409 en NaN NaN NaN NaN NaN Avocado Bowl chips NaN NaN ... 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN App - Yuka, Apps, Producers, Producer - zweifel
2 4005514005578 en NaN NaN Gelbe Linse Avocado Brotaufstrich NaN NaN NaN NaN NaN ... 1.0 15.0 NaN NaN NaN NaN NaN NaN NaN App - yuka, Apps, App - smoothie-openfoodfacts
3 0879890002513 en NaN NaN Avocado toast chili lime NaN NaN NaN NaN NaN ... 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN App - Yuka, Apps, App - InFood
4 0223086613685 en NaN NaN Avocado NaN NaN NaN NaN NaN ... 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN App - Yuka, Apps

5 rows × 184 columns

In [5]:
# Subset large DataFrame to include only relevant columns
subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
avocado = avocado[subset_columns]
avocado.head()
Out[5]:
code lc product_name_en quantity serving_size packaging_tags brands brands_tags categories_tags labels_tags countries countries_tags origins origins_tags
0 0059749979702 fr NaN NaN NaN NaN Naturalia naturalia en:plant-based-foods-and-beverages,en:plant-ba... NaN Canada en:canada NaN NaN
1 7610095131409 en NaN NaN NaN NaN Zweifel zweifel en:snacks,en:salty-snacks,en:appetizers,en:chi... en:vegetarian,en:vegan Switzerland, World en:switzerland,en:world NaN NaN
2 4005514005578 en Gelbe Linse Avocado Brotaufstrich NaN NaN NaN Tartex tartex de:abendbrotsufstrich en:organic,en:eu-organic,en:eg-oko-verordnung Germany en:germany NaN NaN
3 0879890002513 en Avocado toast chili lime NaN NaN NaN NaN NaN NaN NaN United States en:united-states NaN NaN
4 0223086613685 en Avocado NaN NaN NaN NaN NaN NaN NaN United States en:united-states NaN NaN
In [25]:
# Gather relevant categories data for avocados
with open("relevant_avocado_categories.txt", "r") as file:
    relevant_avocado_categories = file.read().splitlines()
    file.close()
In [26]:
relevant_avocado_categories
Out[26]:
['en:avocadoes',
 'en:avocados',
 'en:fresh-foods',
 'en:fresh-vegetables',
 'en:fruchte',
 'en:fruits',
 'en:raw-green-avocados',
 'en:tropical-fruits',
 'en:tropische-fruchte',
 'en:vegetables-based-foods',
 'fr:hass-avocados']
In [16]:
# Turn a column of comma-separated tags into a column of lists
avocado['categories_list'] = avocado['categories_tags'].str.split(',')

# Drop rows with null values in the newly created column
avocado = avocado.dropna(subset = 'categories_list')

# Filter a DataFrame based on a column of lists
avocado = avocado[avocado['categories_list'].apply(lambda x: any(i in relevant_avocado_categories for i in x))]

avocado.head()
Out[16]:
code lc product_name_de product_name_el product_name_en product_name_es product_name_fi product_name_fr product_name_id product_name_it ... off:ecoscore_data.adjustments.production_system.value off:ecoscore_data.adjustments.threatened_species.value sources_fields:org-database-usda:available_date sources_fields:org-database-usda:fdc_category sources_fields:org-database-usda:fdc_data_source sources_fields:org-database-usda:fdc_id sources_fields:org-database-usda:modified_date sources_fields:org-database-usda:publication_date data_sources categories_list
5 3662994002063 fr NaN NaN NaN NaN NaN Avocat hass NaN NaN ... 0.0 NaN NaN NaN NaN NaN NaN NaN App - smoothie-openfoodfacts, Apps, App - Yuka [en:plant-based-foods-and-beverages, en:plant-...
6 8437013031011 fr NaN NaN NaN NaN NaN Avocat NaN NaN ... 0.0 NaN NaN NaN NaN NaN NaN NaN App - Yuka, Apps [en:plant-based-foods-and-beverages, en:plant-...
14 4016249238155 de Saatenliebe Mohn Avocado NaN NaN NaN NaN NaN NaN NaN ... 15.0 NaN NaN NaN NaN NaN NaN NaN App - yuka, Apps [en:plant-based-foods-and-beverages, en:plant-...
17 8718963381532 de EAT ME AVOCADO NaN NaN NaN NaN NaN NaN NaN ... 0.0 NaN NaN NaN NaN NaN NaN NaN App - Yuka, Apps, App - smoothie-openfoodfacts [en:plant-based-foods-and-beverages, en:plant-...
23 8436002746707 es NaN NaN NaN AGUACATES CASADO PREMIUM NaN NaN NaN NaN ... 0.0 NaN NaN NaN NaN NaN NaN NaN App - elcoco, Apps, App - yuka, App - El CoCo [en:plant-based-foods-and-beverages, en:plant-...

5 rows × 185 columns

Where do most avocados come from?¶

In [18]:
# Filter DataFrame for UK data
avocados_uk = avocado[(avocado['countries']=='United Kingdom')]


# Find most common country for avocado origin
avocado_origin = (avocados_uk['origins_tags'].value_counts().index[0])
avocado_origin = avocado_origin.lstrip("en:")

avocado_origin
Out[18]:
'peru'

Lets create a general function to read and filter data for a particular ingredient, and return the top origin country for that food item¶

In [3]:
def read_and_filter_data(filename, relevant_categories):
    df = pd.read_csv(filename, sep='\t', low_memory=False)
  
  # Subset large DataFrame to include only relevant columns
    subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
    df = df[subset_columns]

    # Split tags into lists
    df['categories_list'] = df['categories_tags'].str.split(',')

    # Drop rows with null categories data
    df = df.dropna(subset = 'categories_list')

    # Filter data for relevant categories
    df = df[df['categories_list'].apply(lambda x: any([i for i in x if i in relevant_categories]))]
    
    # Filter data for the UK
    df_uk = df[(df['countries']=='United Kingdom')]

    # Find top origin country string with the highest count
    top_origin_string = (df_uk['origins_tags'].value_counts().index[0])

    # Clean up top origin country string
    top_origin_country = top_origin_string.lstrip("en:")
    top_origin_country = top_origin_country.replace('-', ' ')

    print(f'**{filename[:-4]} origins**','\n', top_origin_country, '\n')

    print ("Top origin country: ", top_origin_country)
    print ("\n")

    # End of function - return top origin country for this ingredient
    return top_origin_country
In [22]:
# Analyze avocado origins again, this time by calling function
top_avocado_origin = read_and_filter_data('avocado.csv',relevant_avocado_categories)
**avocado origins** 
 peru 

Top origin country:  peru


Repeat process above using function for the other 2 ingredients¶

In [5]:
# Read the file with UTF-8 encoding
with open("relevant_olive_oil_categories.txt", "r", encoding="utf-8") as file:
    relevant_olive_oil_categories = file.read().splitlines()
    file.close()
In [8]:
# Calling our function on olive_oil.csv
top_olive_oil_origin = read_and_filter_data('olive_oil.csv',relevant_olive_oil_categories)
**olive_oil origins** 
 greece 

Top origin country:  greece


In [32]:
# Gather relevant categories data for sourdough
with open("relevant_sourdough_categories.txt", "r") as file:
    relevant_sourdough_categories = file.read().splitlines()
    file.close()
In [34]:
# Call our function on sourdough.csv
top_sourdough_origin = read_and_filter_data('sourdough.csv',relevant_sourdough_categories)
**sourdough origins** 
 united kingdom 

Top origin country:  united kingdom


We can explore other questions also using the available food data!¶