Project Description¶

Data comes in many different shapes, sizes, and formats. One important skill for any data professional is to combine and clean data from multiple sources so it can be analyzed effectively.

In this project, we will work with human resources data that is stored in different file formats: CSV, Excel, and JSON. Our main goal is to organize and structure this data because right now it is scattered across different teams and files, making it hard to use.

We will work with the following datasets:

  • Office addresses
    This data is saved in the file office_addresses.csv.
    If the office value is NaN, it means the employee works remotely.

  • Employee addresses
    This information is found on the first sheet of the Excel file employee_information.xlsx.

  • Employee emergency contacts
    This data is on the second sheet of the same Excel file, in a tab named emergency_contacts.
    However, the headers were removed by mistake at some point. The HR manager told us the correct headers should be:
    employee_id, last_name, first_name, emergency_contact, emergency_contact_number, and relationship.

  • Employee roles, teams, and salaries
    This data was exported from the company’s human resources management system and is stored in the JSON file employee_roles.json.
    The structure of this file looks like this:

    {
      "A2R5H9": {
        "title": "CEO",
        "monthly_salary": "$4500",
        "team": "Leadership"
      },
      ...
    }
    

Create a single pandas DataFrame called employees_final containing:¶

Index: employee_id.

Columns: Ensure the DataFrame contains only the following columns, in the exact order listed: employee_first_name, employee_last_name, employee_country, employee_city, employee_street, employee_street_number, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.

Assign employees to offices based on their country. For any columns that begin with office, replace missing data with "Remote".

In [1]:
# Import pandas
import pandas as pd

# Read office_addresses.csv
offices = pd.read_csv("office_addresses.csv")

# Read employee_information.xlsx
addresses = pd.read_excel("employee_information.xlsx")
In [2]:
# Declare a list of new column names
emergency_contacts_header = ["employee_id", "last_name", "first_name",
                             "emergency_contact", "emergency_contact_number", "relationship"]
In [4]:
# Read employee_information.xlsx
emergency_contacts = pd.read_excel("employee_information.xlsx", 
                                   sheet_name="emergency_contacts", 
                                   header=None,
                                   names=emergency_contacts_header)

# Read in employee_roles.json
roles = pd.read_json("employee_roles.json", orient="index")

# Merge addresses with offices
employees = addresses.merge(offices, left_on="employee_country", right_on="office_country", how="left")

# Merge employees with roles
employees = employees.merge(roles, left_on="employee_id", right_on=roles.index)

# Merge employees with emergency_contacts
employees = employees.merge(emergency_contacts, on="employee_id")
In [ ]:
# Fill null values in office columns
for col in ["office", "office_country", "office_city", "office_street", "office_street_number"]:
    employees[col].fillna("Remote", inplace=True)
    
# Create final columns
final_columns = ["employee_id", "employee_first_name", "employee_last_name", "employee_country", 
                 "employee_city", "employee_street", "employee_street_number", 
                 "emergency_contact", "emergency_contact_number", "relationship", 
                 "monthly_salary", "team", "title", "office", "office_country", 
                 "office_city", "office_street", "office_street_number"]

# Subset for the required columns
employees_final = employees[final_columns]

# Set employee_id as the index
employees_final.set_index("employee_id", inplace=True)
In [5]:
# Fill null values in office columns
for col in ["office", "office_country", "office_city", "office_street", "office_street_number"]:
    employees[col].fillna("Remote", inplace=True)
C:\Users\newbe\AppData\Local\Temp\ipykernel_10800\4009407767.py:3: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'Remote' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  employees[col].fillna("Remote", inplace=True)
In [6]:
# Create final columns
final_columns = ["employee_id", "employee_first_name", "employee_last_name", "employee_country", 
                 "employee_city", "employee_street", "employee_street_number", 
                 "emergency_contact", "emergency_contact_number", "relationship", 
                 "monthly_salary", "team", "title", "office", "office_country", 
                 "office_city", "office_street", "office_street_number"]

# Subset for the required columns
employees_final = employees[final_columns]

# Set employee_id as the index
employees_final.set_index("employee_id", inplace=True)

employees_final.head()
Out[6]:
employee_first_name employee_last_name employee_country employee_city employee_street employee_street_number emergency_contact emergency_contact_number relationship monthly_salary team title office office_country office_city office_street office_street_number
employee_id
A2R5H9 Jax Hunman BE Leuven Grote Markt 9 Opie Hurst +32-456-5556-84 Brother $4500 Leadership CEO Leuven Office BE Leuven Martelarenlaan 38.0
H8K0L6 Tara Siff GB London Baker Street 221 Wendy de Matteo +44-020-5554-333 Sister $4500 Leadership CFO WeWork Office GB London Old Street 207.0
G4R7V0 Gemma Sagal US New-York Perry Street 66 John Newmark +1-202-555-194 Husband $3000 Sales Business Developer ESB Office US New York City Fifth Avenue 350.0
M1Z7U9 Tig Coates FR Paris Rue de l'Université 7 Venus Noone +1-202-555-0130 Wife $2000 People Operations Office Manager Remote Remote Remote Remote Remote