Project Description¶
Data comes in many different shapes, sizes, and formats. One important skill for any data professional is to combine and clean data from multiple sources so it can be analyzed effectively.
In this project, we will work with human resources data that is stored in different file formats: CSV, Excel, and JSON. Our main goal is to organize and structure this data because right now it is scattered across different teams and files, making it hard to use.
We will work with the following datasets:
Office addresses
This data is saved in the fileoffice_addresses.csv
.
If the office value isNaN
, it means the employee works remotely.Employee addresses
This information is found on the first sheet of the Excel fileemployee_information.xlsx
.Employee emergency contacts
This data is on the second sheet of the same Excel file, in a tab namedemergency_contacts
.
However, the headers were removed by mistake at some point. The HR manager told us the correct headers should be:
employee_id
,last_name
,first_name
,emergency_contact
,emergency_contact_number
, andrelationship
.Employee roles, teams, and salaries
This data was exported from the company’s human resources management system and is stored in the JSON fileemployee_roles.json
.
The structure of this file looks like this:{ "A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
Create a single pandas DataFrame called employees_final
containing:¶
Index: employee_id.
Columns: Ensure the DataFrame contains only the following columns, in the exact order listed: employee_first_name
, employee_last_name
, employee_country
, employee_city
, employee_street
, employee_street_number
, emergency_contact
, emergency_contact_number
, relationship
, monthly_salary
, team
, title
, office
, office_country
, office_city
, office_street
, office_street_number
.
Assign employees to offices based on their country. For any columns that begin with office
, replace missing data with "Remote"
.
# Import pandas
import pandas as pd
# Read office_addresses.csv
offices = pd.read_csv("office_addresses.csv")
# Read employee_information.xlsx
addresses = pd.read_excel("employee_information.xlsx")
# Declare a list of new column names
emergency_contacts_header = ["employee_id", "last_name", "first_name",
"emergency_contact", "emergency_contact_number", "relationship"]
# Read employee_information.xlsx
emergency_contacts = pd.read_excel("employee_information.xlsx",
sheet_name="emergency_contacts",
header=None,
names=emergency_contacts_header)
# Read in employee_roles.json
roles = pd.read_json("employee_roles.json", orient="index")
# Merge addresses with offices
employees = addresses.merge(offices, left_on="employee_country", right_on="office_country", how="left")
# Merge employees with roles
employees = employees.merge(roles, left_on="employee_id", right_on=roles.index)
# Merge employees with emergency_contacts
employees = employees.merge(emergency_contacts, on="employee_id")
# Fill null values in office columns
for col in ["office", "office_country", "office_city", "office_street", "office_street_number"]:
employees[col].fillna("Remote", inplace=True)
# Create final columns
final_columns = ["employee_id", "employee_first_name", "employee_last_name", "employee_country",
"employee_city", "employee_street", "employee_street_number",
"emergency_contact", "emergency_contact_number", "relationship",
"monthly_salary", "team", "title", "office", "office_country",
"office_city", "office_street", "office_street_number"]
# Subset for the required columns
employees_final = employees[final_columns]
# Set employee_id as the index
employees_final.set_index("employee_id", inplace=True)
# Fill null values in office columns
for col in ["office", "office_country", "office_city", "office_street", "office_street_number"]:
employees[col].fillna("Remote", inplace=True)
C:\Users\newbe\AppData\Local\Temp\ipykernel_10800\4009407767.py:3: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'Remote' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. employees[col].fillna("Remote", inplace=True)
# Create final columns
final_columns = ["employee_id", "employee_first_name", "employee_last_name", "employee_country",
"employee_city", "employee_street", "employee_street_number",
"emergency_contact", "emergency_contact_number", "relationship",
"monthly_salary", "team", "title", "office", "office_country",
"office_city", "office_street", "office_street_number"]
# Subset for the required columns
employees_final = employees[final_columns]
# Set employee_id as the index
employees_final.set_index("employee_id", inplace=True)
employees_final.head()
employee_first_name | employee_last_name | employee_country | employee_city | employee_street | employee_street_number | emergency_contact | emergency_contact_number | relationship | monthly_salary | team | title | office | office_country | office_city | office_street | office_street_number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
employee_id | |||||||||||||||||
A2R5H9 | Jax | Hunman | BE | Leuven | Grote Markt | 9 | Opie Hurst | +32-456-5556-84 | Brother | $4500 | Leadership | CEO | Leuven Office | BE | Leuven | Martelarenlaan | 38.0 |
H8K0L6 | Tara | Siff | GB | London | Baker Street | 221 | Wendy de Matteo | +44-020-5554-333 | Sister | $4500 | Leadership | CFO | WeWork Office | GB | London | Old Street | 207.0 |
G4R7V0 | Gemma | Sagal | US | New-York | Perry Street | 66 | John Newmark | +1-202-555-194 | Husband | $3000 | Sales | Business Developer | ESB Office | US | New York City | Fifth Avenue | 350.0 |
M1Z7U9 | Tig | Coates | FR | Paris | Rue de l'Université | 7 | Venus Noone | +1-202-555-0130 | Wife | $2000 | People Operations | Office Manager | Remote | Remote | Remote | Remote | Remote |