Project Description¶
When working with very large datasets to create models that generate business value, one common challenge is that the models can take a very long time to produce predictions—sometimes days. To make these models run faster without reducing the dataset size, it is important to store the data as efficiently as possible.
In this project, we are working for a major online data science training company called Training Data Ltd. Our goal is to clean and optimize one of their largest customer datasets. This dataset will be used to build a model that predicts whether their students are actively looking for a new job. The company plans to use these predictions to connect students with potential recruiters.
We have access to a subset of the full customer data named customer_train.csv. This smaller dataset contains anonymized information about students and whether they were looking for a new job during their training period.
Here is an overview of the dataset columns:
| Column | Description |
|---|---|
student_id |
A unique identifier for each student. |
city |
A code representing the city where the student lives. |
city_development_index |
A scaled index that measures the development level of the city. |
gender |
The gender of the student. |
relevant_experience |
Indicates if the student has work experience relevant to their field. |
enrolled_university |
The type of university course the student is enrolled in, if any. |
education_level |
The highest education level attained by the student. |
major_discipline |
The main educational discipline or field of study of the student. |
experience |
The total number of years of work experience the student has. |
company_size |
The size of the company where the student is currently employed (number of employees). |
company_type |
The type of company the student works for. |
last_new_job |
The number of years between the student's current job and their previous job. |
training_hours |
The total number of training hours the student has completed. |
job_change |
A binary indicator where 1 means the student is looking for a new job, and 0 means they are not. |
Our task is to improve the dataset’s storage efficiency to help build a faster and more effective predictive model. We will do this by converting data types to more appropriate formats, creating ordered categorical variables, and filtering or optimizing these categories. This preparation will make the data ready for modeling and allow faster processing without loing important information.
Store the columns containing categories with only two factors as Booleans (bool)¶
Store the columns containing integers only as 32-bit integers (int32)¶
Store the columns containing floats as 16-bit floats (float16)¶
Store the columns containing nominal categorical data as the category data type¶
Store the columns containing ordinal categorical data ordered categories¶
Filter the final result to only contain students with 10 or more years of experience at companies with at least 1000 employees, as their recruiter base is suited to more experienced professionals at enterprise companies¶
# Import necessary libraries
import pandas as pd
# Load the dataset and create a copy
ds_jobs = pd.read_csv("customer_train.csv")
ds_jobs_transformed = ds_jobs.copy()
ds_jobs.head()
| student_id | city | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | job_change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.920 | Male | Has relevant experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevant experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevant experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevant experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevant experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |
# EDA to help identify ordinal, nominal, and two-factor categories
for col in ds_jobs.select_dtypes("object").columns:
print(ds_jobs_transformed[col].value_counts(), '\n')
city
city_103 4355
city_21 2702
city_16 1533
city_114 1336
city_160 845
...
city_129 3
city_111 3
city_121 3
city_140 1
city_171 1
Name: count, Length: 123, dtype: int64
gender
Male 13221
Female 1238
Other 191
Name: count, dtype: int64
relevant_experience
Has relevant experience 13792
No relevant experience 5366
Name: count, dtype: int64
enrolled_university
no_enrollment 13817
Full time course 3757
Part time course 1198
Name: count, dtype: int64
education_level
Graduate 11598
Masters 4361
High School 2017
Phd 414
Primary School 308
Name: count, dtype: int64
major_discipline
STEM 14492
Humanities 669
Other 381
Business Degree 327
Arts 253
No Major 223
Name: count, dtype: int64
experience
>20 3286
5 1430
4 1403
3 1354
6 1216
2 1127
7 1028
10 985
9 980
8 802
15 686
11 664
14 586
1 549
<1 522
16 508
12 494
13 399
17 342
19 304
18 280
20 148
Name: count, dtype: int64
company_size
50-99 3083
100-499 2571
10000+ 2019
10-49 1471
1000-4999 1328
<10 1308
500-999 877
5000-9999 563
Name: count, dtype: int64
company_type
Pvt Ltd 9817
Funded Startup 1001
Public Sector 955
Early Stage Startup 603
NGO 521
Other 121
Name: count, dtype: int64
last_new_job
1 8040
>4 3290
2 2900
never 2452
4 1029
3 1024
Name: count, dtype: int64
# Create a dictionary of columns containing ordered categorical data
ordered_cats = {
'enrolled_university': ['no_enrollment', 'Part time course', 'Full time course'],
'education_level': ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd'],
'experience': ['<1'] + list(map(str, range(1, 21))) + ['>20'],
'company_size': ['<10', '10-49', '50-99', '100-499', '500-999', '1000-4999', '5000-9999', '10000+'],
'last_new_job': ['never', '1', '2', '3', '4', '>4']
}
# Create a mapping dictionary of columns containing two-factor categories to convert to Booleans
two_factor_cats = {
'relevant_experience': {'No relevant experience': False, 'Has relevant experience': True},
'job_change': {0.0: False, 1.0: True}
}
# Loop through DataFrame columns to efficiently change data types
for col in ds_jobs_transformed:
# Convert two-factor categories to bool
if col in ['relevant_experience', 'job_change']:
ds_jobs_transformed[col] = ds_jobs_transformed[col].map(two_factor_cats[col])
# Convert integer columns to int32
elif col in ['student_id', 'training_hours']:
ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('int32')
# Convert float columns to float16
elif col == 'city_development_index':
ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('float16')
# Convert columns containing ordered categorical data to ordered categories using dict
elif col in ordered_cats.keys():
category = pd.CategoricalDtype(ordered_cats[col], ordered=True)
ds_jobs_transformed[col] = ds_jobs_transformed[col].astype(category)
# Convert remaining columns to standard categories
else:
ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category')
# Filter students with 10 or more years experience at companies with at least 1000 employees
ds_jobs_transformed = ds_jobs_transformed[(ds_jobs_transformed['experience'] >= '10') & (ds_jobs_transformed['company_size'] >= '1000-4999')]
ds_jobs_transformed.head()
| student_id | city | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | job_change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 699 | city_103 | 0.919922 | NaN | True | no_enrollment | Graduate | STEM | 17 | 10000+ | Pvt Ltd | >4 | 123 | False |
| 12 | 25619 | city_61 | 0.913086 | Male | True | no_enrollment | Graduate | STEM | >20 | 1000-4999 | Pvt Ltd | 3 | 23 | False |
| 31 | 22293 | city_103 | 0.919922 | Male | True | Part time course | Graduate | STEM | 19 | 5000-9999 | Pvt Ltd | >4 | 141 | False |
| 34 | 26494 | city_16 | 0.910156 | Male | True | no_enrollment | Graduate | Business Degree | 12 | 5000-9999 | Pvt Ltd | 3 | 145 | False |
| 40 | 2547 | city_114 | 0.925781 | Female | True | Full time course | Masters | STEM | 16 | 1000-4999 | Public Sector | 2 | 14 | False |