Project Description¶

In this project, we take on the role of data scientists working with a car insurance company. The company wants our help predicting whether a customer will make a claim on their car insurance during the policy period.

Car insurance is a huge business, and insurers invest heavily in predicting which customers are likely to file a claim. Accurate predictions help them price policies fairly and manage risk.

The company wants to start simple. They don't have advanced tooling in place yet, so they asked us to:

  • Find one feature (column) from the data that gives the most accurate predictions.
  • Use that feature to build a logistic regression model (a type of machine learning model; sketched briefly below).
  • Measure performance using accuracy.

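For context (this is the standard formulation, not anything specific to this dataset): a logistic regression with a single predictor $x$ models the probability of a claim as

$$P(\text{outcome} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

where $\beta_0$ and $\beta_1$ are estimated from the data, and a customer is predicted to make a claim when this probability exceeds 0.5.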
Goal¶

Our job is to:

  • Build several models, each using only one feature at a time.
  • Compare how well they predict whether someone will make a claim or not.
  • Tell the company which single feature works best.

The Dataset¶

The file is called car_insurance.csv and contains information about different customers. The last column, outcome, shows whether the customer made a claim (1) or not (0).

Dataset Columns¶

| Column | Description |
| --- | --- |
| id | Unique ID for each customer |
| age | Customer age group. 0: 16–25, 1: 26–39, 2: 40–64, 3: 65+ |
| gender | 0: Female, 1: Male |
| driving_experience | Years of driving experience. 0: 0–9, 1: 10–19, 2: 20–29, 3: 30+ |
| education | 0: No education, 1: High school, 2: University |
| income | 0: Poverty, 1: Working class, 2: Middle class, 3: Upper class |
| credit_score | Score between 0 and 1 (higher is better) |
| vehicle_ownership | 0: Doesn't own car, 1: Owns car |
| vehicle_year | 0: Before 2015, 1: 2015 or later |
| married | 0: Not married, 1: Married |
| children | Number of children |
| postal_code | Area code (not useful for prediction) |
| annual_mileage | Miles driven per year |
| vehicle_type | 0: Sedan, 1: Sports car |
| speeding_violations | Total number of speeding tickets |
| duis | Number of DUI offenses (driving under the influence) |
| past_accidents | Number of past car accidents |
| outcome | Target column. 1: Made a claim, 0: Did not make a claim |

Final Objective¶

Test every feature in the dataset, one at a time, to find out:

  • Which single feature gives the highest prediction accuracy for outcome?

This will give the company a simple starting point for its machine learning strategy.
In [2]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
In [3]:
# Read in dataset
cars = pd.read_csv("car_insurance.csv")

# Check for missing values
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  int64  
 16  past_accidents       10000 non-null  int64  
 17  outcome              10000 non-null  float64
dtypes: float64(6), int64(7), object(5)
memory usage: 1.4+ MB
In [4]:
# Fill missing values with the column mean
# (direct assignment avoids pandas' chained-assignment FutureWarning with inplace=True)
cars["credit_score"] = cars["credit_score"].fillna(cars["credit_score"].mean())
cars["annual_mileage"] = cars["annual_mileage"].fillna(cars["annual_mileage"].mean())
In [7]:
# Empty list to store model results
models = []

# Empty list to store accuracies
accuracies = []

# Feature columns
features = cars.drop(columns=["id", "outcome"]).columns
In [8]:
# Loop through features
for col in features:
    # Create a model
    model = logit(f"outcome ~ {col}", data=cars).fit()
    # Add each model to the models list
    models.append(model)
Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.617345
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.605716
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.621700
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.558922
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.598699
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.549220
         Iterations 7
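The convergence messages above are printed by the optimizer each time `fit()` runs. If the output is distracting, statsmodels' `fit()` accepts a `disp` argument that silences it. A minimal sketch of the same loop (`quiet_models` is just an illustrative name):

In [ ]:
# Same single-feature fits, with optimizer messages suppressed via disp=0
quiet_models = [logit(f"outcome ~ {col}", data=cars).fit(disp=0) for col in features]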
In [9]:
# Loop through models
for i in range(len(models)):
    # Confusion matrix at the default 0.5 threshold (rows: actual, columns: predicted)
    conf_matrix = models[i].pred_table()
    # True negatives
    tn = conf_matrix[0, 0]
    # True positives
    tp = conf_matrix[1, 1]
    # False negatives
    fn = conf_matrix[1, 0]
    # False positives
    fp = conf_matrix[0, 1]
    # Accuracy: correct predictions over all predictions
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
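Since `pred_table()` classifies observations at a 0.5 probability threshold by default, accuracy is simply the diagonal of the confusion matrix divided by the total count. An equivalent, more compact sketch of the cell above (`alt_accuracies` is an illustrative name):

In [ ]:
# Equivalent accuracy computation: trace (tn + tp) over the matrix total
alt_accuracies = [np.trace(m.pred_table()) / m.pred_table().sum() for m in models]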
In [10]:
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]

# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
                                "best_accuracy": max(accuracies)},
                                index=[0])
best_feature_df
Out[10]:
best_feature best_accuracy
0 driving_experience 0.7771
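As an optional inspection step (not required by the brief), pairing each accuracy with its feature name shows how the runner-up features compare:

In [ ]:
# Rank all single-feature models by accuracy
pd.Series(accuracies, index=features).sort_values(ascending=False)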
**driving_experience** is the best single feature for predicting the "outcome" variable, with an accuracy of 0.7771.