Project Description¶

Imagine you're a Data Scientist working at a top Health Insurance company. Your job is to create a machine learning model that predicts how much a customer will be charged for their medical expenses.

But this project is not just about building a model — you’ll also test it on new customer data to see how well it works in the real world. This can help the company give better advice to customers and help them plan their healthcare spending more wisely.


Goal¶

Use machine learning to:

  • Understand which factors affect healthcare costs.
  • Predict medical charges for both existing and new customers.
  • Support better financial planning and service personalization.

Dataset Summary¶

We are using a file called insurance.csv, which contains information about people who have health insurance. This data will help us find patterns in medical charges.

insurance.csv¶

Column Description
age Age of the person (main customer).
sex Gender of the person (male or female).
bmi Body Mass Index — a number that shows if someone is underweight, normal, or overweight.
children How many dependents (kids) are covered by the insurance.
smoker Whether the person smokes (yes or no).
region The region where the person lives (Northeast, Northwest, Southeast, Southwest).
charges Medical costs billed by the insurance company (our target column).

Final Step¶

Once we build and train our model using the insurance.csv data, we will test it on a new dataset called validation_dataset.csv. This file is similar to the original one but does not include the charges column.

Our task is to predict the missing charges using the model. This helps us check how well the model works on real-world data.

In [1]:
# import the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


# load the dataset
insurance = pd.read_csv('insurance.csv', )
insurance.head()
Out[1]:
age sex bmi children smoker region charges
0 19.0 female 27.900 0.0 yes southwest 16884.924
1 18.0 male 33.770 1.0 no Southeast 1725.5523
2 28.0 male 33.000 3.0 no southeast $4449.462
3 33.0 male 22.705 0.0 no northwest $21984.47061
4 32.0 male 28.880 0.0 no northwest $3866.8552
In [6]:
def clean_dataset(insurance):
    """
    Cleans the insurance dataset by performing several preprocessing tasks:
    - Corrects the 'sex' column values to a standard format ('male', 'female').
    - Removes the dollar sign from the 'charges' column and converts it to float.
    - Drops negative 'age' values.
    - Converts negative 'children' values to zero.
    - Converts 'region' values to lowercase.
    - Drops rows with any missing values.
    
    Parameters:
    - insurance: pandas DataFrame, the insurance dataset.
    
    Returns:
    - DataFrame after cleaning.
    """
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    insurance['charges'] = insurance['charges'].replace({'\$': ''}, regex=True).astype(float)
    insurance = insurance[insurance["age"] > 0]
    insurance.loc[insurance["children"] < 0, "children"] = 0
    insurance.loc[:, "region"] = insurance["region"].str.lower()

    return insurance.dropna()
In [7]:
def create_and_evaluate_regression_model(insurance):
    """
    Prepares the data, fits a linear regression model, and evaluates it using cross-validation.
    
    Parameters:
    - insurance: pandas DataFrame, the cleaned insurance dataset.
    
    Returns:
    - A tuple containing the fitted sklearn Pipeline object, mean MSE, and mean R2 scores.
    """
    # Preprocessing
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']
    
    # Convert categorical variables to dummy variables
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    
    # Combine numerical features with dummy variables
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
    # Scaling numerical features
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X_processed), columns=X_processed.columns)
    # Linear regression model
    lin_reg = LinearRegression()
    
    # Pipeline
    steps = [("scaler", scaler), ("lin_reg", lin_reg)]
    insurance_model_pipeline = Pipeline(steps)
    
    # Fitting the model
    insurance_model_pipeline.fit(X_scaled, y)
    
    # Evaluating the model
    mse_scores = -cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)
    
    return insurance_model_pipeline, mean_mse, mean_r2
In [8]:
# Usage example
cleaned_insurance = clean_dataset(insurance)
insurance_model, mean_mse, r2_score = create_and_evaluate_regression_model(cleaned_insurance)
print("Mean MSE:", mean_mse)
print("Mean R2:", r2_score)
Mean MSE: 37431001.52191916
Mean R2: 0.7450511466263761
In [9]:
# Predict on validation data
validation_data = pd.read_csv('validation_dataset.csv')

# Ensure categorical variables are properly transformed
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

# Make predictions using the trained model
validation_predictions = insurance_model.predict(validation_data_processed)

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_data.head()
Out[9]:
age sex bmi children smoker region predicted_charges
0 18.0 female 24.090000 1.0 no southeast 128624.195643
1 39.0 male 26.410000 0.0 yes northeast 220740.537449
2 27.0 male 29.150000 0.0 yes southeast 181357.588606
3 71.0 male 65.502135 13.0 yes southeast 423490.687270
4 28.0 male 38.060000 0.0 no southeast 193247.431989
In [ ]: