Project Description¶

In this project, we use machine learning to help farmers choose the best crop to grow in their fields. However, due to budget constraints, the farmers can afford to measure only one of four important soil properties:

  • Nitrogen (N) content in the soil
  • Phosphorous (P) content in the soil
  • Potassium (K) content in the soil
  • pH level of the soil

This is a classic feature selection problem. Our goal is to find out which one of these four features is the most helpful in predicting the right crop.


Background & Details¶

Farmers want to grow crops that will give them the highest yield, but testing all soil properties can be expensive. That’s why they need to choose just one soil test that gives them the most useful information.

Each crop grows best in a specific soil condition. For example, some crops need more nitrogen, while others grow better in slightly acidic soil. So choosing the right test helps farmers pick the best crop for their land.

We worked with a dataset called soil_measures.csv, which includes:

  • "N" – Nitrogen level in the soil
  • "P" – Phosphorous level in the soil
  • "K" – Potassium level in the soil
  • "pH" – Acidity/alkalinity level of the soil
  • "crop" – The crop that grows best in those conditions (our target)

Each row in the dataset represents soil test results from a different field.


Task¶

Our task is to:

  • Build a multi-class classification model that predicts the type of crop based on just one feature.
  • Try each feature separately (N, P, K, pH) and see which one gives the best prediction score.
  • Create a dictionary where:
    • The key is the name of the best feature
    • The value is its evaluation score (here, the weighted F1-score)

This will help farmers decide which single soil test is most valuable for predicting the right crop.

In [13]:
# Import required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
In [2]:
# load the dataset
crops = pd.read_csv("soil_measures.csv")
crops.head()
Out[2]:
    N   P   K        ph  crop
0  90  42  43  6.502985  rice
1  85  58  41  7.038096  rice
2  60  55  44  7.840207  rice
3  74  35  40  6.980401  rice
4  78  42  42  7.628473  rice
In [4]:
# check for missing values
crops.isna().sum()
Out[4]:
N       0
P       0
K       0
ph      0
crop    0
dtype: int64
In [5]:
# Check how many crops we have, i.e., multi-class target
crops.crop.unique()
Out[5]:
array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas',
       'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate',
       'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple',
       'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'],
      dtype=object)
In [6]:
# Split into feature and target sets
X = crops.drop(columns="crop")
y = crops["crop"]
In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
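
With 22 crop classes, a purely random split can leave some crops under-represented in the test set. A stratified variant (a sketch; the results below were produced with the unstratified split above) would pass stratify=y:

# Alternative split that preserves per-crop proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)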
In [21]:
# Create a dictionary to store the model performance for each feature
feature_performance = {}


# for scaling the data
scaler = StandardScaler()
In [24]:
for feature in ["N", "P", "K", "ph"]:
    # Scale the single feature (fit on train only, then transform test)
    X_train_scaled = scaler.fit_transform(X_train[[feature]])
    X_test_scaled = scaler.transform(X_test[[feature]])

    # Multi-class logistic regression on that one feature
    log_reg = LogisticRegression(max_iter=500, solver="saga")
    log_reg.fit(X_train_scaled, y_train)
    y_pred = log_reg.predict(X_test_scaled)

    # Weighted F1 averages per-class scores by class frequency
    f1 = metrics.f1_score(y_test, y_pred, average="weighted")
    feature_performance[feature] = f1

    print(f"F1-score for {feature}: {f1}")
F1-score for N: 0.10689900116508289
F1-score for P: 0.08426955444720075
F1-score for K: 0.13831456375684123
F1-score for ph: 0.045464856528065166
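
These scores come from a single train/test split, which can be noisy at F1 values this low. A cross-validated variant (a sketch, not executed here) would average the weighted F1 over five folds per feature, with scaling kept inside each fold to avoid leakage:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for feature in ["N", "P", "K", "ph"]:
    # The pipeline refits the scaler on each training fold
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=500, solver="saga"),
    )
    scores = cross_val_score(pipe, X[[feature]], y, cv=5, scoring="f1_weighted")
    print(f"Mean CV F1-score for {feature}: {scores.mean():.3f}")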
In [25]:
# K produced the best F1 score
best_feature = max(feature_performance, key=feature_performance.get)
print(f"\nBest feature: {best_feature} with F1-score: {feature_performance[best_feature]}")
Best feature: K with F1-score: 0.13831456375684123

It looks like "K" (Potassium) is the best single feature for predicting the "crop" variable, but the F1-score (0.138) is still quite low. This suggests that using only one feature isn't very effective for this classification problem.¶
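To put 0.138 in context, it helps to compare against a no-skill baseline. Below is a sketch (not run above) using scikit-learn's DummyClassifier; assuming the 22 classes are roughly balanced, random guessing should score near 1/22 ≈ 0.05, so the K-only model does carry some real signal.

from sklearn.dummy import DummyClassifier

# Predicts crops at random according to their training frequencies;
# with ~22 balanced classes this should score near 1/22 ≈ 0.05
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy.fit(X_train[["K"]], y_train)
y_pred_dummy = dummy.predict(X_test[["K"]])
print(metrics.f1_score(y_test, y_pred_dummy, average="weighted"))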

Try Other Models¶

Decision Tree Classifier¶

In [26]:
from sklearn.tree import DecisionTreeClassifier

# Fit an unconstrained decision tree on the single best feature, K
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train[["K"]], y_train)
y_pred_tree = tree_clf.predict(X_test[["K"]])

f1_tree = metrics.f1_score(y_test, y_pred_tree, average="weighted")
print(f"Decision Tree F1-score for K: {f1_tree}")
Decision Tree F1-score for K: 0.32406128337824247
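
A default decision tree grows until its leaves are pure, which can overfit a single noisy feature. A quick depth sweep (a sketch; the max_depth values here are arbitrary choices) would show whether a shallower tree generalizes better:

# Compare held-out F1 across a few tree depths (None = unconstrained)
for depth in [3, 5, 10, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train[["K"]], y_train)
    f1 = metrics.f1_score(y_test, clf.predict(X_test[["K"]]), average="weighted")
    print(f"max_depth={depth}: F1-score = {f1:.3f}")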

Random Forest Classifier (handles non-linearity better)¶

In [27]:
from sklearn.ensemble import RandomForestClassifier

# Ensemble of 100 trees on K alone; averaging reduces variance vs. a single tree
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train[["K"]], y_train)
y_pred_rf = rf_clf.predict(X_test[["K"]])

f1_rf = metrics.f1_score(y_test, y_pred_rf, average="weighted")
print(f"Random Forest F1-score for K: {f1_rf}")
Random Forest F1-score for K: 0.30604986326799966

Check Feature Importance¶

In [28]:
# Fit on all four features to rank their relative importance
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Impurity-based importances, one per feature, summing to 1
importances = rf_model.feature_importances_
feature_names = ["N", "P", "K", "ph"]

for name, importance in zip(feature_names, importances):
    print(f"Feature: {name}, Importance: {importance:.4f}")
Feature: N, Importance: 0.2143
Feature: P, Importance: 0.2530
Feature: K, Importance: 0.3241
Feature: ph, Importance: 0.2086
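
For comparison, rf_model above was already fit on all four features, so one extra step (a sketch, not executed here) would quantify how much predictive power the single-feature models give up:

# Held-out F1 using all four soil measures at once
y_pred_all = rf_model.predict(X_test)
f1_all = metrics.f1_score(y_test, y_pred_all, average="weighted")
print(f"Random Forest F1-score using all features: {f1_all:.3f}")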

"K" (Potassium) is the best single feature for predicting the "crop" variable, but the Decision Tree F1-score for K: 0.3240 is still quite low. Therefore single feature is not a good idea for this problem.