Building a Diabetes Prediction Model: From Data to Insights
Diabetes affects millions of people worldwide, and early detection can significantly improve health outcomes. In this post, I’ll walk you through my journey of building a machine learning model to predict diabetes risk using real health data from the CDC (Centers for Disease Control and Prevention).
This project represents my latest exploration into healthcare machine learning, building on my previous work with image classification and other AI applications. The goal was to create a robust model that could help identify individuals at risk of diabetes based on various health indicators.
As software engineers and data scientists, we spend a good chunk of our time building models that can make a real impact. That time is well spent, because predictive modeling in healthcare allows us to:
- Detect potential health risks before they become serious problems
- Check if our models are doing what they’re supposed to do
- Ensure our predictions are reliable and actionable
- Maintain consistent model performance and interpretability
The Dataset: CDC Diabetes Health Indicators
We’re working with the CDC Diabetes Health Indicators dataset, which contains comprehensive health information from over 240,000 individuals. This dataset includes various health metrics that could be predictive of diabetes risk.
The dataset includes the following key features:
- Demographic data: Age, sex, education level, income
- Health indicators: BMI, blood pressure, cholesterol levels
- Lifestyle factors: Physical activity, smoking status, alcohol consumption
- Medical history: Heart disease, stroke, kidney disease
- Target variable: Diabetes status (binary: diabetic/non-diabetic)
We treat the dataset as read-only: it is loaded once and then used for analysis and model training. It would only need to be rewritten if new health records were added.
Setting Up Our Environment
First, let’s import the necessary libraries and set up our workspace. The setup is straightforward: Python with pandas and scikit-learn, run in a Jupyter notebook, plus imbalanced-learn for resampling.
# Import necessary libraries
import numpy as np
import pandas as pd
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Import model libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
# Import sampler libraries for handling imbalanced data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imbPipeline
# Set the decimal format for cleaner output
pd.options.display.float_format = "{:.2f}".format
Data Exploration and Preprocessing
Loading and Initial Data Exploration
First, let’s load our dataset and get a sense of what we’re working with:
# Load the dataset
df = pd.read_csv('diabetes_health_indicators.csv')
# Basic information about our dataset
print(f"Dataset shape: {df.shape}")
print(f"Number of features: {df.shape[1]}")
print(f"Number of samples: {df.shape[0]}")
# Check the target variable distribution
diabetes_counts = df['Diabetes_binary'].value_counts()
print("\nDiabetes Distribution:")
print(diabetes_counts)
print(f"\nPercentage of diabetic individuals: {(diabetes_counts[1]/len(df)*100):.2f}%")
Understanding the Data Distribution
One of the first challenges we encountered was class imbalance. The dataset showed a significant imbalance between diabetic and non-diabetic individuals:
- Non-diabetic individuals: ~85% of the dataset
- Diabetic individuals: ~15% of the dataset
This imbalance is typical in healthcare datasets but poses challenges for model training. Let’s visualize this:
# Visualize the class distribution
plt.figure(figsize=(12, 4))
# Bar chart
plt.subplot(1, 3, 1)
sns.countplot(x='Diabetes_binary', data=df, palette=['skyblue', 'lightcoral'])
plt.title('Diabetes Distribution (Bar Chart)')
plt.xlabel('Diabetes Status (0=Non-Diabetic, 1=Diabetic)')
plt.ylabel('Count')
# Pie chart
plt.subplot(1, 3, 2)
diabetes_counts = df['Diabetes_binary'].value_counts()
plt.pie(diabetes_counts, labels=['Non-Diabetic (0)', 'Diabetic (1)'],
        autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Diabetes Distribution (Pie Chart)')
# Table
plt.subplot(1, 3, 3)
table_data = [[label, count] for label, count in diabetes_counts.items()]
table = plt.table(cellText=table_data, colLabels=['Diabetes Status', 'Count'], loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
plt.axis('off')
plt.title('Diabetes Distribution (Table)')
plt.tight_layout()
plt.show()
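To make the practical cost of this imbalance concrete, here is a minimal sketch using toy labels that mirror the roughly 85/15 split (these are not the real records, and the helper functions are hypothetical, written only for this illustration). It shows that a "model" which always predicts non-diabetic already looks accurate while catching no diabetic cases at all.

```python
# Toy labels mirroring the roughly 85/15 split described above (illustrative only)
y_true = [0] * 85 + [1] * 15

# A "model" that always predicts the majority class (non-diabetic)
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
print(f"Majority-class baseline recall:   {baseline_recall:.2f}")
```

Any model we train should be judged against this baseline rather than against zero.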
Data Preprocessing and Feature Engineering
Before training our models, we need to prepare the data properly. We use pandas for cleaning and feature engineering, and scikit-learn for the preprocessing pipeline.
# Convert target variable to numeric, but only if it is stored as text labels
# (if the column already holds 0/1, mapping the labels would turn every value into NaN)
if df['Diabetes_binary'].dtype == object:
    df['Diabetes_binary'] = df['Diabetes_binary'].map({'Non-Diabetic': 0, 'Diabetic': 1})
# Create age groups for better feature representation
# Note: these thresholds assume Age is recorded in years; adjust them if your copy
# of the dataset stores Age as age-bracket codes instead
def create_age_groups(age):
    if age < 35:
        return 'Young'
    elif age < 65:
        return 'Middle-aged'
    else:
        return 'Senior'
df['Age_Group'] = df['Age'].apply(create_age_groups)
# Create BMI categories
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'
df['BMI_Category'] = df['BMI'].apply(categorize_bmi)
# Create overall health score (composite of multiple health indicators)
health_indicators = ['GenHlth', 'MentHlth', 'PhysHlth']
df['OverallHlthScore'] = df[health_indicators].mean(axis=1)
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
# Handle missing values
# numeric_only=True is needed because Age_Group and BMI_Category hold text
df = df.fillna(df.median(numeric_only=True)) # For numeric columns
df = df.fillna(df.mode().iloc[0]) # For categorical columns
Feature Selection and Correlation Analysis
Let’s analyze which features are most correlated with diabetes:
# Calculate correlation with target variable
# (numeric_only=True skips the text columns Age_Group and BMI_Category)
corr = df.corr(numeric_only=True)
target_corr = corr['Diabetes_binary'].drop('Diabetes_binary')
# Plot correlation with diabetes
plt.figure(figsize=(10, 6))
target_corr.sort_values(ascending=True).plot(kind='barh')
plt.title('Correlation with Diabetes')
plt.xlabel('Correlation Coefficient')
plt.tight_layout()
plt.show()
# Show top 10 most correlated features
print("Top 10 features most correlated with diabetes:")
print(target_corr.abs().sort_values(ascending=False).head(10))
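For readers who want to see what these coefficients measure, here is a small sketch of the Pearson correlation between one feature and a binary target, written in plain Python. The `pearson` helper and the toy BMI values are assumptions made up for this illustration; they are not taken from the CDC data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: higher BMI goes together with the positive label
bmi = [22, 24, 27, 31, 35, 38]
diabetic = [0, 0, 0, 1, 1, 1]

r = pearson(bmi, diabetic)
print(f"Correlation between BMI and diabetes label: {r:.3f}")
```

A value near +1 or -1 means the feature moves closely with the label, while a value near 0 means there is little linear relationship. Correlation only captures linear association, so a weakly correlated feature can still be useful to a tree-based model.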
Model Development Strategy
Now what? There are several modeling approaches we can take, each with its own pros and cons, and they can be combined. Let’s prepare our data for modeling and implement our machine learning pipeline:
Data Preparation for Modeling
# Separate features and target
features = df.drop(['Diabetes_binary'], axis=1)
target = df['Diabetes_binary']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print(f"Training set diabetes rate: {y_train.mean():.3f}")
print(f"Testing set diabetes rate: {y_test.mean():.3f}")
# Define preprocessing pipeline
numeric_features = ['Age', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'OverallHlthScore']
categorical_features = ['Sex', 'Education', 'Income', 'Age_Group', 'BMI_Category']
# Create preprocessing transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    # sparse_output replaces the older `sparse` argument (scikit-learn 1.2+)
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])
# Combine transformers
# Note: columns not listed in either feature list are dropped (ColumnTransformer's default)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
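The `stratify=target` argument in the split above is easy to overlook, so here is a rough sketch of the idea in plain Python. The `stratified_split` function is a hypothetical helper written for this post; scikit-learn's own implementation differs in its details, but the goal is the same: each class is split separately so both sets keep the original diabetes rate.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 15% positive rate (illustrative only)
labels = [0] * 850 + [1] * 150
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Train positive rate: {train_rate:.3f}")
print(f"Test positive rate:  {test_rate:.3f}")
```

Without stratification, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer positive cases than the training set, which makes the evaluation less trustworthy.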
1. Random Forest Classifier Implementation
Our baseline model achieved solid performance with balanced class weights. A Random Forest is a straightforward starting point when you want an ensemble method for classification.
# Random Forest with balanced class weights
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced',
        random_state=42
    ))
])
# Train the model
print("Training Random Forest model...")
rf_pipeline.fit(X_train, y_train)
# Make predictions
y_pred_rf = rf_pipeline.predict(X_test)
y_pred_proba_rf = rf_pipeline.predict_proba(X_test)[:, 1]
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_rf):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
# Plot confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Non-Diabetic', 'Diabetic'],
            yticklabels=['Non-Diabetic', 'Diabetic'])
plt.title('Random Forest Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
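The baseline relies on `class_weight='balanced'`, so it may help to see what that setting works out to. scikit-learn documents the balanced weight for each class as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to toy labels with an 85/15 split; the `balanced_class_weights` helper is a hypothetical name used only here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mirroring the 85/15 imbalance (illustrative only)
y = [0] * 85 + [1] * 15
weights = balanced_class_weights(y)

for cls, weight in sorted(weights.items()):
    print(f"Class {cls}: weight {weight:.3f}")
```

With these toy numbers, a mistake on a diabetic sample counts for roughly 5.7 times as much as a mistake on a non-diabetic one, which pushes the model to pay attention to the minority class.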
2. Handling Class Imbalance with SMOTE
To address the class imbalance, we implemented SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is worthy of its own post, so let’s treat it as a black box that turns our imbalanced training data into a more balanced training set.
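# SMOTE for handling class imbalance
# Plain SMOTE needs all-numeric input, but X_train still contains the text columns
# Age_Group and BMI_Category, so this uses SMOTENC (the SMOTE variant for mixed
# numeric and categorical data) and tells it which columns are categorical
from imblearn.over_sampling import SMOTENC
text_columns = [X_train.columns.get_loc(col) for col in ['Age_Group', 'BMI_Category']]
over = SMOTENC(categorical_features=text_columns, sampling_strategy=0.7, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)
# Create sampling pipeline
sampling_pipeline = imbPipeline([
    ('over', over),
    ('under', under)
])
# Apply sampling to training data
X_train_resampled, y_train_resampled = sampling_pipeline.fit_resample(X_train, y_train)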
print("Original training set distribution:")
print(pd.Series(y_train).value_counts())
print("\nResampled training set distribution:")
print(pd.Series(y_train_resampled).value_counts())
# Train Random Forest on resampled data
rf_resampled = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        class_weight='balanced',
        random_state=42
    ))
])
rf_resampled.fit(X_train_resampled, y_train_resampled)
y_pred_resampled = rf_resampled.predict(X_test)
y_pred_proba_resampled = rf_resampled.predict_proba(X_test)[:, 1]
print("\nRandom Forest with SMOTE Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_resampled):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_resampled):.4f}")
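Even when treating the resampler as a black box, two details are worth a quick look: what the two `sampling_strategy` ratios do to the class counts, and how a synthetic minority sample is built. The sketch below is a simplified illustration in plain Python, not imbalanced-learn's actual code. The helper names, the toy class counts, and the `[BMI, age]` rows are all assumptions made for this example, and the library's own rounding may differ by a sample.

```python
import math
import random


def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Approximate class counts after over-sampling, then under-sampling.

    Each ratio is the target minority/majority ratio after that step.
    """
    n_minority_after = max(n_minority, round(over_ratio * n_majority))
    n_majority_after = min(n_majority, round(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after


def k_nearest(sample, candidates, k):
    """The k candidates closest to `sample` by Euclidean distance."""
    others = [c for c in candidates if c is not sample]
    return sorted(others, key=lambda c: math.dist(sample, c))[:k]


def smote_point(sample, neighbor, rng):
    """One synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# 1) Effect of the two ratios on hypothetical class counts
counts = resampled_counts(85_000, 15_000, over_ratio=0.7, under_ratio=0.8)
print(f"Majority, minority after resampling: {counts}")

# 2) Building one synthetic sample from hypothetical [BMI, age] minority rows
rng = random.Random(42)
minority = [[30.0, 55.0], [32.0, 60.0], [35.0, 58.0], [40.0, 70.0]]
base = minority[0]
neighbor = rng.choice(k_nearest(base, minority, k=2))
synthetic = smote_point(base, neighbor, rng)
print(f"Base: {base}, neighbor: {neighbor}, synthetic: {synthetic}")
```

The key point is that synthetic samples are interpolations between real minority samples, so they add variety without simply duplicating rows. Resampling is applied to the training data only; the test set keeps its original distribution.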
3. Multiple Algorithm Comparison
We tested several algorithms to find the best performer. Here’s how we implemented each:
# Import additional algorithms
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
# Define our models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}
# Train and evaluate all models
results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    # Train model
    pipeline.fit(X_train_resampled, y_train_resampled)
    # Make predictions
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    results[name] = {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    print(f"{name} - Accuracy: {accuracy:.4f}, ROC AUC: {roc_auc:.4f}")
# Display results comparison
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[model]['accuracy'] for model in results.keys()],
    'ROC AUC': [results[model]['roc_auc'] for model in results.keys()]
})
print("\nModel Performance Comparison:")
print(results_df.sort_values('Accuracy', ascending=False))
Results Summary:
- Random Forest: 85.8% accuracy
- XGBoost: 85.8% accuracy
- SVM: 83.7% accuracy
- KNN: 78.7% accuracy
- AdaBoost: 73.8% accuracy
- Logistic Regression: 77.4% accuracy
Implementation tip: Use scikit-learn for traditional ML algorithms, XGBoost for gradient boosting, and consider the trade-offs when choosing between different algorithms.
Key Findings and Insights
Feature Importance Analysis
Let’s extract and visualize the feature importance from our Random Forest model. The fitted classifier exposes these scores through its feature_importances_ attribute, which gives insight into what drives our predictions.
# Get feature names after preprocessing
# (categories_ is a list of arrays, one per categorical column, so pair it with the column names;
# drop='first' removes the first category of each column, hence vals[1:])
fitted_onehot = rf_resampled.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
feature_names = (numeric_features +
                 [f"{col}_{val}"
                  for col, vals in zip(categorical_features, fitted_onehot.categories_)
                  for val in vals[1:]])
# Extract feature importance from the best model
best_rf_model = rf_resampled.named_steps['classifier']
feature_importance = best_rf_model.feature_importances_
# Create feature importance dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)
# Plot feature importance
plt.figure(figsize=(12, 8))
top_features = importance_df.head(15)
sns.barplot(x='importance', y='feature', data=top_features, palette='viridis')
plt.title('Top 15 Most Important Features for Diabetes Prediction')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()
# Print top 10 most important features
print("Top 10 Most Important Features:")
print(importance_df.head(10))
The Random Forest model revealed the most important predictors of diabetes:
- BMI (Body Mass Index) - The strongest predictor
- Age - Higher age groups showed increased risk
- Blood Pressure - Hypertension as a significant factor
- Physical Activity - Lower activity levels associated with higher risk
- Income Level - Socioeconomic factors playing a role
Model Performance Visualization
Let’s create comprehensive visualizations to compare our models, using matplotlib and seaborn to build the performance comparison charts.
# Plot ROC curves for all models
from sklearn.metrics import roc_curve, roc_auc_score
plt.figure(figsize=(12, 8))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
    auc_score = roc_auc_score(y_test, result['probabilities'])
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for All Models')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Create performance comparison bar chart
plt.figure(figsize=(12, 6))
# Accuracy comparison
plt.subplot(1, 2, 1)
accuracies = [results[model]['accuracy'] for model in results.keys()]
plt.bar(results.keys(), accuracies, color='skyblue', alpha=0.7)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.ylim(0.7, 0.9)
# ROC AUC comparison
plt.subplot(1, 2, 2)
roc_aucs = [results[model]['roc_auc'] for model in results.keys()]
plt.bar(results.keys(), roc_aucs, color='lightcoral', alpha=0.7)
plt.title('Model ROC AUC Comparison')
plt.ylabel('ROC AUC')
plt.xticks(rotation=45)
plt.ylim(0.7, 0.9)
plt.tight_layout()
plt.show()
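Since ROC AUC appears next to accuracy in these charts, here is a small sketch of one way to read it: the probability that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case, with ties counting as half. The `roc_auc_pairwise` function is a hypothetical helper for illustration; it compares every positive/negative pair, which is fine for a toy example but far too slow for the full dataset.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true, scores)
print(f"ROC AUC: {auc:.2f}")
```

Because it depends only on the ranking of scores, ROC AUC is not affected by the decision threshold, which makes it a useful companion to accuracy on imbalanced data.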
Model Performance Insights
Our best-performing models (Random Forest and XGBoost) achieved approximately 85.8% accuracy. However, accuracy alone doesn’t tell the full story in healthcare applications.
Why this matters: In healthcare, false negatives (missing a diabetic patient) can be more costly than false positives (incorrectly flagging someone as diabetic). This is why we also focused on:
- Precision: How many of our positive predictions were correct
- Recall: How many actual positive cases we caught
- F1-Score: Balanced measure of precision and recall
Which model we ultimately choose depends on the evaluation metric we prioritize: overall accuracy, or a balance of precision and recall.
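To make the three metrics above concrete, here is a short sketch that computes them from confusion-matrix counts. The counts are hypothetical numbers chosen for illustration, not results from our models.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confusion-matrix counts for 200 test patients
tp, fp, fn, tn = 30, 10, 20, 140

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

In this made-up example the accuracy is 85%, yet 20 of the 50 diabetic patients are missed. That gap between accuracy and recall is exactly what the metrics above are meant to expose.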
Challenges and Lessons Learned
1. Data Quality Issues and Solutions
The dataset had some missing values and inconsistencies that required careful handling. Here’s how we addressed them:
# Check data quality issues
print("Data Quality Report:")
print(f"Total missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")
# Check for outliers in numeric columns
numeric_cols = ['Age', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth']
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
    print(f"Outliers in {col}: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
# Handle outliers using IQR method
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df
# Apply outlier handling to numeric columns
for col in numeric_cols:
    df = handle_outliers(df, col)
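A small worked example may make the IQR rule easier to follow. The sketch below applies the same 1.5 × IQR clipping to a handful of made-up values using only the standard library. The `clip_outliers_iqr` helper is hypothetical, and quartile conventions vary between libraries, so on other data the bounds can differ slightly from what pandas reports.

```python
import statistics


def clip_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] and return the result with the bounds."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    clipped = [min(max(v, lower), upper) for v in values]
    return clipped, (lower, upper)


# Made-up values with one extreme entry
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
clipped, bounds = clip_outliers_iqr(values)

print(f"Bounds:  {bounds}")
print(f"Clipped: {clipped}")
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4 and the allowed range is -3 to 13. The extreme value 100 is pulled down to 13 while everything else is left alone. Clipping keeps the row in the dataset, which is gentler than deleting it.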
2. Feature Selection and Engineering
Not all features contributed equally to prediction accuracy. We used feature importance analysis and correlation studies to decide what to keep: we plot the feature correlation matrix, then drop features that are highly correlated with another feature.
# Feature correlation analysis
plt.figure(figsize=(15, 10))
correlation_matrix = df[numeric_features + ['Diabetes_binary']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
# Remove highly correlated features to reduce multicollinearity
def remove_highly_correlated_features(df, threshold=0.8):
    # numeric_only=True skips the text columns Age_Group and BMI_Category
    correlation_matrix = df.corr(numeric_only=True).abs()
    upper_triangle = correlation_matrix.where(
        np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
    )
    to_drop = [column for column in upper_triangle.columns
               if any(upper_triangle[column] > threshold)]
    print(f"Features to drop due to high correlation: {to_drop}")
    return df.drop(columns=to_drop)
# Apply correlation-based feature selection
df_cleaned = remove_highly_correlated_features(df, threshold=0.8)
3. Model Interpretability and Validation
In healthcare, model interpretability is crucial. Here’s how we made our models easier to interpret, using a small helper that explains individual predictions and a cross-validation check for stability.
# Create a function to explain individual predictions
def explain_prediction(model, preprocessor, sample_data, feature_names):
    """
    Explain a single prediction using feature importance
    """
    # Preprocess the sample
    sample_processed = preprocessor.transform(sample_data)
    # Get feature importance
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
    else:
        # For models without feature_importances_, use coefficients
        importance = np.abs(model.coef_[0]) if hasattr(model, 'coef_') else None
    if importance is not None:
        # Create explanation dataframe
        # Note: value * global importance is a rough heuristic for ranking features,
        # not an exact per-prediction attribution
        explanation = pd.DataFrame({
            'feature': feature_names,
            'value': sample_processed[0],
            'importance': importance,
            'contribution': sample_processed[0] * importance
        }).sort_values('contribution', key=abs, ascending=False)
        return explanation.head(10)
    return None
# Example: Explain a prediction for a sample case
sample_case = X_test.iloc[0:1]
explanation = explain_prediction(
    rf_resampled.named_steps['classifier'],
    rf_resampled.named_steps['preprocessor'],
    sample_case,
    feature_names
)
if explanation is not None:
    print("Feature Contribution to Prediction:")
    print(explanation)
# Cross-validation to ensure model stability
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(
    rf_resampled,
    X_train_resampled,
    y_train_resampled,
    cv=5,
    scoring='accuracy'
)
print(f"\nCross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
4. Hyperparameter Tuning
We used GridSearchCV to optimize our best-performing model. Implementation tip: scikit-learn’s GridSearchCV, or a similar tool such as RandomizedSearchCV, can automate hyperparameter optimization.
# Define parameter grid for Random Forest
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 15, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}
# Perform grid search
grid_search = GridSearchCV(
    rf_resampled,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_resampled, y_train_resampled)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# Use the best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print(f"Best model test accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
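Before launching a grid search, it is worth estimating how much work it involves. The sketch below counts the parameter combinations in the grid used above and multiplies by the number of cross-validation folds. It uses only the standard library; the variable names are illustrative.

```python
from itertools import product

# Same grid as the search above
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 15, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_candidates = len(candidates)
cv_fits = n_candidates * cv_folds

print(f"Candidate settings: {n_candidates}")
print(f"Cross-validation fits: {cv_fits}")
print(f"First candidate: {candidates[0]}")
```

That comes to 108 candidates and 324 model fits, plus one final refit of the best candidate on the full training set (GridSearchCV refits by default). Adding a single extra value to any parameter multiplies the total, so it pays to keep the grid small or to switch to a randomized search when training is slow.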
Practical Applications
This model could be integrated into existing services such as healthcare systems or preventive care products. Possible applications include:
- Primary care screening tools to identify high-risk patients
- Population health management programs
- Preventive care initiatives in healthcare systems
- Research studies on diabetes risk factors
Next Steps and Improvements
While our model shows promising results, there are several areas for improvement. Each has its own pros and cons, and pursuing them together should deliver the best results.
- External validation on different populations
- Real-time data integration for continuous monitoring
- Integration with electronic health records
- Development of a user-friendly interface for healthcare providers
Implementation tip: Start with external validation. As your product scales and expands to real-time monitoring, you can start looking into data integration. Then, when the model starts putting too much load on your healthcare systems, you can consider optimization techniques.
Conclusion
Building this diabetes prediction model was an excellent exercise in applying machine learning to real-world healthcare problems. The project reinforced several important principles:
- Data quality is paramount in healthcare applications
- Class imbalance requires careful consideration and appropriate techniques
- Model interpretability is crucial for healthcare adoption
- Multiple algorithms should be tested to find the best solution
The 85.8% accuracy achieved by our best models demonstrates the potential of machine learning in diabetes risk assessment. However, it’s important to remember that this is a screening tool, not a diagnostic replacement.
As I continue my journey in AI and healthcare applications, I’m excited to explore more sophisticated techniques and larger datasets. The intersection of machine learning and healthcare offers tremendous opportunities to improve patient outcomes and advance medical research.
What’s your experience with healthcare machine learning? I’d love to hear about your projects or thoughts on applying AI to medical applications. Feel free to share in the comments below!
This project was part of my ongoing exploration into healthcare applications of machine learning. For more on my AI learning journey, check out my previous posts on image classification and why I went back to school for AI.