Building a Diabetes Prediction Model: From Data to Insights

Ardy included in Development Tutorial Machine-Learning Healthcare

2025-07-25 2789 words 14 minutes

Diabetes affects millions of people worldwide, and early detection can significantly improve health outcomes. In this post, I’ll walk you through my journey of building a machine learning model to predict diabetes risk using real health data from the CDC (Centers for Disease Control and Prevention).

This project represents my latest exploration into healthcare machine learning, building on my previous work with image classification and other AI applications. The goal was to create a robust model that could help identify individuals at risk of diabetes based on various health indicators.

We, software engineers and data scientists, spend a good chunk of time building models that can make a real impact. And we should, because predictive modeling in healthcare allows us to:

Detect potential health risks before they become serious problems
Check if our models are doing what they’re supposed to do
Ensure our predictions are reliable and actionable
Maintain consistent model performance and interpretability

The Dataset: CDC Diabetes Health Indicators

We’re working with the CDC Diabetes Health Indicators dataset, which contains comprehensive health information from over 240,000 individuals. This dataset includes various health metrics that could be predictive of diabetes risk.

The dataset includes the following key features:

Demographic data: Age, sex, education level, income
Health indicators: BMI, blood pressure, cholesterol levels
Lifestyle factors: Physical activity, smoking status, alcohol consumption
Medical history: Heart disease, stroke, kidney disease
Target variable: Diabetes status (binary: diabetic/non-diabetic)

For this project the dataset is effectively read-only: we load it once for analysis and model training, and it only changes when new health records are added.

Setting Up Our Environment

First, let’s import the necessary libraries and set up our workspace. The setup is straightforward: Python with pandas and scikit-learn in a Jupyter notebook, plus imbalanced-learn for resampling.
    
# Import necessary libraries
import numpy as np
import pandas as pd

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import model libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Import sampler libraries for handling imbalanced data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imbPipeline

# Set the decimal format for cleaner output
pd.options.display.float_format = "{:.2f}".format

Data Exploration and Preprocessing

Loading and Initial Data Exploration

First, let’s load our dataset and get a sense of what we’re working with:
    
# Load the dataset
df = pd.read_csv('diabetes_health_indicators.csv')

# Basic information about our dataset
print(f"Dataset shape: {df.shape}")
print(f"Number of features: {df.shape[1]}")
print(f"Number of samples: {df.shape[0]}")

# Check the target variable distribution
diabetes_counts = df['Diabetes_binary'].value_counts()
print("\nDiabetes Distribution:")
print(diabetes_counts)
print(f"\nPercentage of diabetic individuals: {(diabetes_counts[1]/len(df)*100):.2f}%")

Understanding the Data Distribution

One of the first challenges we encountered was class imbalance. The dataset showed a significant imbalance between diabetic and non-diabetic individuals:

Non-diabetic individuals: ~85% of the dataset
Diabetic individuals: ~15% of the dataset

This imbalance is typical in healthcare datasets but poses challenges for model training: a model that always predicts “non-diabetic” would already be right about 85% of the time without detecting a single case. Let’s visualize this:
    
# Visualize the class distribution
plt.figure(figsize=(12, 4))

# Bar chart
plt.subplot(1, 3, 1)
sns.countplot(x='Diabetes_binary', data=df, palette=['skyblue', 'lightcoral'])
plt.title('Diabetes Distribution (Bar Chart)')
plt.xlabel('Diabetes Status (0=Non-Diabetic, 1=Diabetic)')
plt.ylabel('Count')

# Pie chart
plt.subplot(1, 3, 2)
diabetes_counts = df['Diabetes_binary'].value_counts()
plt.pie(diabetes_counts, labels=['Non-Diabetic (0)', 'Diabetic (1)'], 
        autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Diabetes Distribution (Pie Chart)')

# Table
plt.subplot(1, 3, 3)
table_data = [[label, count] for label, count in diabetes_counts.items()]
table = plt.table(cellText=table_data, colLabels=['Diabetes Status', 'Count'], loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
plt.axis('off')
plt.title('Diabetes Distribution (Table)')

plt.tight_layout()
plt.show()
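
To make the effect of this imbalance concrete, here is a minimal sketch on a made-up label vector with roughly the same 85/15 split. The 1,000-row size and the names `y`, `X`, and `baseline` are illustrative assumptions, not part of the dataset or the original notebook. A classifier that always predicts the majority class reaches about 85% accuracy while catching no diabetic cases at all:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with an assumed 85/15 split (illustrative only)
y = np.array([0] * 850 + [1] * 150)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)  # recall for the positive (diabetic) class

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall for the diabetic class: {baseline_recall:.2f}")
```

Any model we train should be judged against this kind of baseline rather than against 0%.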

Data Preprocessing and Feature Engineering

Before training our models, we need to prepare the data properly. Here we use pandas to derive a few features and handle missing values; scaling and encoding come later in a scikit-learn pipeline.
    
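# Convert target variable to integer labels.
# Map string labels only if the file uses them; a numeric 0/1 column is left as-is.
if not pd.api.types.is_numeric_dtype(df['Diabetes_binary']):
    df['Diabetes_binary'] = df['Diabetes_binary'].map({'Non-Diabetic': 0, 'Diabetic': 1})
df['Diabetes_binary'] = df['Diabetes_binary'].astype(int)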

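# Create age groups for better feature representation.
# Assumes 'Age' holds years; if your file stores BRFSS-style age buckets (1-13), adjust the cut-offs.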
def create_age_groups(age):
    if age < 35:
        return 'Young'
    elif age < 65:
        return 'Middle-aged'
    else:
        return 'Senior'

df['Age_Group'] = df['Age'].apply(create_age_groups)

# Create BMI categories
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_Category'] = df['BMI'].apply(categorize_bmi)

# Create overall health score (composite of multiple health indicators)
health_indicators = ['GenHlth', 'MentHlth', 'PhysHlth']
df['OverallHlthScore'] = df[health_indicators].mean(axis=1)

# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Handle missing values
df = df.fillna(df.median(numeric_only=True))  # Numeric columns: fill with the median
df = df.fillna(df.mode().iloc[0])  # Remaining (categorical) columns: fill with the mode
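
The row-by-row `apply` calls above are easy to read but slow on a few hundred thousand rows. As a hedged alternative, the sketch below bins BMI with `pd.cut`, using the same cut-offs as `categorize_bmi`. The sample values and the `bmi` / `bmi_category` names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Small made-up sample of BMI values (illustrative only)
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2])

# right=False makes each bin [lower, upper), matching the `<` comparisons
bmi_category = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

print(bmi_category.tolist())
```

`pd.cut` returns a categorical column; calling `.astype(str)` on it gives the same plain string labels that the `apply` version produces.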

Feature Selection and Correlation Analysis

Let’s analyze which features are most correlated with diabetes:
    
# Calculate correlation with target variable
# (numeric_only=True skips the string columns Age_Group and BMI_Category)
corr = df.corr(numeric_only=True)
target_corr = corr['Diabetes_binary'].drop('Diabetes_binary')

# Plot correlation with diabetes
plt.figure(figsize=(10, 6))
target_corr.sort_values(ascending=True).plot(kind='barh')
plt.title('Correlation with Diabetes')
plt.xlabel('Correlation Coefficient')
plt.tight_layout()
plt.show()

# Show top 10 most correlated features
print("Top 10 features most correlated with diabetes:")
print(target_corr.abs().sort_values(ascending=False).head(10))

Model Development Strategy

Now what? There are several modeling approaches we can take, each with its own pros and cons, and some of them can be combined. Let’s prepare our data for modeling and implement our machine learning pipeline:

Data Preparation for Modeling
    
# Separate features and target
features = df.drop(['Diabetes_binary'], axis=1)
target = df['Diabetes_binary']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print(f"Training set diabetes rate: {y_train.mean():.3f}")
print(f"Testing set diabetes rate: {y_test.mean():.3f}")

# Define preprocessing pipeline
numeric_features = ['Age', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'OverallHlthScore']
categorical_features = ['Sex', 'Education', 'Income', 'Age_Group', 'BMI_Category']

# Create preprocessing transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

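categorical_transformer = Pipeline(steps=[
    # scikit-learn >= 1.2 uses sparse_output; older releases call this argument sparse
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine transformers (columns not listed above are dropped, since remainder='drop' is the default)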
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

1. Random Forest Classifier Implementation

Our baseline model achieved solid performance with balanced class weights. A Random Forest is a straightforward starting point for tabular classification because it needs little tuning to produce a reasonable result.
    
# Random Forest with balanced class weights
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced',
        random_state=42
    ))
])

# Train the model
print("Training Random Forest model...")
rf_pipeline.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_pipeline.predict(X_test)
y_pred_proba_rf = rf_pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_rf):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

# Plot confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Non-Diabetic', 'Diabetic'],
            yticklabels=['Non-Diabetic', 'Diabetic'])
plt.title('Random Forest Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

2. Handling Class Imbalance with SMOTE

To address the class imbalance, we implemented SMOTE (Synthetic Minority Over-sampling Technique), followed by random undersampling of the majority class. SMOTE is worthy of its own post, so let’s treat it as a black box that turns our imbalanced training data into a more balanced training set. Note that only the training data is resampled; the test set keeps its original distribution.
    
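# SMOTE for handling class imbalance.
# Plain SMOTE expects numeric input, but X_train still holds the engineered string
# columns (Age_Group, BMI_Category). SMOTENC is the SMOTE variant for mixed
# numeric/categorical data, so it is swapped in here.
from imblearn.over_sampling import SMOTENC

cat_idx = [X_train.columns.get_loc(col) for col in categorical_features]
over = SMOTENC(categorical_features=cat_idx, sampling_strategy=0.7, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)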

# Create sampling pipeline
sampling_pipeline = imbPipeline([
    ('over', over),
    ('under', under)
])

# Apply sampling to training data
X_train_resampled, y_train_resampled = sampling_pipeline.fit_resample(X_train, y_train)

print("Original training set distribution:")
print(pd.Series(y_train).value_counts())
print("\nResampled training set distribution:")
print(pd.Series(y_train_resampled).value_counts())

# Train Random Forest on resampled data
rf_resampled = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        class_weight='balanced',
        random_state=42
    ))
])

rf_resampled.fit(X_train_resampled, y_train_resampled)
y_pred_resampled = rf_resampled.predict(X_test)
y_pred_proba_resampled = rf_resampled.predict_proba(X_test)[:, 1]

print("\nRandom Forest with SMOTE Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_resampled):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_resampled):.4f}")
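
For readers who want a peek inside the black box, the core idea of SMOTE is interpolation: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The sketch below shows that single step in NumPy. It is a simplified illustration, not imbalanced-learn’s implementation, and the toy points and the `smote_like_sample` helper are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in two dimensions (made-up values)
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.5]])

def smote_like_sample(points, k=2, rng=rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    base = points[i]
    # Euclidean distance from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Position 0 of the sorted order is the base point itself, so skip it
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbour_idx)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies between two real minority samples, the new rows stay inside the region the minority class already occupies.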

3. Multiple Algorithm Comparison

We tested several algorithms to find the best performer. Each one is trained on the same resampled training set and evaluated on the same untouched test set:
    
# Import additional algorithms
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

# Define our models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}

# Train and evaluate all models
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Train model
    pipeline.fit(X_train_resampled, y_train_resampled)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    results[name] = {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"{name} - Accuracy: {accuracy:.4f}, ROC AUC: {roc_auc:.4f}")

# Display results comparison
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[model]['accuracy'] for model in results.keys()],
    'ROC AUC': [results[model]['roc_auc'] for model in results.keys()]
})

print("\nModel Performance Comparison:")
print(results_df.sort_values('Accuracy', ascending=False))

Results Summary:

Random Forest: 85.8% accuracy
XGBoost: 85.8% accuracy
SVM: 83.7% accuracy
KNN: 78.7% accuracy
AdaBoost: 73.8% accuracy
Logistic Regression: 77.4% accuracy

Implementation tip: Use scikit-learn for traditional ML algorithms, XGBoost for gradient boosting, and consider the trade-offs when choosing between different algorithms.

Key Findings and Insights

Feature Importance Analysis

Let’s extract and visualize the feature importance from our Random Forest model. The fitted classifier exposes these scores through its feature_importances_ attribute, which gives us insight into what drives our predictions.
    
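# Get feature names after preprocessing (numeric columns first, then one-hot columns)
onehot = preprocessor.named_transformers_['cat'].named_steps['onehot']
feature_names = numeric_features + [
    f"{col}_{val}"
    for col, vals in zip(categorical_features, onehot.categories_)
    for val in vals[1:]  # drop='first' removes the first category of each column
]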

# Extract feature importance from the best model
best_rf_model = rf_resampled.named_steps['classifier']
feature_importance = best_rf_model.feature_importances_

# Create feature importance dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
top_features = importance_df.head(15)
sns.barplot(x='importance', y='feature', data=top_features, palette='viridis')
plt.title('Top 15 Most Important Features for Diabetes Prediction')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

# Print top 10 most important features
print("Top 10 Most Important Features:")
print(importance_df.head(10))

The Random Forest model revealed the most important predictors of diabetes:

BMI (Body Mass Index) - The strongest predictor
Age - Higher age groups showed increased risk
Blood Pressure - Hypertension as a significant factor
Physical Activity - Lower activity levels associated with higher risk
Income Level - Socioeconomic factors playing a role
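
Impurity-based importances from a Random Forest can favour features with many distinct values, so a cross-check is often worthwhile. The sketch below uses scikit-learn’s `permutation_importance` on synthetic data from `make_classification`. It is an alternative technique, not the one used for the ranking above, and the data, feature indices, and variable names are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the health data (imbalanced, 6 features)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in ROC AUC
perm = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")
```

If the two rankings disagree strongly, that is a hint to treat the impurity-based list with some caution.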

Model Performance Visualization

Let’s create visualizations to compare our models: ROC curves for every classifier, followed by bar charts of accuracy and ROC AUC.

# roc_curve was not part of the earlier imports
from sklearn.metrics import roc_curve

# Plot ROC curves for all models
plt.figure(figsize=(12, 8))

for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
    auc_score = roc_auc_score(y_test, result['probabilities'])
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for All Models')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Create performance comparison bar chart
plt.figure(figsize=(12, 6))

# Accuracy comparison
plt.subplot(1, 2, 1)
accuracies = [results[model]['accuracy'] for model in results.keys()]
plt.bar(results.keys(), accuracies, color='skyblue', alpha=0.7)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.ylim(0.7, 0.9)

# ROC AUC comparison
plt.subplot(1, 2, 2)
roc_aucs = [results[model]['roc_auc'] for model in results.keys()]
plt.bar(results.keys(), roc_aucs, color='lightcoral', alpha=0.7)
plt.title('Model ROC AUC Comparison')
plt.ylabel('ROC AUC')
plt.xticks(rotation=45)
plt.ylim(0.7, 0.9)

plt.tight_layout()
plt.show()

Model Performance Insights

Our best-performing models (Random Forest and XGBoost) achieved approximately 85.8% accuracy. However, accuracy alone doesn’t tell the full story in healthcare applications.

Why this matters: In healthcare, false negatives (missing a diabetic patient) can be more costly than false positives (incorrectly flagging someone as diabetic). This is why we also focused on:

Precision: How many of our positive predictions were correct
Recall: How many actual positive cases we caught
F1-Score: Balanced measure of precision and recall
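
These three metrics all come straight from the confusion matrix. The sketch below computes them by hand on a tiny set of made-up labels (the `y_true` and `y_hat` arrays are illustrative, not model output) and checks the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions: 1 = diabetic, 0 = non-diabetic
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```

In this toy case accuracy is 70%, yet only half of the diabetic cases are caught, which is exactly the gap that accuracy hides.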

Which model we ultimately prefer depends on the evaluation metric we prioritize: raw accuracy, or a balance of precision and recall that reflects the higher cost of false negatives.
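
One practical lever for shifting that balance is the decision threshold applied to predicted probabilities. The sketch below uses made-up probabilities (not output from the models in this post) to show how lowering the threshold from 0.5 to 0.25 trades precision for recall. The threshold values themselves are arbitrary examples:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities for the diabetic class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.40, 0.70, 0.90])

threshold_metrics = {}
for threshold in (0.5, 0.25):
    # Flag a case as diabetic when its probability reaches the threshold
    y_hat = (proba >= threshold).astype(int)
    threshold_metrics[threshold] = (
        precision_score(y_true, y_hat),
        recall_score(y_true, y_hat),
    )
    p, r = threshold_metrics[threshold]
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

With a fitted pipeline, the same idea applies to the output of `predict_proba(X_test)[:, 1]`. The threshold should be chosen on validation data, not on the test set.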

Challenges and Lessons Learned

1. Data Quality Issues and Solutions

The dataset had some missing values and inconsistencies that required careful handling. Here’s how we checked for them and how we handled outliers:
    
# Check data quality issues
print("Data Quality Report:")
print(f"Total missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check for outliers in numeric columns
numeric_cols = ['Age', 'BMI', 'GenHlth', 'MentHlth', 'PhysHlth']
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
    print(f"Outliers in {col}: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")

# Handle outliers using IQR method
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df

# Apply outlier handling to numeric columns
for col in numeric_cols:
    df = handle_outliers(df, col)

2. Feature Selection and Engineering

Not all features contributed equally to prediction accuracy. We used feature importance analysis and correlation studies: the code below plots the correlation matrix and then drops one feature from every pair whose absolute correlation exceeds a threshold.
    
# Feature correlation analysis
plt.figure(figsize=(15, 10))
correlation_matrix = df[numeric_features + ['Diabetes_binary']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Remove highly correlated features to reduce multicollinearity
def remove_highly_correlated_features(df, threshold=0.8):
    # numeric_only=True skips string columns such as Age_Group and BMI_Category
    correlation_matrix = df.corr(numeric_only=True).abs()
    upper_triangle = correlation_matrix.where(
        np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
    )
    
    to_drop = [column for column in upper_triangle.columns 
               if any(upper_triangle[column] > threshold)]
    
    print(f"Features to drop due to high correlation: {to_drop}")
    return df.drop(columns=to_drop)

# Apply correlation-based feature selection
df_cleaned = remove_highly_correlated_features(df, threshold=0.8)

3. Model Interpretability and Validation

In healthcare, model interpretability is crucial. Here’s how we approached it: a small helper that ranks the features behind an individual prediction, followed by cross-validation to check model stability.
    
# Create a function to explain individual predictions
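def explain_prediction(model, preprocessor, sample_data, feature_names):
    """
    Rough explanation of a single prediction: global feature importance
    (or absolute coefficient) multiplied by the sample's preprocessed values.
    This is a heuristic ranking, not an exact per-prediction attribution.
    """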
    # Preprocess the sample
    sample_processed = preprocessor.transform(sample_data)
    
    # Get feature importance
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
    else:
        # For models without feature_importances_, use coefficients
        importance = np.abs(model.coef_[0]) if hasattr(model, 'coef_') else None
    
    if importance is not None:
        # Create explanation dataframe
        explanation = pd.DataFrame({
            'feature': feature_names,
            'value': sample_processed[0],
            'importance': importance,
            'contribution': sample_processed[0] * importance
        }).sort_values('contribution', key=abs, ascending=False)
        
        return explanation.head(10)
    
    return None

# Example: Explain a prediction for a sample case
sample_case = X_test.iloc[0:1]
explanation = explain_prediction(
    rf_resampled.named_steps['classifier'],
    rf_resampled.named_steps['preprocessor'],
    sample_case,
    feature_names
)

if explanation is not None:
    print("Feature Contribution to Prediction:")
    print(explanation)

# Cross-validation to ensure model stability
# Note: the folds are drawn from the resampled data, so these scores can be optimistic
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    rf_resampled, 
    X_train_resampled, 
    y_train_resampled, 
    cv=5, 
    scoring='accuracy'
)

print(f"\nCross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
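
Cross-validating on data that has already been resampled lets synthetic or duplicated minority rows end up in the validation folds. A stricter variant resamples inside each training fold only. The sketch below shows the idea with plain random oversampling on synthetic data. It uses scikit-learn alone, so it stands in for SMOTE rather than reproducing it, and all names and sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the training data
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=42
)

fold_recalls = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class inside the training fold only
    minority = X_tr[y_tr == 1]
    n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
    extra = resample(minority, replace=True, n_samples=n_extra, random_state=42)
    X_bal = np.vstack([X_tr, extra])
    y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=y_tr.dtype)])

    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_bal, y_bal)

    # The validation fold keeps its original, imbalanced distribution
    fold_recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))

print(f"Recall per fold: {np.round(fold_recalls, 3)}")
print(f"Mean recall: {np.mean(fold_recalls):.3f}")
```

With imbalanced-learn, placing the sampler as a step inside an `imblearn` pipeline achieves the same effect, because samplers in that pipeline are applied only during fitting.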

4. Hyperparameter Tuning

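We used GridSearchCV to optimize our best-performing model, the Random Forest. Implementation tip: the grid below has 108 parameter combinations, so with cv=3 it fits 324 models before the final refit; scikit-learn’s RandomizedSearchCV is a cheaper alternative if that is too slow.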
    
# Define parameter grid for Random Forest
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 15, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Perform grid search
grid_search = GridSearchCV(
    rf_resampled,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_resampled, y_train_resampled)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Use the best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print(f"Best model test accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
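
The search above optimizes accuracy, but given the cost of false negatives it can make sense to select on recall while still tracking other metrics. Here is a minimal sketch of a multi-metric search on synthetic data; the small grid, the choice of `refit="recall"`, and the variable names are assumptions for illustration, not the settings used in this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, 6]},
    scoring=["recall", "roc_auc", "accuracy"],  # track several metrics
    refit="recall",                             # pick the best model by recall
    cv=3,
)
search.fit(X, y)

print(f"Best parameters (by recall): {search.best_params_}")
for metric in ("recall", "roc_auc", "accuracy"):
    score = search.cv_results_[f"mean_test_{metric}"][search.best_index_]
    print(f"Mean CV {metric}: {score:.4f}")
```

Reporting all three numbers side by side makes the trade-off visible instead of hiding it behind a single score.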

Practical Applications

This model could be integrated into existing services such as healthcare systems or preventive care products. In that setting, the prediction component should be kept isolated from the main service so it can be updated and validated independently. Possible applications include:

Primary care screening tools to identify high-risk patients
Population health management programs
Preventive care initiatives in healthcare systems
Research studies on diabetes risk factors
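
Any of these integrations starts with getting the fitted pipeline out of the notebook. A minimal sketch, assuming the model is persisted with joblib and wrapped in a small scoring function; the file name, the `risk_score` helper, and the synthetic data are hypothetical and stand in for the real pipeline:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on synthetic data (replace with the real fitted pipeline)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Persist preprocessing and model together as one artifact
model_path = os.path.join(tempfile.mkdtemp(), "diabetes_risk_model.joblib")
joblib.dump(pipeline, model_path)

def risk_score(model, rows):
    """Return the predicted probability of the positive class for each row."""
    return model.predict_proba(rows)[:, 1]

# In the serving component: load once, then score incoming records
loaded = joblib.load(model_path)
scores = risk_score(loaded, X[:3])
print(scores)
```

Saving the whole pipeline rather than just the classifier keeps scaling and encoding consistent between training and serving. Only load joblib files from sources you trust.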

Next Steps and Improvements

While our model shows promising results, there are several areas for improvement. Each has its own pros and cons, and they work best in combination:

External validation on different populations
Real-time data integration for continuous monitoring
Integration with electronic health records
Development of a user-friendly interface for healthcare providers

Implementation tip: Start with external validation. As your product scales toward real-time monitoring, start looking into data integration. Only when inference begins to put too much load on the host healthcare systems should you consider optimization techniques.

Conclusion

Building this diabetes prediction model was an excellent exercise in applying machine learning to real-world healthcare problems. The project reinforced several important principles:

Data quality is paramount in healthcare applications
Class imbalance requires careful consideration and appropriate techniques
Model interpretability is crucial for healthcare adoption
Multiple algorithms should be tested to find the best solution

The 85.8% accuracy achieved by our best models demonstrates the potential of machine learning in diabetes risk assessment. However, it’s important to remember that this is a screening tool, not a diagnostic replacement.

As I continue my journey in AI and healthcare applications, I’m excited to explore more sophisticated techniques and larger datasets. The intersection of machine learning and healthcare offers tremendous opportunities to improve patient outcomes and advance medical research.

What’s your experience with healthcare machine learning? I’d love to hear about your projects or thoughts on applying AI to medical applications. Feel free to share in the comments below!

This project was part of my ongoing exploration into healthcare applications of machine learning. For more on my AI learning journey, check out my previous posts on image classification and why I went back to school for AI.