Exploration and analysis of risk factors for coronary artery disease with type 2 diabetes based on SHAP explainable machine learning algorithm – nature.com

Report on the Application of Machine Learning for Coronary Heart Disease and Type 2 Diabetes Risk Prediction in Alignment with Sustainable Development Goals
Executive Summary
This report details a study focused on developing an interpretable machine learning model to predict the comorbidity of Coronary Heart Disease (CHD) and Type 2 Diabetes Mellitus (T2DM). The research directly supports the United Nations Sustainable Development Goals (SDGs), particularly SDG 3 (Good Health and Well-being), by aiming to reduce premature mortality from non-communicable diseases (NCDs) through improved prevention and diagnosis. Furthermore, the study leverages advanced technology, contributing to SDG 9 (Industry, Innovation, and Infrastructure) by promoting scientific research and technological innovation in healthcare. By identifying key risk factors and creating an accessible predictive tool, this work also has implications for SDG 10 (Reduced Inequalities) by potentially making advanced diagnostics more affordable and widely available.
1. Introduction: Addressing Non-Communicable Diseases Through Innovation
The global rise of NCDs, such as CHD and T2DM, presents a significant challenge to achieving SDG 3, which targets a one-third reduction in premature mortality from NCDs by 2030. The co-occurrence of these two conditions, termed CHD-DM2, exacerbates patient risk and complicates clinical management. Conventional diagnostic methods are often costly and inaccessible, creating disparities in healthcare outcomes and hindering progress toward universal health coverage, a cornerstone of SDG 3.
This study addresses this gap by harnessing the power of machine learning, an innovation aligned with SDG 9. The objective was to develop and validate a robust, interpretable clinical risk prediction model for CHD-DM2. By identifying critical risk factors, the model aims to provide a low-cost, non-invasive tool to support clinical decision-making, facilitate early intervention, and ultimately contribute to better health outcomes and reduced mortality, in line with the ambitions of the SDGs.
2. Methodology: A Framework for Responsible Health Innovation
2.1. Study Population and Data Collection
A retrospective analysis was conducted on clinical data from 12,400 cardiovascular inpatients admitted between 2001 and 2018. The cohort comprised 10,257 CHD patients and 2,143 CHD-DM2 patients. This large-scale data utilization is fundamental for building effective public health tools and supports the evidence-based approach required to meet SDG targets.
2.2. Data Preprocessing and Feature Selection
To ensure model accuracy and reliability, a rigorous data preprocessing workflow was implemented, reflecting a commitment to high-quality scientific research as promoted by SDG 9.
- Handling Missing Data: Variables with over 30% missing values were excluded, while multiple imputation was used for variables with less than 30% missing data to maintain dataset integrity.
- Addressing Class Imbalance: The dataset exhibited a significant class imbalance between CHD and CHD-DM2 cases. The SMOTENC algorithm was applied to create a balanced dataset, preventing model bias and ensuring that the predictive tool is effective for the minority (CHD-DM2) group. This step is crucial for creating equitable health solutions, a principle of SDG 10.
- Feature Selection: A combination of univariate analysis and LASSO regression identified the most impactful predictors. This process reduced 62 potential variables to a final set of 25, enhancing model efficiency and interpretability.
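The missing-data rule above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the function name `split_by_missingness`, the toy columns, and the choice to keep a variable with exactly 30% missing values are assumptions, since the report does not say how the boundary case was handled.

```python
from typing import Dict, List, Optional, Tuple


def split_by_missingness(
    columns: Dict[str, List[Optional[float]]],
    threshold: float = 0.30,
) -> Tuple[List[str], List[str]]:
    """Return (kept, dropped) variable names based on the share of missing values."""
    kept: List[str] = []
    dropped: List[str] = []
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        # Assumption: strictly more than `threshold` missing means exclusion;
        # everything else would be passed on to imputation.
        if missing_fraction > threshold:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Hypothetical toy data; None marks a missing measurement.
toy = {
    "HbA1c": [6.1, None, 7.4, 5.9, 8.0],          # 20% missing, kept
    "BG": [5.4, 6.8, 9.1, 5.0, 7.7],              # 0% missing, kept
    "rare_marker": [None, None, 1.2, None, 0.9],  # 60% missing, dropped
}
kept, dropped = split_by_missingness(toy)
print(kept, dropped)
```

The kept variables would then go through imputation, class balancing, and feature selection, which this sketch does not attempt to reproduce.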
2.3. Model Development and Evaluation
Seven machine learning models were developed to identify the most effective algorithm for this clinical prediction task. This comparative approach ensures the selection of a high-performing and robust solution.
- Logistic Regression
- LASSO-regularized Logistic Regression (Logistic_Lasso)
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- XGBoost
- Random Forest (RF)
- LightGBM
Model performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Decision Curve Analysis (DCA) was used to evaluate clinical utility. To enhance transparency and trust, which are critical for the adoption of AI in healthcare (SDG 9), SHAP (SHapley Additive exPlanations) values were employed to interpret the model’s predictions.
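As a rough illustration of how these metrics are defined, the sketch below computes accuracy, sensitivity, and specificity from a confusion matrix and AUC from pairwise rankings. The labels, scores, and the 0.5 decision threshold are hypothetical toy values, not data from the study.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for binary labels (1 = CHD-DM2).

    Assumes both classes are present, otherwise a division by zero occurs.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc_rank(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for 3 CHD-DM2 and 5 CHD-only patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = confusion_metrics(y_true, y_pred)
auc = auc_rank(y_true, scores)
print(metrics, auc)
```

Accuracy, sensitivity, and specificity depend on the chosen cut-off, whereas AUC depends only on how the scores rank patients.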
3. Results: High-Performance Models for Improved Health Outcomes
3.1. Predictive Model Performance
Models trained on the SMOTENC-balanced dataset performed markedly better than those trained on the original imbalanced data, underscoring the importance of addressing data imbalance. The Random Forest (RF) model emerged as the top-performing algorithm.
- Accuracy: The RF model achieved perfect accuracy (1.0) on the balanced training and test sets.
- AUC: The RF model yielded an AUC of 1.0, meaning it ranked every CHD-DM2 case above every CHD-only case.
- Clinical Utility (DCA): The RF model showed the highest net benefit in the DCA, confirming its potential for practical clinical application in identifying at-risk patients, thereby directly contributing to the preventative goals of SDG 3.
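For readers unfamiliar with DCA, the net benefit at a threshold probability p_t is commonly defined as TP/N − (FP/N) · p_t / (1 − p_t), and is compared against a "treat everyone" strategy. The sketch below applies that standard formula to hypothetical toy predictions; the data and the 0.2 threshold are assumptions, and the study's own DCA curves are not reproduced here.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], probs: Sequence[float], pt: float) -> float:
    """Net benefit of treating patients whose predicted risk is at least pt."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * pt / (1 - pt)


def net_benefit_treat_all(y_true: Sequence[int], pt: float) -> float:
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)


# Hypothetical predicted risks for 3 CHD-DM2 and 5 CHD-only patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

nb_model = net_benefit(y_true, probs, pt=0.2)
nb_all = net_benefit_treat_all(y_true, pt=0.2)
print(nb_model, nb_all)
```

A full decision curve repeats this calculation over a range of thresholds; a model is considered clinically useful where its curve lies above both the treat-all and treat-none (net benefit 0) strategies.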
3.2. Identification of Key Risk Factors
The analysis consistently identified a core set of variables as the most significant predictors for CHD-DM2 across multiple models. This finding is critical for guiding targeted public health interventions and patient monitoring.
- Primary Risk Factors: Diabetes History, Blood Glucose (BG), and Glycated Hemoglobin (HbA1c) were consistently ranked as the top contributors to CHD-DM2 risk.
- Other Significant Factors: Left ventricular end-systolic diameter (LVES), insulin secretion rate (ISR), and left anterior descending artery (LAD) status were also identified as important predictors.
3.3. Model Interpretability through SHAP Analysis
The SHAP analysis provided clear, interpretable insights into the model’s decision-making process, bridging the gap between complex AI (SDG 9) and clinical practice (SDG 3). The analysis confirmed that a positive Diabetes History and higher BG and HbA1c values strongly increased the predicted risk of CHD-DM2. This transparency allows clinicians to understand and trust the model’s output, facilitating its integration into patient care pathways.
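To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a single patient by enumerating every feature coalition. Everything in it is hypothetical: the linear `toy_risk` model, its coefficients, and the patient and baseline values are invented for illustration and are unrelated to the study's Random Forest, for which a library implementation would presumably be used rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(
    predict: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(x)
    n = len(names)

    def value(coalition: set) -> float:
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi: Dict[str, float] = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_risk(f: Dict[str, float]) -> float:
    """Hypothetical linear risk score, NOT the study's model."""
    return 0.30 * f["diabetes_history"] + 0.05 * f["BG"] + 0.04 * f["HbA1c"]


patient = {"diabetes_history": 1.0, "BG": 9.0, "HbA1c": 8.0}
baseline = {"diabetes_history": 0.0, "BG": 5.5, "HbA1c": 5.5}

phi = shapley_values(toy_risk, patient, baseline)
print(phi)
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that lets a SHAP plot decompose an individual risk estimate feature by feature.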
4. Discussion and Conclusion: Advancing the SDGs through Health Technology
This study successfully developed a highly accurate and interpretable machine learning model for predicting the risk of CHD-DM2. The findings have significant implications for advancing the Sustainable Development Goals.
- Contribution to SDG 3 (Good Health and Well-being): By identifying Diabetes History, BG, and HbA1c as primary risk factors, this research provides a clear focus for clinical monitoring and intervention. The developed RF model serves as a powerful, low-cost tool for early risk stratification, enabling proactive management that can reduce premature mortality from NCDs.
- Contribution to SDG 9 (Industry, Innovation, and Infrastructure): The application of advanced machine learning algorithms (RF, XGBoost) and interpretability techniques (SHAP) showcases the successful use of technological innovation to solve a pressing global health problem. This work contributes to the body of knowledge on AI in medicine and promotes the development of intelligent health infrastructure.
- Contribution to SDG 10 (Reduced Inequalities): An accessible, data-driven predictive tool can help overcome the economic and logistical barriers associated with traditional diagnostic methods, offering the potential to reduce health disparities and ensure more equitable access to preventative care.
Recommendations
It is recommended that healthcare institutions enhance the monitoring of patients with established CHD for risk factors associated with T2DM, particularly blood glucose and HbA1c levels. The integration of such validated, interpretable machine learning models into clinical workflows should be explored to support early diagnosis and the implementation of targeted intervention strategies, thereby accelerating progress toward achieving global health and well-being for all.
Analysis of Sustainable Development Goals in the Article
1. Which SDGs are addressed or connected to the issues highlighted in the article?
The primary SDG addressed in the article is:
- SDG 3: Good Health and Well-being. The article focuses on developing predictive models for Coronary Heart Disease (CHD) and Type 2 Diabetes Mellitus (T2DM), which are major non-communicable diseases (NCDs). The core aim is to improve clinical decision-making, enable early detection, and reduce mortality, which is central to ensuring healthy lives and promoting well-being for all at all ages.
2. What specific targets under those SDGs can be identified based on the article’s content?
Based on the article’s focus, the following specific targets under SDG 3 are relevant:
- Target 3.4: By 2030, reduce by one-third premature mortality from non-communicable diseases through prevention and treatment and promote mental health and well-being.
- Explanation: The article directly addresses two major NCDs, CHD and T2DM. It states that T2DM is “associated with patient mortality” and that developing “effective non-invasive diagnostic tools is crucial for the early detection of CHD-DM2 and may significantly reduce patient mortality.” The study’s objective to create a risk prediction model for early intervention is a clear strategy for prevention and treatment aimed at reducing mortality from these NCDs.
- Target 3.d: Strengthen the capacity of all countries, in particular developing countries, for early warning, risk reduction and management of national and global health risks.
- Explanation: The research employs advanced machine learning and AI technologies (“XGBoost, Random Forest (RF), LightGBM, Support Vector Machine (SVM)”) to create a “clinical risk prediction model for CHD-DM2.” This represents an effort to strengthen the capacity for health risk management. The article highlights the need for “low-cost, convenient, and effective non-invasive diagnostic tools,” and the development of such a model provides a sophisticated method for early warning and risk reduction, particularly in the context of the growing prevalence of NCDs in China, as mentioned in the introduction.
3. Are there any indicators mentioned or implied in the article that can be used to measure progress towards the identified targets?
Yes, the article mentions and implies several indicators that can measure progress:
- For Target 3.4 (related to Indicator 3.4.1: Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease):
- Implied Indicators: The article’s entire premise is built on the high prevalence and mortality risk of CHD and T2DM. It identifies specific clinical markers and risk factors that contribute to this mortality. The key predictors identified, “Diabetes.History, blood glucose (BG), and HbA1c,” serve as measurable indicators for managing the risk of these NCDs. Progress can be measured by monitoring these indicators in at-risk populations and implementing the “targeted intervention strategies” recommended by the study to ultimately reduce mortality rates. The study’s dataset itself, with “10,257 cases of CHD and 2143 cases of CHD-DM2,” quantifies the disease burden being addressed.
- For Target 3.d (related to Indicator 3.d.1: International Health Regulations (IHR) capacity and health emergency preparedness):
- Mentioned Indicators: The development and validation of the machine learning models are direct indicators of enhanced capacity. The article provides specific performance metrics for these models, such as “accuracy, sensitivity, specificity, AUC, ROC and DCA,” which quantify the effectiveness of this new risk management tool. The use of “SHAP values” to create an “interpretable model” is another indicator of a strengthened, more sophisticated approach to managing health risks, as it allows clinicians to understand and trust the AI’s predictions, facilitating its adoption in clinical practice.
4. Table of SDGs, Targets, and Indicators
SDGs | Targets | Indicators |
---|---|---|
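SDG 3: Good Health and Well-being | 3.4: Reduce by one-third premature mortality from non-communicable diseases (NCDs) through prevention and treatment. | 3.4.1: Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease. |
SDG 3: Good Health and Well-being | 3.d: Strengthen the capacity for early warning, risk reduction, and management of health risks. | 3.d.1: International Health Regulations (IHR) capacity and health emergency preparedness. |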
Source: nature.com