Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis – Nature
Report on Accurate and Interpretable Prediction of Chemical Oxygen Demand (COD) Using Explainable Boosting Algorithms with SHAP Analysis
Introduction
The degradation of water quality is a critical global issue impacting ecosystems, public health, and economic stability, aligning with the United Nations Sustainable Development Goals (SDGs), particularly SDG 6 (Clean Water and Sanitation) and SDG 15 (Life on Land). Chemical Oxygen Demand (COD) serves as a fundamental indicator of water pollution, reflecting the oxygen required to chemically oxidize organic and inorganic matter in water bodies.
Accurate forecasting of COD is essential for sustainable water quality management and pollution mitigation. Traditional models face challenges due to the complex interplay of chemical, physical, and hydrological processes influencing COD variability. Recent advances in machine learning (ML) and deep learning (DL) offer promising alternatives by capturing nonlinear relationships without explicit physical formulations, supporting SDG 9 (Industry, Innovation, and Infrastructure) through technological innovation.
Objectives
- To evaluate six ensemble boosting models—AdaBoost, CatBoost, XGBoost, LightGBM, HistGBRT, and NGBoost—for predicting COD from multiple water quality parameters.
- To enhance model interpretability using SHapley Additive exPlanations (SHAP) to identify key drivers of COD dynamics.
- To provide a robust, interpretable modeling framework supporting sustainable water quality management aligned with SDG 6.
Materials and Methods
Study Area and Data
The study was conducted at two monitoring stations in South Korea: Toilchun and Hwangji, located upstream of the Yeongju Dam. Conditions at these stations influence eutrophication processes within the dam reservoir, making COD prediction vital for assessing water quality and supporting SDG 6.
Long-term datasets comprising water quality and discharge parameters were used, including potential of hydrogen (pH), dissolved oxygen (DO), biochemical oxygen demand (BOD₅), suspended solids (SS), total phosphorus (TP), total nitrogen (TN), total organic carbon (TOC), electrical conductivity (SC), water temperature (Tw), and station discharge (DIS).
Input Combinations
- Nine input combinations of varying complexity were constructed to evaluate model performance.
- TOC and SC served as the base inputs from which the combinations were built, reflecting their importance in water quality dynamics.
Model Evaluation Metrics
Model performance was assessed using the following criteria:
- Root-Mean-Square Error (RMSE)
- Mean Absolute Error (MAE)
- Nash–Sutcliffe Efficiency (NSE)
- Correlation Coefficient (R)
- Percent Bias (PBIAS)
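The five criteria above can be computed as in the following minimal sketch. The formulas are the standard textbook definitions; the source does not state its PBIAS sign convention, so the one used here (positive values indicate underestimation) is an assumption, and the sample values are made up for illustration.

```python
import math

def evaluate(obs, sim):
    """Return RMSE, MAE, NSE, R and PBIAS for paired observed/simulated values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    errors = [s - o for o, s in zip(obs, sim)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # NSE: 1 minus the squared error relative to the spread of the observations.
    ss_res = sum(e * e for e in errors)
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    nse = 1.0 - ss_res / ss_obs

    # Pearson correlation coefficient between observed and simulated values.
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    ss_sim = sum((s - mean_sim) ** 2 for s in sim)
    r = cov / math.sqrt(ss_obs * ss_sim)

    # PBIAS (assumed convention): positive means the model underestimates.
    pbias = 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

    return {"RMSE": rmse, "MAE": mae, "NSE": nse, "R": r, "PBIAS": pbias}

# Made-up COD values (mg/L), for illustration only.
observed = [4.0, 6.0, 8.0, 10.0]
simulated = [5.0, 6.0, 7.0, 10.0]
metrics = evaluate(observed, simulated)
```

RMSE and MAE are in the units of COD, while NSE, R, and PBIAS are dimensionless, which is why the set is usually read together rather than ranked on a single score.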
Machine Learning Models
AdaBoost (Adaptive Boosting)
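AdaBoost combines multiple weak learners into a strong predictive model by adaptively increasing the weight of poorly predicted samples, so that later learners concentrate on the cases earlier ones handled worst. In a regression task such as COD prediction, "poorly predicted" means samples with large errors rather than misclassified ones.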
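The adaptive weighting can be illustrated with one reweighting round in the style of AdaBoost.R2, a common regression variant. This is a sketch under assumptions: the source does not say which AdaBoost variant or loss was used, the linear loss is one of several options, and the values are invented.

```python
def reweight(weights, y_true, y_pred):
    """One AdaBoost.R2-style weight update with a linear loss.

    Assumes at least one nonzero error and a weighted average loss below 0.5.
    """
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_err = max(abs_err)
    loss = [e / max_err for e in abs_err]            # scaled to [0, 1]
    avg_loss = sum(w * l for w, l in zip(weights, loss))
    beta = avg_loss / (1.0 - avg_loss)               # below 1 when avg_loss < 0.5
    # Well-predicted samples (small loss) are shrunk the most.
    updated = [w * beta ** (1.0 - l) for w, l in zip(weights, loss)]
    total = sum(updated)
    return [w / total for w in updated], beta

# Made-up COD observations and predictions from one weak learner (mg/L).
y_true = [4.0, 6.0, 8.0, 10.0]
y_pred = [4.2, 6.0, 7.0, 10.1]
weights = [0.25, 0.25, 0.25, 0.25]
new_weights, beta = reweight(weights, y_true, y_pred)
```

After the update, the third sample (largest error) carries the most weight and the second (zero error) the least, so the next weak learner is pulled toward the hardest case.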
CatBoost (Categorical Boosting)
CatBoost handles categorical features effectively using ordered boosting and target-based encoding, improving generalization and reducing overfitting risks.
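The idea behind ordered, target-based encoding can be sketched as follows: each sample's category is encoded using only the targets of samples that came before it, which avoids leaking a sample's own target into its feature. This is a simplified illustration, not CatBoost's implementation. The station label as a categorical input, the `strength` smoothing parameter, and the data are hypothetical, and CatBoost averages over random permutations rather than using the given row order.

```python
def ordered_target_encoding(categories, targets, prior, strength=1.0):
    """Encode each category from the targets of earlier rows only."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of earlier targets; falls back to the prior when c == 0.
        encoded.append((s + strength * prior) / (c + strength))
        # Only now does the current row's target become visible to later rows.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

# Hypothetical station labels and made-up COD values (mg/L).
stations = ["A", "B", "A", "A", "B"]
cod = [4.0, 8.0, 6.0, 5.0, 10.0]
prior = sum(cod) / len(cod)
encoded = ordered_target_encoding(stations, cod, prior)
```

The first occurrence of each station receives the prior alone, and later occurrences drift toward that station's running mean.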
HistGBRT (Histogram Gradient Boosting)
HistGBRT accelerates training by discretizing continuous features into histograms, reducing computational complexity while maintaining accuracy.
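A minimal sketch of the discretization step is shown below, assuming rank-based (quantile-like) bin edges. The function names and the TOC readings are made up, and production implementations typically use many more bins than the four used here.

```python
from bisect import bisect_right

def make_bin_edges(values, n_bins):
    """Pick up to n_bins - 1 thresholds at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[k * len(ordered) // n_bins]
        if not edges or edge > edges[-1]:   # skip duplicate thresholds
            edges.append(edge)
    return edges

def to_bins(values, edges):
    """Replace each value with the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

# Made-up TOC readings (mg/L), for illustration only.
toc = [1.2, 3.4, 2.2, 5.9, 4.1, 2.8, 3.9, 1.7]
edges = make_bin_edges(toc, n_bins=4)
binned = to_bins(toc, edges)
```

Once features are binned, a tree only has to test one split candidate per bin edge instead of one per distinct value, which is where the reduction in computational cost comes from.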
LightGBM (Light Gradient Boosting Machine)
LightGBM introduces Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to improve computational efficiency and accuracy, handling categorical variables natively.
NGBoost (Natural Gradient Boosting)
NGBoost provides probabilistic predictions by modeling the entire conditional distribution of COD, enabling uncertainty quantification and supporting risk-informed decision-making aligned with SDG 13 (Climate Action).
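To make the uncertainty quantification concrete, the sketch below turns a distributional forecast into a prediction interval and checks how often observations fall inside it. It assumes a Normal predictive distribution per sample, which the source does not confirm; the forecast parameters and observations are hypothetical, and no NGBoost code is involved.

```python
from statistics import NormalDist

def interval(mean, sd, level=0.95):
    """Central prediction interval of a Normal(mean, sd) forecast."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

def coverage(forecasts, observed, level=0.95):
    """Fraction of observations that fall inside their prediction interval."""
    hits = 0
    for (mean, sd), y in zip(forecasts, observed):
        low, high = interval(mean, sd, level)
        hits += low <= y <= high
    return hits / len(observed)

# Hypothetical per-sample forecasts (mean, sd) and observations, in mg/L.
forecasts = [(5.0, 0.5), (7.0, 1.0), (9.0, 0.8), (6.0, 0.4)]
observed = [5.3, 6.1, 11.5, 6.2]
low, high = interval(*forecasts[0])
cov95 = coverage(forecasts, observed)
```

On a real validation set, empirical coverage close to the nominal level suggests the predicted spread is well calibrated; coverage far below it suggests overconfident forecasts.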
XGBoost (Extreme Gradient Boosting)
XGBoost constructs an ensemble of decision trees focusing on correcting residual errors iteratively, achieving high flexibility and robustness in regression tasks.
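The iterative correction of residuals can be sketched with plain gradient boosting on squared error, using one-split trees (stumps) on a single input. This is a simplification for illustration: XGBoost additionally uses a regularized, second-order objective and deeper trees, and the data, learning rate, and round count here are made up.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # exclude max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Repeatedly fit a stump to the current residuals and add it to the ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        pred = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                for xi, p in zip(x, pred)]
    return base, stumps, pred

def mse(y, pred):
    """Mean squared error."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# Made-up TOC (input) and COD (target) values, for illustration only.
toc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cod = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
base, stumps, fitted = boost(toc, cod)
```

Each round can only lower the training error, so a long run fits the training data very closely; whether that carries over to unseen data has to be checked on a validation set.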
Results and Discussion
Mathematical Analysis
- NGBoost and CatBoost demonstrated superior predictive accuracy and stability, particularly in validation datasets at both stations.
- XGBoost showed near-perfect training performance but signs of overfitting, highlighting the importance of model generalization.
- Models using comprehensive input variables (SS, TN, TOC, SC, BOD₅) achieved better performance, emphasizing the complexity of COD dynamics.
Visualization Analysis
- Scatter plots, boxplots, violin plots, Taylor diagrams, Circos, and Chord diagrams confirmed the quantitative findings, with CatBoost and NGBoost showing closer agreement with observed COD values.
- Systematic underprediction of minimum COD values was observed, indicating model bias towards average pollution levels.
- Differences in model performance between stations reflect local hydro-environmental variability, underscoring the need for site-specific management strategies.
Interpretability with SHAP Analysis
- SHAP identified Total Organic Carbon (TOC), Biochemical Oxygen Demand (BOD₅), and Suspended Solids (SS) as the most influential variables controlling COD dynamics, consistent with biochemical and hydrological processes.
- At Toilchun, Total Phosphorus (TP) and station discharge (DIS) also significantly influenced COD, indicating non-point source pollution impacts.
- SHAP provides transparent insights into model decisions, enhancing trust and supporting SDG 6 by enabling informed water quality management.
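The attribution idea behind SHAP can be illustrated by computing exact Shapley values for a tiny model by brute force. This is a sketch under assumptions: the toy model, its coefficients, and the baseline are invented, and the SHAP library uses far more efficient algorithms for tree ensembles than the enumeration shown here.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are set to their baseline value.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical toy model of COD from [TOC, BOD5, SS]; coefficients are invented.
def toy_model(features):
    toc, bod, ss = features
    return 1.0 + 1.5 * toc + 0.8 * bod + 0.1 * ss + 0.05 * toc * bod

sample = [4.0, 2.0, 10.0]
baseline = [2.0, 1.0, 5.0]   # e.g. feature means of a reference dataset
phi = shapley_values(toy_model, sample, baseline)
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the property that lets each value be read as that feature's share of the prediction.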
Implications for Sustainable Development Goals (SDGs)
- SDG 6 (Clean Water and Sanitation): The study advances water quality monitoring and pollution control by providing accurate, interpretable COD predictions, essential for safeguarding freshwater resources.
- SDG 9 (Industry, Innovation, and Infrastructure): The application of advanced machine learning models promotes innovation in environmental monitoring technologies.
- SDG 13 (Climate Action): NGBoost’s probabilistic framework supports uncertainty quantification, aiding adaptive management under climate variability.
- SDG 15 (Life on Land): Improved water quality assessment contributes to the protection of aquatic ecosystems and biodiversity.
Conclusion and Future Research
- NGBoost and CatBoost are recommended for COD prediction due to their balance of accuracy, robustness, and interpretability.
- SHAP analysis confirms the critical role of organic carbon and related parameters in influencing COD, providing actionable insights for water quality management.
- Future research should focus on:
  - Explicit uncertainty quantification and validation of predictive intervals to enhance risk-informed decision-making.
  - Cross-site and cross-basin validation to improve model transferability and support broader applications.
  - Real-time applicability assessment considering sensor data availability and quality.
  - Incorporation of additional water quality parameters and alternative ensemble strategies to further improve predictive performance.
- The study supports sustainable water management aligned with SDG 6 by providing a transparent and effective modeling framework for monitoring and controlling water pollution.
1. Sustainable Development Goals (SDGs) Addressed or Connected
- SDG 6: Clean Water and Sanitation
  - The article focuses on predicting Chemical Oxygen Demand (COD), a key indicator of water pollution, which is crucial for effective water quality management and pollution control.
  - The study supports sustainable management of water resources by improving prediction accuracy and interpretability of water quality models.
- SDG 3: Good Health and Well-being
  - By addressing water quality and pollution control, the study indirectly contributes to reducing waterborne diseases and promoting public health.
- SDG 9: Industry, Innovation and Infrastructure
  - The use of advanced machine learning models (boosting algorithms) and explainable AI techniques (SHAP) represents innovation in environmental monitoring infrastructure.
- SDG 13: Climate Action
  - Improved water quality management can contribute to ecosystem resilience and adaptation to climate variability.
2. Specific Targets Under the Identified SDGs
- SDG 6: Clean Water and Sanitation
  - Target 6.3: Improve water quality by reducing pollution, minimizing release of hazardous chemicals and materials, and substantially increasing water recycling and safe reuse.
  - Target 6.5: Implement integrated water resources management at all levels, including transboundary cooperation as appropriate.
  - Target 6.a: Expand international cooperation and capacity-building support to developing countries in water- and sanitation-related activities and programmes.
- SDG 3: Good Health and Well-being
  - Target 3.9: Reduce the number of deaths and illnesses from hazardous chemicals and air, water, and soil pollution and contamination.
- SDG 9: Industry, Innovation and Infrastructure
  - Target 9.5: Enhance scientific research, upgrade the technological capabilities of industrial sectors, including encouraging innovation and increasing the number of research and development workers.
- SDG 13: Climate Action
  - Target 13.1: Strengthen resilience and adaptive capacity to climate-related hazards and natural disasters in all countries.
3. Indicators Mentioned or Implied to Measure Progress
- Indicators Related to Water Quality (SDG 6)
  - Chemical Oxygen Demand (COD) levels as a measure of organic and inorganic pollution in water bodies.
  - Biochemical Oxygen Demand (BOD₅), Total Organic Carbon (TOC), Suspended Solids (SS), Total Phosphorus (TP), Total Nitrogen (TN), pH, Dissolved Oxygen (DO), Electrical Conductivity (SC), Water Temperature (Tw), and Station Discharge (DIS) as water quality parameters influencing COD.
  - Statistical performance indicators for model accuracy: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Nash–Sutcliffe Efficiency (NSE), Correlation Coefficient (R), and Percent Bias (PBIAS) used to evaluate prediction accuracy of COD.
- Indicators Related to Innovation and Model Interpretability (SDG 9)
  - Use of SHapley Additive exPlanations (SHAP) values to interpret feature importance and model decisions.
  - Probabilistic prediction and uncertainty quantification via NGBoost model to support risk-informed decision-making.
- Indicators Related to Health and Environmental Impact (SDG 3)
  - Reduction in COD and related water pollutants as an implied indicator for improved water safety and reduced health risks.
4. Table of SDGs, Targets, and Indicators
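| SDGs | Targets | Indicators |
|---|---|---|
| SDG 6: Clean Water and Sanitation | 6.3: Improve water quality by reducing pollution; 6.5: Implement integrated water resources management; 6.a: Expand international cooperation and capacity-building support | COD levels; water quality parameters influencing COD (BOD₅, TOC, SS, TP, TN, pH, DO, SC, Tw, DIS); model accuracy metrics (RMSE, MAE, NSE, R, PBIAS) |
| SDG 3: Good Health and Well-being | 3.9: Reduce deaths and illnesses from hazardous chemicals and air, water, and soil pollution | Reduction in COD and related water pollutants (implied) |
| SDG 9: Industry, Innovation and Infrastructure | 9.5: Enhance scientific research and upgrade technological capabilities | SHAP values for feature importance and model decisions; probabilistic prediction and uncertainty quantification via NGBoost |
| SDG 13: Climate Action | 13.1: Strengthen resilience and adaptive capacity to climate-related hazards and natural disasters | |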
Source: nature.com
