Unveiling social determinants of health impact on adverse pregnancy outcomes through natural language processing – Nature

Report on the Impact of Social Determinants of Health on Adverse Pregnancy Outcomes Using Natural Language Processing
Executive Summary
This report details a study on the extraction of Social Determinants of Health (SDoH) from unstructured electronic health records (EHRs) and their association with pregnancy outcomes, directly addressing the aims of Sustainable Development Goal 3 (SDG 3) for Good Health and Well-being. Natural Language Processing (NLP) models were trained on 86 clinical notes from the MIMIC-III database and externally validated on 171 notes from MIMIC-IV to assess generalizability. The study focused on three key SDoH domains: social support, occupation, and substance use. Results indicated that different NLP models performed best for each domain: a ClinicalBERT model for social support (F1-score: 0.92), a rule-based keyword processor for occupation (F1-score: 0.74), and a Word2Vec model for substance use (F1-score: 0.83). Crucially, regression analysis revealed that substance use was significantly associated with an increased risk of pregnancy complications (OR 6.47), while social support was associated with a significantly reduced risk (OR 0.07). This research demonstrates the viability of using NLP to systematically identify SDoH, providing actionable insights that can help healthcare systems develop targeted interventions to improve maternal health and advance global health equity as outlined in the SDGs.
1. Introduction: Aligning Maternal Health with Sustainable Development Goals
Adverse pregnancy outcomes, including preterm birth and maternal complications, constitute a major public health issue, undermining progress toward Sustainable Development Goal 3 (SDG 3), which aims to ensure healthy lives and promote well-being for all at all ages, with a specific target to reduce the global maternal mortality ratio. The conditions in which individuals are born, live, and work—known as Social Determinants of Health (SDoH)—are recognized by the World Health Organization as fundamental drivers of health equity and are critical to understanding and mitigating these adverse outcomes.
However, vital SDoH information is often embedded within unstructured free-text clinical notes in EHRs, making it difficult to access and analyze systematically. Manual extraction is resource-prohibitive. This study leverages Natural Language Processing (NLP) to automate the extraction of SDoH, bridging a critical data gap and enabling a more holistic approach to maternal healthcare that aligns with the principles of the SDGs.
1.1. Study Objectives
This study addresses these challenges through three primary objectives:
- To develop and compare various NLP methodologies (rule-based, word embedding, and contextual language models) for extracting SDoH information related to social support, occupation, and substance use from clinical notes.
- To assess the cross-dataset generalizability of the developed models by performing an external evaluation on a separate, temporally distinct dataset (MIMIC-IV).
- To quantify the association between the NLP-extracted SDoH factors and the incidence of pregnancy complications, thereby demonstrating their clinical relevance for risk stratification and contributing to the evidence base needed to achieve SDG 3.
2. Methodology
2.1. Data Sources and Cohort Selection
The study utilized discharge summaries from two public critical care databases:
- MIMIC-III: Used for model development and internal testing. A cohort of 86 notes from female patients with pregnancy-related ICD-9 codes was selected.
- MIMIC-IV: Used for external evaluation to test model generalizability. An independent cohort of 171 notes was selected using identical criteria.
Inclusion was limited to notes containing a “Social History” section to ensure the potential presence of SDoH information.
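The "Social History" filter described above can be illustrated with a short sketch. It assumes a note layout in which each section starts with a header line ending in a colon; the regular expression and the helper name `extract_social_history` are hypothetical and are not taken from the study. One known limitation of this sketch: a content line that itself looks like a header (for example "Tobacco: denies") would end the section early.

```python
import re

# Assumed layout: "Social History:" followed by free text that runs until the
# next line that looks like a header ("Some Header:") or the end of the note.
SECTION_RE = re.compile(
    r"social history:[ \t]*(.*?)(?=\n[A-Za-z][A-Za-z /]*:|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def extract_social_history(note):
    """Return the Social History text, or None if the section is missing or empty."""
    match = SECTION_RE.search(note)
    if match is None:
        return None
    section = match.group(1).strip()
    return section or None

# Made-up note text, not taken from MIMIC.
example_note = (
    "Past Medical History:\nAsthma\n\n"
    "Social History:\nLives with husband. Denies tobacco.\n\n"
    "Family History:\nNon-contributory\n"
)
social_history = extract_social_history(example_note)
```

Notes for which the helper returns None would be excluded from the cohort under this reading of the inclusion rule.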
2.2. SDoH Factor Selection and Annotation
Three SDoH factors were chosen based on their documented impact on maternal health and their prevalence in clinical notes:
- Social Support: Labeled as present (1) if the note mentioned co-habitation or familial support; otherwise absent (0).
- Occupation: Labeled as present (1) for explicit mentions of employment; otherwise absent (0).
- Substance Use: Labeled as present (1) for any mention of current or past use of tobacco, alcohol, or illicit drugs; otherwise absent (0).
A multi-annotator protocol with consensus-based resolution was used for the MIMIC-IV dataset to ensure high-quality labels, achieving moderate to near-perfect inter-annotator agreement (Cohen’s kappa up to 0.91).
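For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is illustrative only: the label lists are made up rather than study data, and `cohen_kappa` is a hypothetical helper name.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up binary labels (1 = factor present, 0 = absent) for ten notes.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_2 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators disagree on one of ten notes, so observed agreement is 0.90 and chance agreement is 0.54.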
2.3. NLP Model Development and Evaluation
After standard text preprocessing (tokenization, stopword removal, negation handling), three distinct NLP approaches were developed and evaluated for each SDoH category:
- Rule-Based Approach: Utilized a keyword processor to identify predefined terms and phrases.
- Word2Vec Approach: Employed pre-trained word embeddings to capture semantic relationships, feeding these features into machine learning classifiers (Random Forest, Support Vector Classifier, Decision Tree).
- ClinicalBERT Approach: Leveraged a powerful transformer-based model pre-trained on clinical text to generate contextual embeddings, which were then used with the same set of classifiers.
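To make the simplest of the three approaches concrete, the following is a minimal sketch of a rule-based matcher with basic negation handling. The keyword list, the negation cues, and the three-word window are assumptions chosen for illustration; the study's actual term lists and keyword-processor implementation are not reproduced here.

```python
import re

# Hypothetical keyword list; the study's actual terms are not given here.
OCCUPATION_TERMS = ["works as", "employed", "job", "occupation"]
# Hypothetical negation cues, checked in a short window before each match.
NEGATION_CUES = {"not", "no", "denies", "never"}

def has_occupation_mention(note, window=3):
    """Return 1 if a non-negated occupation keyword occurs in the note, else 0."""
    # Work sentence by sentence so a negation cannot leak across sentences.
    for sentence in re.split(r"[.;\n]", note.lower()):
        for term in OCCUPATION_TERMS:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence):
                # Look at the few words immediately before the keyword.
                preceding = re.findall(r"[a-z']+", sentence[:match.start()])
                if not NEGATION_CUES.intersection(preceding[-window:]):
                    return 1
    return 0
```

For example, `has_occupation_mention("She works as a nurse.")` yields 1, while "Patient is not employed." yields 0 because the keyword is negated. The word boundaries also keep "unemployed" from matching "employed".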
Models were trained on 60% of the MIMIC-III data and internally tested on the remaining 40%. The best-performing model for each SDoH category was then externally evaluated on the entire MIMIC-IV dataset. Performance was measured using accuracy, precision, recall, and F1-score.
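The split and the four reported metrics can be sketched as follows. This is an illustration under stated assumptions: the report does not say whether the split was random or stratified, so a seeded random shuffle is assumed, and the helper names, the seed, and the example labels are all hypothetical.

```python
import random

def split_60_40(items, seed=0):
    """Shuffle reproducibly, then cut at 60% (rounded) for train vs. test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = present)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one labeled example")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# 86 placeholder note IDs, matching the size of the MIMIC-III cohort.
train_ids, test_ids = split_60_40(range(86))

# Made-up gold labels and predictions for eight notes.
scores = binary_metrics([1, 1, 1, 0, 0, 1, 0, 1], [1, 0, 1, 0, 1, 1, 0, 1])
```

With rounding, 86 notes split into 52 for training and 34 for testing in this sketch; the study's exact counts are not stated in this report.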
2.4. Statistical Analysis
Logistic regression and chi-square tests were performed on the MIMIC-IV cohort to analyze the association between the presence of NLP-extracted SDoH factors (social support, occupation, substance use) and the binary outcome of pregnancy complications.
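For a single binary factor, the odds ratio from a logistic regression with no other covariates equals the cross-product ratio of the 2x2 table, so the idea can be sketched without a statistics library. The counts below are invented for illustration and are not study data; the function names are hypothetical, the chi-square statistic is the uncorrected Pearson form, and the study's own estimates may come from a model with several predictors.

```python
import math

def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio for a 2x2 table.

    a = factor present, complication;    b = factor present, no complication
    c = factor absent, complication;     d = factor absent, no complication
    """
    return (a * d) / (b * c)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and p-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Invented counts for illustration only.
example_or = odds_ratio(30, 10, 20, 40)
example_stat, example_p = chi_square_2x2(30, 10, 20, 40)
```

In this toy table the odds of a complication are six times higher when the factor is present, and the chi-square test gives a p-value well below 0.001.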
3. Results
3.1. NLP Model Performance
The models demonstrated strong performance, which generalized well to the external evaluation dataset. The optimal model varied by the linguistic complexity of the SDoH factor.
3.1.1. Internal and External Evaluation
On the external MIMIC-IV dataset, the best-performing model for each domain achieved an F1-score between 0.74 and 0.92:
- Social Support: The ClinicalBERT model with a Decision Tree classifier performed best, achieving an F1-score of 0.92. This indicates its strength in capturing nuanced, context-dependent language (e.g., “lacks social support,” “lives in a shelter”).
- Occupation: The Rule-Based Keyword Processor was most effective, with an F1-score of 0.74. This approach excelled at identifying explicit employment terms.
- Substance Use: The Word2Vec model with a Random Forest classifier was optimal, achieving an F1-score of 0.83 by leveraging semantic patterns in substance-related terminology.
3.2. Association Between SDoH and Pregnancy Complications
The analysis revealed statistically significant relationships between SDoH and maternal health outcomes, underscoring their critical role in achieving SDG 3.
- Substance Use: A documented history of substance use was associated with a more than six-fold increase in the odds of pregnancy complications (OR = 6.47; statistically significant).
- Social Support: The presence of social support was strongly protective, associated with a 93% reduction in the odds of complications (OR = 0.07; statistically significant).
- Occupation: No significant association was found between occupation status and pregnancy complications in this cohort (p = 0.49).
These findings highlight substance use as a major risk factor and social support as a key protective factor in maternal health.
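The percentage figures quoted for the odds ratios follow from a simple conversion: the change in odds is (OR - 1) x 100%. The short sketch below applies it to the two reported values; the function name is hypothetical.

```python
def percent_change_in_odds(odds_ratio):
    """Convert an odds ratio into a percent change in the odds."""
    return (odds_ratio - 1.0) * 100.0

# Reported odds ratios from the regression analysis.
social_support_change = percent_change_in_odds(0.07)  # about -93: a 93% reduction
substance_use_change = percent_change_in_odds(6.47)   # about +547: 6.47 times the odds
```

Note that these are changes in odds, not in probability; the two differ noticeably when the outcome is common.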
4. Discussion and Conclusion: Advancing SDGs Through Health Informatics
This study successfully demonstrates that NLP is a powerful tool for unlocking critical SDoH information from unstructured clinical text, providing data essential for advancing SDG 3 (Good Health and Well-being) and SDG 10 (Reduced Inequalities).
The strong association between SDoH and pregnancy complications supports the view that clinical care must extend beyond biomedical factors. The protective association of social support and the increased risk associated with substance use provide clear targets for intervention. By automating the identification of these factors, healthcare systems can proactively connect at-risk individuals with necessary resources, such as counseling, addiction services, or community support networks. This targeted approach is fundamental to reducing maternal morbidity and mortality.
The finding that different NLP models excel for different SDoH categories is a key insight. While complex models like ClinicalBERT are necessary for nuanced concepts like social support, simpler, more interpretable rule-based methods are sufficient for more explicitly documented factors like occupation. This tailored approach allows for the efficient and effective deployment of technology in clinical settings.
The study’s limitations, including a small sample size and reliance on retrospective data, highlight the need for future research. However, the robust, generalizable performance of the models provides a strong foundation.
4.1. Conclusion
In conclusion, this report establishes a scalable framework for transforming unstructured clinical text into actionable insights for maternal health. By harnessing NLP to systematically identify SDoH, this work provides a direct pathway to enhance clinical risk stratification, enable timely interventions, and promote health equity. Such advancements are crucial for moving beyond reactive healthcare and building proactive, data-driven systems capable of achieving the ambitious vision of the Sustainable Development Goals for a healthier future for all.
Analysis of Sustainable Development Goals (SDGs) in the Article
1. Which SDGs are addressed or connected to the issues highlighted in the article?
- SDG 3: Good Health and Well-being
This is the most central SDG to the article. The study’s primary goal is to improve maternal and infant health by understanding the impact of Social Determinants of Health (SDoH) on adverse pregnancy outcomes. The article explicitly mentions its alignment with SDG 3 in the discussion section: “In line with Sustainable Development Goal 3 (SDG 3), these findings illustrate how NLP methods can help identify modifiable social risk factors.” It directly addresses issues like “perinatal complications and mortality,” “maternal and infant health,” “preterm birth, low birth weight, and maternal complications such as preeclampsia,” and the harmful effects of “substance use.”
- SDG 10: Reduced Inequalities
The article’s focus on Social Determinants of Health (SDoH) is fundamentally about understanding the drivers of health inequity. The introduction states that SDoH are “recognized by the World Health Organization as critical drivers of health equity.” The study analyzes how social factors like support systems, occupation, and substance use lead to different health outcomes, thereby highlighting inequalities. The demographic analysis in the “Population characteristics” section, which compares insurance coverage and ethnic representation, further underscores the theme of inequality across different patient populations.
- SDG 8: Decent Work and Economic Growth
The article identifies “occupation” as a key SDoH and develops an NLP model to extract this information from clinical notes. It investigates how employment status (“employed”, “retired”, “student”, “jobless”) relates to pregnancy outcomes. The discussion mentions that “occupation, while seemingly straightforward, presents its own set of challenges in clinical text analysis,” and the methods section notes that “occupational factors—including employment, unemployment, and work-related stress—have significant effects on pregnancy outcomes.” This directly connects the research to the economic well-being and working conditions of individuals.
- SDG 5: Gender Equality
While not explicitly named, the article’s focus on improving maternal health is a cornerstone of gender equality. By investigating risks specific to pregnancy and seeking to “mitigate adverse outcomes” for female patients, the research contributes to the health, well-being, and empowerment of women. Addressing adverse pregnancy outcomes is crucial for ensuring women can lead healthy and productive lives.
2. What specific targets under those SDGs can be identified based on the article’s content?
- SDG 3: Good Health and Well-being
  - Target 3.1: “By 2030, reduce the global maternal mortality ratio…” The article’s focus on understanding and mitigating “adverse pregnancy outcomes,” “maternal complications,” and “maternal… mortality” directly supports efforts to reduce maternal deaths.
  - Target 3.2: “By 2030, end preventable deaths of newborns and children under 5 years of age…” The research addresses “infant mortality rates,” “preterm birth,” and “low birth weight,” which are leading causes of newborn deaths.
  - Target 3.5: “Strengthen the prevention and treatment of substance abuse, including narcotic drug abuse and harmful use of alcohol.” The study dedicates a significant portion of its analysis to extracting data on “substance use” (tobacco, alcohol, drugs) and finds a strong correlation with pregnancy complications (OR = 6.47), highlighting the need for prevention and treatment in prenatal care.
- SDG 10: Reduced Inequalities
  - Target 10.2: “By 2030, empower and promote the social, economic and political inclusion of all, irrespective of… race, ethnicity… or other status.” By identifying how SDoH (like social support, occupation, ethnicity, and insurance status) create health disparities, the study provides a basis for creating more inclusive healthcare strategies that address the needs of vulnerable populations.
  - Target 10.3: “Ensure equal opportunity and reduce inequalities of outcome…” The research quantifies “inequalities of outcome” by showing that factors like substance use and lack of social support significantly increase the odds of pregnancy complications. This evidence is a critical first step toward designing interventions to ensure more equitable health outcomes.
- SDG 8: Decent Work and Economic Growth
  - Target 8.5: “By 2030, achieve full and productive employment and decent work for all women and men…” The article’s analysis of “occupation” as an SDoH contributes to understanding the link between employment status and health, which is a key component of this target. It notes that “unemployment” is a factor considered in the analysis.
3. Are there any indicators mentioned or implied in the article that can be used to measure progress towards the identified targets?
- Health Outcome Indicators
  - Rate of pregnancy complications: The study’s primary outcome is the distinction between “complicated pregnancies” and “normal pregnancies,” which serves as a direct indicator for maternal health (Target 3.1).
  - Rates of preterm birth and low birth weight: Mentioned in the introduction as key “adverse pregnancy outcomes” and major contributors to “infant mortality rates” (Target 3.2).
- SDoH-Specific Indicators
  - Prevalence of substance use during pregnancy: The article develops and evaluates NLP models specifically to identify and extract mentions of “current/past use of alcohol, tobacco, drugs.” The regression analysis (OR = 6.47) quantifies its impact, making it a measurable indicator for Target 3.5.
  - Prevalence of social support: The study extracts information on social support (e.g., “lives with,” “strong familial support,” “homelessness,” “shelter care”) and measures its protective effect (OR = 0.07), serving as an indicator of social well-being.
  - Employment status: The model for “occupation” extracts data on whether a patient is employed, unemployed, or a student, which is a direct indicator related to Target 8.5.
- Inequality and Technology Indicators
  - Odds Ratios (OR) for SDoH: The regression analysis provides specific odds ratios (e.g., OR for substance use, OR for social support) that quantify the level of health inequality based on social factors, directly measuring “inequalities of outcome” (Target 10.3).
  - NLP Model Performance Metrics (F1-score, Accuracy): The article provides detailed metrics (e.g., F1-score of 0.92 for social support, 0.83 for substance use) which act as indicators of the feasibility and accuracy of automatically monitoring SDoH from clinical records.
  - Demographic Data Disaggregation: The data presented in Table 1 on “Insurance coverage patterns” and “Ethnic representation” are examples of disaggregated data needed to monitor inequalities (Target 10.2).
4. Table of SDGs, Targets, and Indicators
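SDGs | Targets | Indicators Identified in the Article |
---|---|---|
SDG 3: Good Health and Well-being | 3.1: Reduce maternal mortality. 3.2: End preventable deaths of newborns. | Rate of pregnancy complications (3.1). Rates of preterm birth and low birth weight (3.2). |
SDG 3: Good Health and Well-being | 3.5: Strengthen prevention and treatment of substance abuse. | Prevalence of substance use during pregnancy (OR = 6.47). |
SDG 10: Reduced Inequalities | 10.2: Promote social and economic inclusion. 10.3: Ensure equal opportunity and reduce inequalities of outcome. | Demographic data disaggregated by insurance coverage and ethnic representation (10.2). Odds ratios for SDoH factors (10.3). |
SDG 8: Decent Work and Economic Growth | 8.5: Achieve full and productive employment. | Employment status extracted from clinical notes. |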
Source: nature.com