Development and Validation of a Deep Learning Based Diabetes Prediction System Using a Nationwide Population-Based Cohort
Abstract
Background
Previously developed prediction models for type 2 diabetes mellitus (T2DM) have limited performance. We developed a deep learning (DL) based model using a cohort representative of the Korean population.
Methods
This study was conducted on the basis of the National Health Insurance Service-Health Screening (NHIS-HEALS) cohort of Korea. Overall, 335,302 subjects without T2DM at baseline were included. We developed the model using 80% of the subjects and validated its predictive power in the remaining 20%. Predictive models for T2DM were constructed using the recurrent neural network long short-term memory (RNN-LSTM) network and the Cox longitudinal summary model. The performance of both models over a 10-year period was compared using the time-dependent area under the curve.
Results
During a mean follow-up of 10.4±1.7 years, the mean frequency of periodic health check-ups was 2.9±1.0 per subject. During the observation period, T2DM was newly observed in 8.7% of the subjects. The annual performance of the model created using the RNN-LSTM network was superior to that of the Cox model. The risk factors for T2DM derived using the two models were similar, although certain results differed.
Conclusion
The DL-based T2DM prediction model, constructed using a cohort representative of the population, performs better than the conventional model. After pilot tests, this model will be provided to all Korean national health screening recipients in the future.
INTRODUCTION
The rising global prevalence of diabetes mellitus (DM) and its related complications have increased the burden on the global health care system [1]. Recent reports suggest that one in 11 adults worldwide has DM, and it is considered to be one of the major causes of reduced life expectancy [2].
However, type 2 diabetes mellitus (T2DM) is a preventable disease. Early screening and appropriate interventions may prevent the onset and progress of T2DM. Previous clinical trials have demonstrated the efficacy of preventive interventions in subjects at high-risk of T2DM [3,4]. In addition to preventing the onset of T2DM, interventions may prevent the occurrence of long-term complications [5,6]; reports suggest that this approach is cost-effective [7].
The effective prevention and management of T2DM in the population necessitates the accurate identification of subjects who may develop T2DM. Several researchers have attempted to develop models to predict the individual risk of T2DM [8-10]. However, the existing prediction models do not include various risk factors for T2DM, and their predictive power is limited [8,10-12].
Recently, with the development of artificial intelligence technology, efforts are being made to apply new techniques such as deep learning (DL) to existing disease models. DL is a class of algorithms from the field of computer science that identifies patterns in large datasets and predicts outcomes from them [13,14].
It is known that the prediction accuracy of DL in various imaging types is comparable with that of skilled experts [15,16], and applications in clinical practice are also rising. Attempts have also been made to employ DL in the prediction of various chronic diseases including T2DM [17,18]. However, to date, the proposed models have not outperformed conventional tools [19-22].
In this study, we constructed a DL-based T2DM prediction model using large scale longitudinal cohort data representative of the Korean population. We then compared this model to a conventional Cox regression based model and evaluated its performance and clinical utility.
METHODS
NHIS-Health Screening cohort
Except for 3% of ‘medical protection’ beneficiaries, 97% of the total Korean population is covered by a single health insurance system, namely, the National Health Insurance of Korea. Information on individual utilization of medical facilities, medications, and diagnostic codes, recorded according to the International Classification of Diseases, 10th revision (ICD-10), is archived in the National Health Insurance Service (NHIS) database [23]. In addition, the NHIS provides a biennial health check-up program for all beneficiaries over 40 years of age that comprises evaluation of anthropometric parameters, a self-administered questionnaire on health related behavior, past medical history, family history, and laboratory tests.
The NHIS-Health Screening (NHIS-HEALS) cohort was established by including approximately 10% of the entire population of NHIS health check-ups between 2002 and 2003 [24]. The cohort comprises 514,795 individuals, and has been systematically sampled to represent the entire Korean population. The clinical course of the included subjects will be observed for as long as follow-up is feasible, i.e., until death or emigration. It is currently possible to use the follow-up data until 2013 for research, after approval from the NHIS. All data for this cohort are provided to researchers after anonymization and de-identification.
Study subjects
From the overall cohort of 514,795 subjects, we excluded those with pre-existing type 1 DM or T2DM from the self-reported past medical history, and those with a fasting blood glucose (FBG) ≥126 mg/dL on baseline laboratory tests. We then excluded those with a diagnosis of DM (based on ICD-10 codes E10.x-E14.x, O24.x), or with prescriptions for anti-diabetic medication (oral hypoglycemic agents or insulin) in the health insurance claims database at the time of the baseline check-up. We also excluded those who died in 2002 to 2003, since serial clinical data on health check-ups or follow-ups were unavailable for determining the incidence of T2DM. Finally, 335,302 individuals were selected as candidates for the study, and only the latest health check-up data from the 2002 to 2006 period were included after the baseline date; 80% (268,241) of the subjects were randomly selected for inclusion during model development (Supplementary Fig. 1).
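The random 80/20 partition described above can be illustrated with a short sketch. The seed, the function name `split_cohort`, and the use of integer subject IDs are assumptions made for illustration only; the source does not state how the random selection was implemented.

```python
import random

def split_cohort(subject_ids, train_fraction=0.8, seed=0):
    """Randomly partition subject IDs into development and validation sets.

    The seed and the floor-based cut point are illustrative assumptions.
    """
    ids = list(subject_ids)
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)   # floor of 80% of the cohort
    return ids[:n_train], ids[n_train:]

# 335,302 eligible subjects, represented here by hypothetical integer IDs
train_ids, valid_ids = split_cohort(range(335_302))
print(len(train_ids), len(valid_ids))  # 268241 67061
```

Taking the floor of 80% of 335,302 reproduces the 268,241 development subjects reported in the text, leaving 67,061 for validation.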
Clinical variables
All procedures of the national healthcare check-up were performed by experts according to standardized protocols [24]. The anthropometric parameters of systolic blood pressure (SBP), diastolic blood pressure (DBP), and body mass index (BMI) were used in this study. Among laboratory tests, FBG, total cholesterol (TC), aspartate aminotransferase, alanine aminotransferase, gamma-glutamyl transferase, and dip-stick based proteinuria tests were used. The personal behavior, past medical history, and family history of subjects were investigated using a questionnaire. For evaluating personal behavior, smoking, alcohol, and exercise habits were investigated. These three measures were classified as “yes” or “no” for current smoking status, “drinker” or “non-drinker” for alcohol consumption, and “yes” or “no” for exercise. The past medical and family history were examined for the presence of hypertension, heart disease, stroke, and other illnesses (including malignancy). The presence or absence of a condition was determined by the availability of a diagnosis from a doctor. Details of the variables included in the analyses are presented in Table 1. Missing data were handled by multiple imputation under fully conditional specification [25], using a machine learning procedure [26].
Identification of new DM cases
The new onset of T2DM among the subjects was confirmed by their ICD-10 codes (E11.x-E14.x), prescriptions of anti-diabetic medications (oral anti-diabetic medications and/or insulin), and FBG levels. This definition is based on the consensus of relevant findings widely used in previous studies [27,28].
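A case definition of this kind might be sketched as follows. Two points are assumptions, not statements from the source: that the three criteria are combined with a logical OR, and that the FBG cutoff is the 126 mg/dL value used for the baseline exclusion. The function name and arguments are hypothetical.

```python
import re

# ICD-10 codes E11.x through E14.x, as listed in the text
T2DM_CODE = re.compile(r"^E1[1-4]")

def is_incident_t2dm(icd_codes, on_antidiabetic_drug, fbg_mg_dl, fbg_cutoff=126.0):
    """Flag new-onset T2DM from claims codes, prescriptions, and FBG.

    ASSUMPTION: any single criterion is sufficient (logical OR), and the
    FBG cutoff equals the baseline exclusion threshold of 126 mg/dL.
    """
    has_code = any(T2DM_CODE.match(code.strip().upper()) for code in icd_codes)
    high_fbg = fbg_mg_dl is not None and fbg_mg_dl >= fbg_cutoff
    return bool(has_code or on_antidiabetic_drug or high_fbg)

print(is_incident_t2dm(["I10", "E11.9"], False, 98.0))   # True (diagnosis code)
print(is_incident_t2dm(["E10.1"], False, 101.0))         # False (E10 is outside E11-E14)
```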
Construction of prediction models
Prediction models were constructed using records from baseline and follow-up visits (Supplementary Table 1). The intervals were defined as the periods from the first health check-up date to the date of diagnosis of T2DM, and to the end of the study for non-T2DM. The records used for analysis included health check-up data from the 2002 to 2006 period. For instance, if a patient diagnosed with T2DM in 2005 had two health check-ups in 2002 and 2004, the analysis used both health check-up records, and if a patient with T2DM in 2009 had a health check-up four times a year between 2002 and 2008, only the health check-up data available between 2002 and 2006 was used. Restricting the records to this window limited the differences between subjects in the amount of check-up data available for analysis.
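The record-selection rule can be sketched as a simple date filter. The window boundaries follow the 2002 to 2006 period stated above; dropping check-ups dated on or after the diagnosis date is an assumption, and `select_checkups` is a hypothetical helper.

```python
from datetime import date

WINDOW_START = date(2002, 1, 1)
WINDOW_END = date(2006, 12, 31)

def select_checkups(checkup_dates, diagnosis_date=None):
    """Keep check-ups inside the 2002-2006 window.

    ASSUMPTION: check-ups on or after the T2DM diagnosis date are dropped,
    since they cannot serve as predictors of that diagnosis.
    """
    kept = []
    for d in sorted(checkup_dates):
        if not (WINDOW_START <= d <= WINDOW_END):
            continue                      # outside the analysis window
        if diagnosis_date is not None and d >= diagnosis_date:
            continue                      # not prior to diagnosis
        kept.append(d)
    return kept

# Diagnosed in 2005 with check-ups in 2002 and 2004: both records are used
early = select_checkups([date(2002, 5, 1), date(2004, 5, 1)], date(2005, 3, 1))
# Diagnosed in 2009 with check-ups through 2008: only 2002-2006 records are used
late = select_checkups(
    [date(2002, 5, 1), date(2004, 5, 1), date(2006, 5, 1), date(2008, 5, 1)],
    date(2009, 3, 1),
)
print(len(early), len(late))  # 2 3
```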
A Cox regression model using longitudinal data, which has higher accuracy than a single-measurement method, was employed as the comparator for the DL model, which also used longitudinal data. The Cox regression model used the mean, standard deviation (SD), minimum, and maximum values for continuous variables, and the mean and SD for categorical variables; these were computed from the periodic health check-up data. The detailed methods for this Cox regression model using longitudinal data, and its improved accuracy over single measurements, have been explained previously [29].
For the DL algorithm, a recurrent neural network-long short-term memory (RNN-LSTM) network was used [30]. The variables used in the DL algorithm were the same as those used in the Cox regression model with longitudinal data. The LSTM model was designed as follows: RMSProp was used as the optimizer to update parameters through back-propagation [31], and the hyper-parameters were a learning rate of 0.01, a dropout probability of 50%, and a mini-batch size of 64. The ground-truth label was one-hot encoded over two classes for use in a cross-entropy loss function. The particulars of DL and the model building process are described in Appendix 1.
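The per-subject summary statistics that feed the Cox longitudinal summary model can be illustrated as below. This is a minimal sketch: the use of the sample SD, the 0/1 coding of categorical variables, and the fallback of 0.0 for a single measurement are assumptions, as is the example FBG series.

```python
from statistics import mean, stdev

def _sd(values):
    # ASSUMPTION: sample SD; defined as 0.0 when only one check-up exists
    return stdev(values) if len(values) > 1 else 0.0

def summarize_continuous(values):
    """Mean, SD, minimum, and maximum of one continuous variable over check-ups."""
    return {"mean": mean(values), "sd": _sd(values),
            "min": min(values), "max": max(values)}

def summarize_categorical(values):
    """Mean and SD of one categorical variable, assumed to be coded 0/1."""
    return {"mean": mean(values), "sd": _sd(values)}

# Hypothetical FBG values (mg/dL) from three check-ups of one subject
fbg_summary = summarize_continuous([90, 100, 110])
print(fbg_summary)  # {'mean': 100, 'sd': 10.0, 'min': 90, 'max': 110}
```

Each subject thus contributes a fixed-length row of covariates regardless of how many check-ups were recorded, which is what allows a standard Cox model to be fitted on repeated measurements.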
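The loss and optimizer named above can be sketched in plain Python for a single scalar parameter. The learning rate of 0.01 comes from the text; the decay rate of 0.9 and the epsilon value are common defaults assumed here, and none of the function names are taken from the authors' implementation.

```python
import math

def one_hot(label, n_classes=2):
    """Encode a class index as a one-hot vector (two classes in this study)."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; decay and eps are assumed defaults, lr is from the text."""
    cache = decay * cache + (1.0 - decay) * grad ** 2   # running mean of squared gradients
    param = param - lr * grad / (math.sqrt(cache) + eps)
    return param, cache

loss = cross_entropy(one_hot(0), softmax([0.0, 0.0]))   # uninformative prediction
new_param, new_cache = rmsprop_step(param=1.0, grad=0.5, cache=0.0)
```

For an uninformative two-class prediction the loss equals ln 2, which gives a convenient reference point when monitoring training.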
Converting the output variables for longitudinal study
We needed to examine how an LSTM could be used to determine the occurrence of disease at a specific point in time. As in previous studies using vector variables, we converted the binary output variable into multi-class output variables of vector type [32-35]. Through this conversion, each case was analyzed for every year to identify the specific point in time at which the disease occurred. In the output layer, each node expresses a time interval from 1 to 10 years, at intervals of 1 year. The value of each node is the survival probability for that point in time. The probability of survival after disease occurrence is 0, and for censored cases, the probability of a disease occurring after the disease-free survival time is estimated by the Kaplan-Meier survival function. The predicted outputs are the probabilities of survival at each time [34].
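One way to build such a target vector is sketched below. It assumes integer follow-up times in years (at least 1) and, for censored subjects, fills the years after censoring with the conditional Kaplan-Meier survival S(t)/S(c); that specific rule is an assumption about how the Kaplan-Meier function enters, and the function names are hypothetical.

```python
def kaplan_meier(times, events, horizon=10):
    """Kaplan-Meier survival S(1), ..., S(horizon) for integer-year times.

    events[i] is 1 if subject i developed the disease at times[i], 0 if censored.
    """
    surv, s = [], 1.0
    for t in range(1, horizon + 1):
        at_risk = sum(1 for ti in times if ti >= t)
        onsets = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if at_risk > 0:
            s *= 1.0 - onsets / at_risk
        surv.append(s)
    return surv

def survival_target(time, event, km, horizon=10):
    """Survival-probability target vector with one node per year."""
    target = []
    for t in range(1, horizon + 1):
        if t < time or (t == time and not event):
            target.append(1.0)            # observed to be disease-free
        elif event:
            target.append(0.0)            # at and after disease onset
        else:
            # ASSUMPTION: censored subjects take the conditional KM survival S(t)/S(c)
            s_c = km[time - 1]
            target.append(km[t - 1] / s_c if s_c > 0 else 0.0)
    return target

# Toy cohort of four subjects followed for up to 4 years
km = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0], horizon=4)
print(km)                                     # [0.75, 0.75, 0.375, 0.375]
print(survival_target(3, 1, km, horizon=4))   # [1.0, 1.0, 0.0, 0.0]
print(survival_target(2, 0, km, horizon=4))   # [1.0, 1.0, 0.5, 0.5]
```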
Solution to the problem of understanding classification decisions
To address the inability of the model to explain the reasons for its classifications, and to identify the effects of input variables, we used layer-wise relevance propagation (LRP) [36], one of the Explainable Artificial Intelligence (XAI) techniques used in artificial neural networks [37,38].
The variables were ranked in descending order of their mean LRP output value across all input samples. With n feature variables, m input samples, and prediction model outputs o = {o_1, …, o_m}, the ranking of feature variables was expressed as follows.
Using this technique, we demonstrated the influence of feature variables that were used for building the model.
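The ranking step described in this section (averaging LRP relevance over samples and sorting in descending order) can be sketched as follows. The relevance scores are assumed to be already computed by an LRP implementation; the matrix values and feature names below are toy inputs, not study results.

```python
def rank_features(relevance, feature_names):
    """Rank features by mean LRP relevance over all input samples.

    relevance: m rows (input samples) by n columns (feature variables).
    """
    m = len(relevance)
    means = [sum(row[j] for row in relevance) / m for j in range(len(feature_names))]
    # Sort in descending order of mean relevance
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)

# Toy relevance matrix: 2 samples by 3 features (illustrative values only)
toy_relevance = [
    [0.2, 0.5, 0.1],
    [0.4, 0.7, 0.1],
]
ranking = rank_features(toy_relevance, ["BMI", "FBG", "SBP"])
print([name for name, _ in ranking])  # ['FBG', 'BMI', 'SBP']
```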
Evaluation of prediction performance
The performance of the constructed model was evaluated in the validation dataset, which included 20% of the subjects. We evaluated the area under the curve (AUC) for every year by comparing the survival probability from the Cox regression model and the probability from the DL model against the actual outcome. Thus, by calculating the time-dependent AUC for each year, we confirmed the predictive performance of the Cox regression and DL models [39,40]. Calibration was assessed by comparing observed with predicted event probabilities.
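A simplified version of a time-dependent AUC is sketched below to make the idea concrete. It treats subjects with onset by year t as cases and subjects still followed beyond year t as controls; it ignores subjects censored before t and applies no inverse-probability-of-censoring weights, so it is a simplification rather than the estimator used in the cited methods [39,40].

```python
def time_dependent_auc(surv_prob_at_t, times, events, t):
    """Simplified cumulative/dynamic AUC at year t.

    surv_prob_at_t[i]: predicted probability that subject i is disease-free at t.
    ASSUMPTION: no censoring weights; subjects censored before t are ignored.
    """
    cases = [1.0 - s for s, ti, e in zip(surv_prob_at_t, times, events)
             if e == 1 and ti <= t]
    controls = [1.0 - s for s, ti in zip(surv_prob_at_t, times) if ti > t]
    if not cases or not controls:
        return float("nan")
    wins = 0.0
    for risk_case in cases:
        for risk_control in controls:
            if risk_case > risk_control:
                wins += 1.0               # concordant pair
            elif risk_case == risk_control:
                wins += 0.5               # tie counts as half
    return wins / (len(cases) * len(controls))

# Toy data: two onsets (years 3 and 5) and two subjects disease-free at year 10
auc_5y = time_dependent_auc([0.2, 0.9, 0.4, 0.8], [3, 10, 5, 10], [1, 0, 1, 0], t=5)
print(auc_5y)  # 1.0
```

Repeating the calculation for t = 1, …, 10 yields the yearly AUC curve that is compared between the two models.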
Statistical tools
All statistical analyses were conducted using the SAS version 9.4 (SAS Inc., Cary, NC, USA) and R version 3.3.3 (www.R-project.org) statistical software packages.
Ethical statement
This study was approved by the Institutional Review Board (IRB) of the Yonsei University, Severance Hospital, Seoul, Korea (IRB no. 4-2016-0383). The requirement for informed consent was waived by the IRB as de-identified data was used for analyses.
RESULTS
Characteristics of the subjects
The mean age of subjects in the training set was 51.8±9.1 years, and 149,723 (55.82%) were male (Table 1). The mean BMI, SBP, and DBP were 23.9±2.9 kg/m², 126.0±17.6 mm Hg, and 79.3±11.6 mm Hg, respectively. Laboratory test results showed that mean FBG was 90.8±12.6 mg/dL, TC was 199.6±37.3 mg/dL, and proteinuria was present in 1.51% of the total subjects. Among the subjects, 85,774 (31.98%) were current smokers and 118,972 (44.35%) were regular alcohol drinkers; 113,809 (42.43%) exercised regularly. The mean duration of follow-up of the cohort was 10.4±1.7 years; 2.9±1.0 national health check-ups were performed during this period. The minimum and maximum health check-up frequencies per person were 2 and 5, respectively. Among the selected individuals, 23,420 (8.7%) were diagnosed with T2DM during the follow-up period.
The characteristics of the validation and training sets were similar (Supplementary Table 2). In addition, the incidence of T2DM was similar between the training and validation sets (Supplementary Table 3).
Hazard ratio for new onset T2DM in the Cox model
The Cox longitudinal summary model was used to estimate the hazard ratio (HR) and 95% confidence interval (CI) of clinical variables affecting new onset T2DM among subjects in the training set (Table 2). Various variables that significantly increased the HR for T2DM were identified. In particular, the HR of family history of DM (HR, 1.523; 95% CI, 1.462 to 1.586), age (HR, 1.369; 95% CI, 1.348 to 1.391), smoking (HR, 1.355; 95% CI, 1.308 to 1.405), personal history of heart disease (HR, 1.343; 95% CI, 1.254 to 1.439), and proteinuria (HR, 1.217; 95% CI, 1.090 to 1.359) were prominent among the variables. Conversely, the HR was significantly lower for male individuals (HR, 0.809; 95% CI, 0.773 to 0.846), alcohol drinkers (HR, 0.844; 95% CI, 0.816 to 0.873), and those who exercised (HR, 0.876; 95% CI, 0.852 to 0.901).
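For readers less familiar with how such estimates arise, the HR and its 95% CI follow from a Cox coefficient β and its standard error as HR = exp(β) and CI = exp(β ± 1.96·SE). The sketch below applies this to the reported family history estimate; the standard error is back-calculated from the published interval, so it is an approximation and not a value taken from the source.

```python
import math

def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) for a Cox coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=1.96):
    """Approximate the standard error of log(HR) from a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)

# Family history of DM as reported: HR 1.523 (95% CI, 1.462 to 1.586)
beta = math.log(1.523)
se = se_from_ci(1.462, 1.586)          # back-calculated, hence approximate
hr, lower, upper = hazard_ratio(beta, se)
print(round(hr, 3), round(lower, 3), round(upper, 3))
```

The reconstructed interval agrees with the published one to within rounding, which confirms that the reported CI is symmetric on the log scale.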
Clinical variables frequently observed in DL-based models
While constructing the DL-based model, the most frequently observed clinical variables in subjects with new T2DM were listed using the LRP algorithm (Table 3). Most of the variables were found to be similar to the risk factors of T2DM identified in the conventional model. However, the family or personal history related variables were not included in the 10 most frequently listed variables in the two methods.
Comparison of prediction models
The prediction performance of the Cox and DL-based prediction models was compared. The results demonstrated the performance of the DL-based model to be superior to that of the Cox model across all observation periods (Fig. 1).
The discriminative performances measured by AUC for 5 years were 0.842 (95% CI, 0.832 to 0.852) and 0.877 (95% CI, 0.869 to 0.885) in the Cox and DL models, respectively. In addition, the discriminative performances measured by AUC for 10 years were 0.807 (95% CI, 0.801 to 0.813) and 0.827 (95% CI, 0.821 to 0.833) in the Cox and DL models, respectively. Among the two predictive models, the DL-based model showed higher sensitivity for 5 years at 81.6% (95% CI, 79.8 to 83.4), and specificity, at 76.5% (95% CI, 76.2 to 76.8). This model also demonstrated higher sensitivity for 10 years, at 75.1% (95% CI, 73.9 to 76.2) and specificity, at 74.0% (95% CI, 73.7 to 74.4). The detailed analysis results of these two models have been summarized separately (Supplementary Table 4). The calibration results of both the models have also been summarized separately (Supplementary Fig. 2).
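Sensitivity and specificity at a given horizon depend on the risk threshold used to call a subject high-risk. A minimal sketch of the calculation is given below; the threshold of 0.5 and the toy risks and outcomes are illustrative assumptions, since the operating point used in the study is not stated in this section.

```python
def sensitivity_specificity(risks, outcomes, threshold):
    """Sensitivity and specificity of predicted risks at a chosen threshold.

    outcomes[i] is 1 if subject i developed T2DM within the horizon, else 0.
    """
    tp = fn = tn = fp = 0
    for risk, outcome in zip(risks, outcomes):
        predicted_positive = risk >= threshold
        if outcome == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy example: three subjects with onset and three without (illustrative values)
sens, spec = sensitivity_specificity(
    risks=[0.9, 0.8, 0.3, 0.2, 0.6, 0.1],
    outcomes=[1, 1, 1, 0, 0, 0],
    threshold=0.5,
)
```

In this toy example two of three cases and two of three non-cases are classified correctly, so both measures equal 2/3.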
DISCUSSION
Effective screening of high-risk subjects in the population, and evidence-based interventions will help in improving public health, and will reduce the burden of T2DM on the national health care system [3,4]. Establishing public health system based interventions in countries or regions known to be at high risk of T2DM, including Korea, is expected to provide considerable benefits to the population. It is essential to develop an accurate model for predicting T2DM for achieving these goals.
However, many previous studies were not based on subjects representative of the general population, and their accuracy using conventional methodology was not satisfactory. In addition, since various factors influence the occurrence and exacerbation of T2DM, predictive models constructed using few variables have low power, while models including an excess of variables are complex and cumbersome, and are unsuitable for use in the clinic [41]. Most large studies have included individuals with specific ethnic or national backgrounds, and their findings are not generalizable to other populations [42]. Therefore, the existing DM prediction models have not provided fully satisfactory accuracy in the Korean population [11,12]. One recent study showed that the C-statistics of the models for DM risk at 10 years were 0.71 (95% CI, 0.70 to 0.73) for men and 0.76 (95% CI, 0.75 to 0.78) for women [12]. The use of artificial intelligence based technologies for disease prediction has facilitated the introduction of DL-based diabetes prediction models in recent years [19-22]. These studies show that the performance of DL-based prediction models for T2DM is favorable; however, their advantages over existing models are not remarkable. A study using data from the Korean National Health and Nutrition Examination Survey reported an AUC of 80.11 for DL-based prediction of T2DM [21]. This corresponds to an accuracy of about 80%, which is similar to that of previous models. In a study based on the electronic medical records of 8,454 subjects, the performance in predicting the 5-year risk of DM was similar to that of the traditional model [19].
We conducted this study to address the limitations of previous studies. This study is particularly remarkable in that it is based on a large cohort representative of the population of Korea. Various variables such as anthropometric parameters, personal behavior, past medical history, family history, and laboratory tests were utilized in model development. Additionally, long-term follow-up data for approximately 10 years were available for outcome evaluation. Moreover, careful statistical analysis facilitated the presentation of time-dependent AUCs of the two models, and the clinical variables affecting the occurrence of T2DM in the DL model. Consequently, both models provided reliable results. In particular, the DL-based model performed better than the conventional Cox model. The short-term predictive power of the DL-based model was also excellent, with an AUC as high as 0.877 at 5 years. The most important implication of this study lies in the development of a highly accurate DL-based prediction model that is universally applicable to Korean adults aged over 40 years. This provides the considerable advantage of being able to easily and accurately assess the future yearly risk of developing diabetes in the nationwide population. In recent years, there has been a free ongoing trial service in Korea that offers predictive tests using this model for the future risk of diabetes in health checkup recipients who agree to undergo testing. In the future, if its feasibility is established, the service will be provided free of charge to all Koreans.
The Korean Diabetes Prevention Study is currently being conducted in Korea to evaluate the clinical utility of preventive interventions for high-risk patients with diabetes [43]. If the results are conducive, and independent evidence for the prevention of diabetes is established based on national screening projects, Korea will be able to provide a system for systematic screening of high-risk populations and preventive interventions. The provision of diabetes prediction systems to the entire national population based on artificial intelligence, and efforts for the dissemination of evidence-based interventions are rarely observed worldwide. Therefore, we believe that it is necessary to introduce the Korean model to global researchers, and to discuss the future impact on public health.
This study has certain limitations. First, the accuracy of the long-term prediction close to 10 years is lower than that of the short-term prediction of 5 years or less. Second, the inaccuracy of claims-based research may be debated. Third, since not all subjects necessarily undergo national health checkups, certain errors may have been introduced. Most of the currently available variables have been included in the model; however, the adequacy of the type and number of variables is difficult to estimate. For instance, the detailed classification of personal behavior and family history of chronic disease was difficult; it is possible that the influence of these variables was not accurately calculated. Additionally, the family history of DM had a significantly high HR in the conventional model but was not prominent in the DL-based model. This is a notable limitation since no mechanisms were available to explain these results based on the current DL-based model. Therefore, the results of this study did not completely shift the existing paradigm. We hope that these limitations may be addressed by determining outcomes over longer terms, more detailed clinical phenotyping, application of better analytical methodologies, and inclusion of variables that have recently been updated. In particular, we speculate that the addition of individual genomic, microbiomic, and pertinent biomarker data will maximize its predictive power.
Despite these limitations, we successfully constructed a DL-based prediction model based on a representative nationwide cohort, which may easily and accurately predict the risk of T2DM in all members of the general population; we also demonstrated its good performance. This prediction model has already been used among some national health screening examinees in Korea. To the best of our knowledge, this is the first global instance of implementation of a DL-based diabetes prediction system for the entire national population. It is possible that the considerable burden of diabetes may be eventually reduced in Korea if evidence-based personalized preventive interventions are realized in future.
Supplementary Materials
Supplementary materials related to this article can be found online at https://doi.org/10.4093/dmj.2020.0081.
Notes
CONFLICTS OF INTEREST
No potential conflict of interest relevant to this article was reported.
AUTHOR CONTRIBUTIONS
Conception or design: I.J.C., S.E.L., H.J.C.
Acquisition, analysis, or interpretation of data: J.M.S., S.K.
Drafting the work or revising: S.Y.R., J.M.S.
Final approval of the manuscript: H.J.C.
FUNDING
None
Acknowledgements
The authors would like to thank professor Jeong-Taek Woo of Kyung Hee University for his exceptional discourse and inspiration, which encouraged us to conduct the present study.