by Robert Hoyt, MD, FACP; Steven Linnville, PhD; Stephen Thaler, PhD; and Jeffrey Moore, PhD
Following the passage of the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, electronic health records were widely adopted by eligible physicians and hospitals in the United States. Stage 2 meaningful use menu objectives include a digital family history but do not stipulate how that information should be used. A variety of data mining techniques now exist for these data, including artificial neural networks (ANNs) for supervised or unsupervised machine learning.
In this pilot study, we applied an ANN-based simulation to a previously reported digital family history to mine the database for trends. A graphical user interface was created to display the input of multiple conditions in the parents and output as the likelihood of diabetes, hypertension, and coronary artery disease in male and female offspring. The results of this pilot study show promise in using ANNs to data mine digital family histories for clinical and research purposes.
One of the most significant scientific achievements of the past two decades was the completion of the Human Genome Project in 2003.1 As a result, genetic links to common diseases such as age-related macular degeneration, multiple sclerosis, and Alzheimer’s disease have been established.2 Despite the treasure trove of data generated from this effort and the decreasing cost of whole-genome sequencing, multiple ethical, legal, and social challenges exist. Furthermore, because of the complexity of the human genome, significant questions remain regarding how to interpret the results. Genetic tests are best for single gene disorders with high penetrance, but they account for only a tiny percentage of chronic disorders and are therefore poor tests for screening. The reality is that most chronic diseases are polygenic disorders that have low penetrance and are influenced by multiple environmental factors. Dr. Eric Green, the director of the National Human Genome Research Institute, stated in 2011, “At the moment, the biggest challenge is in data analysis. We can generate large amounts of data very inexpensively, but that overwhelms our capacity to understand it. At the other end of the spectrum, we need to infuse genomic information into medical practice, which is really hard. There are issues around confidentiality, education, electronic medical records, how to carry genomic information throughout lifespan and make it available to physicians.”3
While the challenges of the Human Genome Project are being addressed and clarified, some experts recommend using the routine family health history to predict future diseases and conditions. Some have referred to the family history as the “first genetic test.”4 Additionally, the information from family histories has been shown to be important for investigation of diseases with a genetic component.5–7 For most chronic diseases, a positive family history increases the relative risk of disease in offspring two to five times over the baseline risk, particularly if more than one first-degree relative has the condition and the age of onset is early.8
Prior to the adoption of electronic health records, obtaining a family history was infrequent and time consuming, and the resulting data were not structured or computable. The situation changed with the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, which established a reimbursement program for eligible professionals (EPs) and eligible hospitals (EHs) that used certified electronic health records (EHRs) and complied with meaningful use objectives.9 As of December 2014, 509,250 EPs and 4,801 EHs had registered for the Medicare and Medicaid EHR reimbursement program.10
In stage 2 meaningful use, one of the menu objectives is to “record patient family health history as structured data,” and the measure standard is “more than 20 percent of all unique patients seen by the EP during the EHR reporting period have a structured data entry for one or more first-degree relatives.” Data standards required to support structured data are the HL7 Pedigree Standard and the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT).11 Therefore, digital family histories are expected to emerge as part of EHRs, but what will be done with the data?
Digital family histories and whole-genome sequencing should be considered forms of clinical decision support, which is part of the EHR of the future. The goal would be to alert and inform clinicians and patients about the probabilities of future diseases and conditions. Data mining tools would be necessary to link a knowledge base with actual patient information in order to either describe a condition or make a prediction. The two main categories of data mining are supervised machine learning and unsupervised machine learning. In the former, one assumes that the data classes are known ahead of time, whereas in unsupervised learning the system is presented with data and develops classes or clusters. Supervised learning can perform predictive modeling based on dependent and independent variables, similar to logistic regression.12
One interesting type of data mining involves the use of artificial neural networks (ANNs) or neural networks, which are capable of both supervised and unsupervised machine learning. Neural networks use computational units that are analogous to the biological neuron. Such computational neurons are connected unidirectionally, may operate in parallel, and behave as simple switching elements, which fire when supplied a threshold level of integrated input signal. The neuron can receive multiple inputs (similar to dendrites), which are processed and weighted to generate a single output (analogous to an axon). The overall network may be viewed as a nonlinear mathematical transformation that maps input to output patterns.
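The switching behavior of a single computational neuron can be illustrated with a minimal sketch. The weights and threshold below are hypothetical values chosen for illustration and are not taken from the study's network.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 (fire) when the weighted sum of the inputs reaches the threshold."""
    integrated = sum(x * w for x, w in zip(inputs, weights))
    return 1 if integrated >= threshold else 0


# Hypothetical weights: both inputs must be active for the unit to fire.
weights = [0.6, 0.6]
threshold = 1.0

outputs = [
    threshold_neuron([a, b], weights, threshold)
    for a in (0, 1)
    for b in (0, 1)
]
print(outputs)  # [0, 0, 0, 1]
```

With these particular values the unit fires only when both inputs are present, because a single input contributes 0.6, which is below the threshold of 1.0.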
In the supervised learning model, training patterns are repeatedly propagated through the net to produce outputs differing from those appearing in the training data. Such output error serves as the basis of a backward propagating wave that iteratively corrects connection weights until the net’s output pattern closely matches the patterns represented in the data.13 Neural networks are now in mainstream use, with common applications in voice and handwriting recognition. Neural networks have been applied to the field of medicine in four ways: predictive modeling, signal processing, diagnosing, and prognosticating. Neural networks have been used in almost every medical subspecialty field, such as radiology (image pattern recognition), cardiology (electrocardiogram analysis), and neurology (electroencephalogram analysis).14
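The iterative weight correction described above can be sketched as a small backpropagation loop. This is an illustrative toy under stated assumptions (one hidden layer of two sigmoid units, a fixed learning rate, no momentum, and a made-up four-row training set); it is not the network or the training package used in the study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    # Each weight list ends with a bias term.
    hidden = [
        sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
        for row in w_hidden
    ]
    output = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden)) + w_out[-1])
    return hidden, output


def rms_error(data, w_hidden, w_out):
    squared = [(t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data]
    return math.sqrt(sum(squared) / len(squared))


def train(data, n_hidden=2, learning_rate=0.5, epochs=3000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial = rms_error(data, w_hidden, w_out)
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Output error scaled by the sigmoid derivative.
            delta_out = (target - output) * output * (1 - output)
            # Propagate the error backward to each hidden unit.
            delta_hidden = [
                delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                for j in range(n_hidden)
            ]
            for j in range(n_hidden):
                w_out[j] += learning_rate * delta_out * hidden[j]
                for i in range(n_in):
                    w_hidden[j][i] += learning_rate * delta_hidden[j] * x[i]
                w_hidden[j][-1] += learning_rate * delta_hidden[j]
            w_out[-1] += learning_rate * delta_out
    return w_hidden, w_out, initial, rms_error(data, w_hidden, w_out)


# Made-up Boolean training patterns (inputs, target), for illustration only.
toy_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_hidden, w_out, rms_before, rms_after = train(toy_data)
print(rms_before, rms_after)
```

The root-mean-square error is reported before and after training so that the effect of the backward error pass on the fit is visible.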
We previously reported our experience with a digital family history collected from a unique cohort of older men who were Vietnam-era repatriated prisoners of war and a comparison group. This article builds on the previous study, published in 2013.15 A digital family history of first-degree relatives was created using an online survey tool. The participant who took the survey reported on the health of parents, siblings, and children, and this information was exported to a spreadsheet, facilitating analysis with cross-tabulation.16 This effort was labor intensive, so it was postulated that neural networks might be a means of mining this rich data. This pilot study reports on the results of evaluating the digital family health history database with neural networks, as compared to cross-tabulated results.
The study population consisted of 319 male Vietnam-era veterans, which included 253 who were repatriated prisoners of war as well as 66 in a comparison group, matched for gender, age, education, and combat roles in Vietnam. The average age at the time of survey completion was 70 ± 6 years. These individuals visited the Robert E. Mitchell Center for Prisoner of War Studies, located in Pensacola, Florida, on a near-annual basis. The program has been in existence since 1973, with some repatriated prisoners of war having 42 years of longitudinal physical and psychological data.17 This project was approved by the institutional review board, and all patients signed a consent form. Of 447 potential participants who were e-mailed, 319 (71 percent) agreed to complete the survey.
The survey data included information on 2,412 individuals from three generations. With 709 children excluded, 1,703 male and female adults were included in the results. The data included 319 sets of parents (638 individuals) and 1,065 male and female offspring.
The 319 adult male survey participants reported on the health of themselves, their parents, their siblings, and their children. Figure 1 shows the breakdown of participants by generation and gender. Children were excluded because the pilot neural network was designed to include only the parents and the parents’ male and female offspring.
With children excluded, the baseline prevalence of type 2 diabetes (DM) in parents and their offspring was 10 percent, that of hypertension (HTN) was 31 percent, and that of coronary artery disease (CAD) was 6 percent.
To review the survey content and face validity, we convened an expert panel consisting of a university-based geneticist, a private genetic counselor, a neuropsychologist, and an experienced internal medicine physician to determine the appropriate survey design and the selection of common medical and psychiatric diseases with a genetic component. A literature review was also undertaken to determine the availability and relevance of existing family history questionnaires. We also benchmarked our efforts with the recommendations made by the 2008 American Health Information Community’s Family Health History Multi-Stakeholder Workgroup.18 A commercial survey instrument (SurveyMonkey) was used to create the web-based survey.19 The survey had the following sections:
3. Mother’s health
4. Father’s health; questions identical to the mother’s health section.
5. Sibling health; questions identical to the mother’s health section.
6. Children’s health; questions identical to the mother’s health section.
The data collection period was from May 2012 to June 2013. The collection tool was online, so participants could complete the survey at home, or they could complete the survey in Pensacola, Florida (a midsized city in the Southeast region of the country), during their annual medical follow-up examination.
Further details regarding how the survey was created, tested, and privacy protected were reported in our 2013 study.20
Training data were largely converted to Boolean format, with 1s and 0s respectively denoting presence or absence of a disease, whereas other data were represented as real numbers in the range between 0 and 1.
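A minimal sketch of this kind of encoding follows. The field names are hypothetical, and min-max scaling is an assumption: the text states only that non-Boolean data were represented as real numbers between 0 and 1, not how they were scaled.

```python
def encode_record(record, disease_fields, scaled_fields):
    """Turn one survey record into a list of numbers in the range 0 to 1."""
    vector = []
    for name in disease_fields:
        # Presence or absence of a disease becomes 1 or 0.
        vector.append(1.0 if record[name] else 0.0)
    for name, (low, high) in scaled_fields.items():
        # Assumed min-max scaling, clipped to the range 0 to 1.
        value = (record[name] - low) / (high - low)
        vector.append(min(1.0, max(0.0, value)))
    return vector


# Hypothetical record and field names, for illustration only.
record = {"mother_dm": True, "father_dm": False, "mother_htn": True, "age": 70}
vector = encode_record(
    record,
    disease_fields=["mother_dm", "father_dm", "mother_htn"],
    scaled_fields={"age": (40, 100)},
)
print(vector)  # [1.0, 0.0, 1.0, 0.5]
```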
Using a proprietary neural network training package called PatternMaster, a thousand trial ANNs, with randomly generated hidden-layer architectures and learning parameters, were rapidly generated, trained, and tested on the basis of generalization accuracy using set-aside data. During this automated testing, a separate ANN learned to map all network and training parameters to the anticipated generalization accuracy. This latter net was then stochastically interrogated to determine the network architecture, learning rate, and momentum that provided the most accurate predictions.21 This optimal net was trained to a root-mean-square prediction error of 0.01 and exported both as a spreadsheet, whose cells functioned as neurons, and as a C-code function.
The spreadsheet-based neural net allowed for transparency and rapid experimentation. The latter feature proved valuable in determining how best to vary free input parameters after certain parameters were chosen to be kept fixed in the model. To this end, we developed two approaches and wrote macros to systematically vary the free inputs. The first of these methodologies, called MonteCarlo, varied free parameters via a “loaded” computational coin flip that reflected the disease’s prevalence in the training data. Therefore, to simulate a condition occurring 20 percent of the time within the data, the disease parameter was set to 1 if a random number, in the range [0, 1], fell below 0.2. The other approach, called Variational, extracted Boolean values by randomly accessing a row of the original training data and extracting values from relevant data fields.
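The two ways of varying free inputs can be sketched as follows. This is an illustration only: the function names, field names, and the stand-in model are hypothetical, and the stand-in model is a simple made-up formula rather than the trained network.

```python
import random


def monte_carlo_fill(fixed, free_prevalence, rng):
    """Set each free input with a loaded coin flip weighted by its prevalence."""
    inputs = dict(fixed)
    for name, prevalence in free_prevalence.items():
        inputs[name] = 1 if rng.random() < prevalence else 0
    return inputs


def variational_fill(fixed, free_names, training_rows, rng):
    """Copy the free inputs from one randomly chosen row of the training data."""
    row = rng.choice(training_rows)
    inputs = dict(fixed)
    for name in free_names:
        inputs[name] = row[name]
    return inputs


def interrogate(model, fill, n_trials):
    """Average the model output over repeated random fills of the free inputs."""
    return sum(model(fill()) for _ in range(n_trials)) / n_trials


def stand_in_model(inputs):
    # Hypothetical placeholder for the trained network's output.
    return 0.05 + 0.25 * inputs["mother_dm"] + 0.10 * inputs["father_htn"]


rng = random.Random(42)
fixed = {"mother_dm": 1}  # input held fixed by the user
# Hypothetical training rows in which father_htn occurs 20 percent of the time.
training_rows = [{"father_htn": 1}] + [{"father_htn": 0}] * 4

mc_estimate = interrogate(
    stand_in_model,
    lambda: monte_carlo_fill(fixed, {"father_htn": 0.2}, rng),
    n_trials=10000,
)
var_estimate = interrogate(
    stand_in_model,
    lambda: variational_fill(fixed, ["father_htn"], training_rows, rng),
    n_trials=10000,
)
print(mc_estimate, var_estimate)
```

In this toy setting both fills draw the free input with the same frequency, so the two averaged outputs should come out close to each other.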
In the end, we found that both approaches gave similar results, but the overhead of Microsoft Excel slowed the stochastic interrogation of the model significantly, so that each run took 5 to 15 seconds. To overcome the issue of execution speed, both the Excel macros and the ANN C module were converted to C# and compiled into an executable with a more intuitive graphical user interface (GUI). Using the GUI, the presence or absence of a disease in either parent could be indicated by a checked or a blank check box, respectively. Floating parameters could be indicated using the third state of these boxes, shown as solid blue in Figure 2.
In the pilot phase we opted to study only the offspring (sons and daughters) of the parents. A variety of chronic diseases and conditions with a genetic component could be used as input for the mother and/or father, and the risk of diabetes (DM), hypertension (HTN), and coronary artery disease (CAD) would be displayed as output for male and female offspring. Figure 2 provides an example of the output based on input consisting of common medical problems such as diabetes and hypertension.
From the family history survey data, individuals were examined through cross-tabulation of a single disease type (DM, HTN, or CAD) at a time; cross-tabulation of multiple diseases and/or any other variable resulted in too few data points for statistical analysis. Cross-tabulations were done of the effect on the offspring of the mother having the disease, the father having the disease, both parents having the disease, or neither parent having the disease. Using these cross-tabulations, we then conducted a series of odds ratio (OR) analyses of the parental effect of each disease type on the male and female offspring, comparing having the disease (either parent or both positive) with not having the disease (both parents negative). This process also included calculating the likelihood ratios, disease-type base rates, and 95 percent confidence intervals (CIs) for comparison with the ANN output for each disease type.
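An odds ratio with a 95 percent confidence interval can be computed from a two-by-two cross-tabulation as sketched below. The interval uses the standard log-odds (Woolf) approximation, which is an assumption here because the text does not state which method was used, and the counts are made up for illustration.

```python
import math


def odds_ratio_with_ci(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    "Exposed" means offspring with a positive parental history and "cases"
    means offspring who have the disease.
    """
    a, b, c, d = exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases
    if min(a, b, c, d) == 0:
        # The ratio or its standard error is undefined when a cell is empty.
        raise ValueError("all four cells must be nonzero")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
or_value, ci_low, ci_high = odds_ratio_with_ci(20, 80, 10, 190)
print(round(or_value, 2), round(ci_low, 2), round(ci_high, 2))
```

Small cell counts inflate the standard error term, which is why intervals widen quickly when the sample is small.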
The small number of data points also complicated the comparison between the cross-tabulation outputs and the ANN outputs, so we were forced to average the ANN outputs for each disease type. The ANN GUI required that smoking status (never smoked, quit smoking, or current smoker) be marked, which lowered the number of parents analyzed. For example, when we evaluated parents who never smoked and who were not diabetic and compared the ANN output to the cross-tabulation, only 20 parents out of a pool of 638 fit the criteria in the cross-tabulation. Those 20 parents had 20 daughters and 20 sons, so the sample size was too small for statistical analysis, making comparison with the ANN output difficult. We therefore averaged the neural network output across all smoking categories (i.e., took the arithmetic mean of the three smoking outputs) for each of the results to maintain a sufficient number of data points in the analyses of the cross-tabulations.
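The averaging step amounts to a simple arithmetic mean over the three smoking settings. The sketch below uses hypothetical output values, not results from the study.

```python
SMOKING_STATUSES = ("never smoked", "quit smoking", "current smoker")


def average_over_smoking(outputs_by_status):
    """Arithmetic mean of the network output across the three smoking settings."""
    return sum(outputs_by_status[s] for s in SMOKING_STATUSES) / len(SMOKING_STATUSES)


# Hypothetical network outputs for one disease and one parental history.
hypothetical_outputs = {
    "never smoked": 0.10,
    "quit smoking": 0.14,
    "current smoker": 0.18,
}
averaged = average_over_smoking(hypothetical_outputs)
print(round(averaged, 2))  # 0.14
```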
For the male offspring, the mother having the disease significantly affected them for DM (OR = 4.11), for HTN (OR = 2.33), and for CAD (OR = 5.19). This finding means a fourfold odds increase in DM, a twofold odds increase in HTN, and a fivefold odds increase in CAD. The odds increased when both parents had the disease, particularly for HTN (OR = 5.16) and CAD (OR = 10.90). For the female offspring, although the mother having the disease significantly affected them for DM (OR = 10.43) and for CAD (OR = 4.46), just the father having the disease significantly affected them for DM (OR = 8.44) and for CAD (OR = 5.3). Having both parents with the disease significantly affected them for DM (OR = 14.78) and for HTN (OR = 7.90). No instances of both parents having CAD were found among the female offspring.
Confidence intervals were wide because of the small sample size, and likelihood ratios were small, reflecting the small sample size and the lower-than-average prevalence of chronic diseases in this unique cohort. More than half (67 percent) of the averaged neural network results fell within the 95 percent CIs of the base rates for each of the identified diseases.
We also compared the family history inheritance trends reported in the literature with our results. Odds ratios and neural networks demonstrated that the largest increase in diabetes among offspring occurred when either the mother had DM or both parents had DM. These results reflect findings reported in the literature. Although the neural network result for DM in male offspring was 0.44, the literature suggests that it may be as high as 0.50 in male and female offspring when both parents are diabetic, so it is possible that the neural network produced more accurate results.22
To our knowledge, this is the first report of data mining of a digital family history database with the use of a neural network simulation. Our model is based on training for multiple inputs, but the output was limited to only three common disease entities, chosen because of their high prevalence and widely reported genetic component. A fully operational model would include more outputs and perhaps the ability to incorporate risk factors of both the parents and offspring. The results using neural networks correlate in general with cross-tabulation results and the medical literature, but are limited by the small sample size and low prevalence of chronic disease.
The evidence thus far indicates that inclusion of the family history has several potential benefits in healthcare. The family history can identify genetic trends, even before specific gene variants or single nucleotide polymorphisms are identified. For example, a family history of chronic obstructive pulmonary disease (COPD) is a strong risk factor for the development of COPD in offspring, in the absence of any culprit genes identified thus far.23 Also, evidence suggests that smoking increases the risk of developing type 2 diabetes in the individual24 and fetal exposure to smoking by either parent increases the risk of obesity and type 2 diabetes downstream in adult female offspring,25 presumably through an epigenetic mechanism.26 Using the neural networks model, we demonstrated a twofold increase in diabetes in male offspring if the mother or both parents smoked. Because of the small sample size, we were not able to reliably compute the same with cross-tabulation.
The family history should also assist in population and public health, particularly in assessing the future risk of cancer and common chronic diseases, which have a genetic component.27–30 Of note, when both parents were negative for DM, HTN, or CAD the inheritance in male and female offspring was lower than the base rate, demonstrating the high specificity of the family history, which has been reported.31 Furthermore, the family history should aid in patient education because studies have shown that patients frequently have an inaccurate idea of their future risk of cancer based on their family history.32 Lastly, with the movement toward “personalized medicine” and “precision medicine,” both genomic sequencing and data mining of the family history are likely to be helpful in tailoring medical treatments.33 The use of family history as clinical decision support is in its infancy and to our knowledge is not available as part of any commercial EHR system. All previous research using family histories as clinical decision support has involved standalone programs, not integrated with EHRs.34
Limitations of the family history should be pointed out. Collecting and maintaining a family history takes time, although using patient portals to input patient histories may lessen the burden on clinicians. Family histories may be inaccurate and subject to recall bias and may be limited by a patient’s low educational status or poor family communication. The National Institutes of Health held a conference in 2009 regarding the role of the family history in improving health. Among the conclusions was that the use of family histories for predicting common conditions has low sensitivity and predictive ability but high specificity (that is, it is better for ruling out conditions).35 Additionally, evidence suggests that knowing the family history may have only a modest effect on changing behavior.36
The actual database we used for training the neural networks also had limitations. The participants who took the survey were male Caucasians with a high socioeconomic status and a low prevalence of common chronic diseases. Also, there were significantly fewer female siblings than male siblings, for unknown reasons. Importantly, the database included 2,412 individuals, but when multiple filters were applied, the actual sample size available for data mining was frequently small.
Neural networks provide an interesting alternative to other prediction models such as logistic regression. Both can be utilized for dichotomous outcomes. Neural networks are not limited by a constrained mathematical relationship between the dependent and independent variables, and they can therefore model complex nonlinear relationships. Our evaluation of neural networks was limited by choosing single disease entities in the parent, such as diabetes, without other common comorbidities, which is not realistic. Neural networks also have limitations such as the requirement of significant computational resources and the potential for model “overfitting”; also, the model development tends to be empirical.37 Moreover, in a study comparing logistic regression with ANNs, Clermont et al. noted that the sample size needed to be in the range of 1,200 for adequate prediction from either method.38 However, evidence suggests that neural networks can be very accurate, even with small data sets, but must be calculated correctly.39 This study used a Monte Carlo simulation method, in which thousands of additional calculations were performed to improve accuracy.40 As noted in the results section, neural network predictions regarding the prevalence of DM in offspring with a mother or both parents having DM closely matched the results found in the medical literature. Therefore, neural networks may actually be more accurate than cross-tabulations for small data sets.
An interesting new informatics development is the HL7 standard known as FHIR (Fast Healthcare Interoperability Resources). This standard will allow sharing of clinical decision support and the creation of applications (apps) that interact with EHRs. One of the FHIR resources involves family history, so apps could be developed that mine the family history data as a form of clinical decision support linked to the EHR by an open application programming interface.41, 42
The preliminary data from this pilot study provide evidence that neural networks may be valuable as a means to mine data from family histories for clinical and research purposes. For this approach to be used clinically, data standards such as SNOMED-CT must be in place, along with a means to integrate data with the electronic health record. Neural network software could be hosted remotely on a server and accessed through web services. Another option would be a family history analytical application that utilizes the new FHIR standard. From a research perspective, we believe that if neural networks are applied to a very large digital family history of patients reflecting the population at large, this data mining technique may uncover genetic trends heretofore unrecognized.
In the future, clinicians will likely be able to combine family history data, genomic data, and phenotypic data from the electronic health record into a more accurate method of disease prediction and personalized medicine. Further studies are warranted on larger and more typical patient cohorts to validate the accuracy of neural networks for data mining digital family histories and to establish causal relationships to chronic disease.
Dr. Thaler was supported under a grant from the Robert E. Mitchell Foundation in Pensacola, FL.
Robert Hoyt, MD, FACP, is the director of the Health Informatics Program at the College of Science, Engineering and Health at the University of West Florida in Pensacola, FL.
Steven Linnville, PhD, is a research psychologist at the Robert E. Mitchell Center for Prisoner of War Studies in Pensacola, FL.
Stephen Thaler, PhD, is the Founder and Chief Scientist at Imagination Engines in St. Charles, MO.
Jeffrey Moore, PhD, is a neuropsychologist at the Robert E. Mitchell Center for Prisoner of War Studies in Pensacola, FL.
21 Device for the Autonomous Generation of Useful Information. US Patent 5,659,666, issued August 19, 1997.
Robert Hoyt, MD, FACP; Steven Linnville, PhD; Stephen Thaler, PhD; and Jeffrey Moore, PhD. “Digital Family History Data Mining with Neural Networks: A Pilot Study.” Perspectives in Health Information Management (Winter 2016): 1-14.