By Alexander Dokumentov, PhD; Yassien Shaalan, PhD; Piyapong Khumrin, PhD, MD; Krit Khwanngern, PhD; Anawat Wisetborisut, PhD; Thanakom Hatsadeang, BSc; Nattapat Karaket, BSc; Witthawin Achariyaviriya, BSc; Sansanee Auephanwiriyakul, BSc; Nipon Theera-Umpon, PhD; and Terence Siganakis, MSc
This article discusses emerging trends and challenges related to automatic clinical coding. We introduce an automatic coding system that assigns short ICD-10 codes (restricted to the first three characters, which define the category of the disease) based only on the drugs prescribed to patients. We show that even with such limited input data, the accuracy levels are comparable to those achieved by entry-level clinical coders, as reported by Seyed Nouraei et al.1 We also examine the standard method of performance estimation and speculate that the actual accuracy of our coding system is even higher than estimated.
Clinical coding involves assigning universal codes to medical records: ICD-9, explained by Melissa Wei et al;2 ICD-10, introduced next and further explained by Donna Cartwright;3 and the upcoming ICD-11, discussed by Carla Smith et al.4 Coding supports comparability in the collection, processing, and presentation of health statistics, and makes it easy to store, retrieve, compare, and analyze health information for evidence-based decision-making. The increase in complexity and granularity of the codes from one classification version to the next makes accurate clinical coding increasingly difficult. This growing technical challenge can degrade performance for health systems that operate activity-based funding models, as shown in the study of Pamela Baxter et al,5 which can result in financial losses, as explained by Charland Kim, Morgan Haefner, and Lani Knight et al.6,7,8 Recent advances in machine learning (ML), a branch of artificial intelligence based on the idea that systems can learn from data by identifying patterns and making decisions with minimal human intervention, and more specifically in deep learning, a subset of ML techniques whose networks are capable of learning unsupervised from large amounts of unstructured or unlabeled data, as explained in Jürgen Schmidhuber's study,9 provide great potential for developing effective systems that partially automate the clinical coding process and support the sustainability of clinical coding activities.
This study was conducted between June 2019 and January 2020. We were granted approval (No. 6152/2019) by the Research Ethics Committee of the Faculty of Medicine at Chiang Mai University to use anonymized data for the purposes of this study. We use ML to build two automatic coding systems. ML approaches learn and improve from experience without being explicitly programmed and make predictions on unseen data. Although various types of patient-related data are available, we built a novel supervised ICD-10 prediction model using only prescribed-drug data. Among the reasons for using such data are:
- Prescribed drugs appear to be very informative for predicting ICD-10 codes, as prescribing is often the last step of an episode of care.
- The data is mostly complete (few missing values) per diagnosis.
- The problem is challenging, as the association between drugs and diseases is not one-to-one (e.g., prescriptions for co-morbidities).
Neural networks (NNs) are artificial networks consisting of multiple layers, each with many neurons (learning units), that attempt to simulate the human brain, as shown by Zenon Waszczyszyn.10 Inputs pass through the layers in sequence, with the output of one layer fed into the next. We adopt NNs to extract drug-disease associations from the data and predict ICD-10 codes. The use of NNs is driven by a few factors, the main one being the inherent high complexity of the problem due to the nonlinearity of the relations found in drug-disease associations. NNs can automatically learn hidden, intrinsically complex features without any manual, hand-crafted feature engineering. Moreover, NNs are by now thoroughly investigated and well established, with a wide range of supporting software and well-written documentation.
To the best of our knowledge, this work is the first to address ICD-10 code prediction from this angle. Previous work related to diagnosis prediction, studied by Julia Medori et al, Svetla Boytcheva, and Keyang Xu et al,11,12,13 relied mainly on discharge notes and used very specialized datasets that do not contain many diverse, complex real-world cases. These techniques range from rule-based approaches (e.g., coding frequency and gender-specific rules) using naive Bayes classifiers, to text-based approaches, as shown by Julia Medori, to coding free-text death certificates using term-based concepts (SNOMED CT) with a support vector machine (SVM) classifier, as shown in Shihong Yue's study.14 Moreover, in Jürgen Schmidhuber's study, medical terminologies (UMLS) were used to formulate features for training the models, while in the study of Bevan Koopman et al,15 ICD-10 codes were matched to diagnoses extracted from discharge letters using a multiclass SVM. However, SVMs cannot support the extreme multiclass, multilabel setting that this paper tackles. On the other hand, Julia Medori incorporated more data sources (structured, semi-structured, and unstructured) and employed an ensemble method integrating all modality-specific models to predict codes. However, this approach requires large amounts of data, suffers from high dimensionality, and was only tested on a very small subset of 32 frequent ICD-10 codes. In summary, these models are too specific to a small subset of real-life clinical coding and reflect neither the real complexity nor the true accuracy of coding prediction.
We used clinical data (inpatient and outpatient datasets) from the electronic health records of Maharaj Nakhon Chiang Mai Hospital (Thailand), which was recorded between 2006 and 2019. Table 1 contains a few important statistics for each available dataset.
Our main task, predicting the set of ICD-10 codes assigned to a patient, belongs to a class of ML problems called multilabel, multiclass classification. We estimate the accuracy of the trained systems by randomly splitting the data into training and test (holdout) subsets. The training subset is used to extract the knowledge (to train the systems), and the test subset is used only to check the accuracy of the predictions, as in Ron Kohavi's study.16 For evaluation, we use the Jaccard similarity score to measure the accuracy of the predictions. The Jaccard similarity score was introduced in Paul Jaccard's study17 and is calculated as an average of per-case scores over the test set, where each case's score is the ratio of two numbers: the number of correctly predicted ICD-10 codes and the number of ICD-10 codes in the union of the correct and predicted sets:
$$J = \frac{1}{N}\sum_{i=1}^{N}\frac{|T_i \cap P_i|}{|T_i \cup P_i|},$$

where $N$ is the number of cases, $T_i$ is the set of correct ICD-10 codes for case $i$, and $P_i$ is the set of predicted ICD-10 codes for case $i$.
The main reason for this choice is that the Jaccard similarity score, as shown by Krzysztof Dembczyński et al,18 captures in the most intuitive way both types of ICD-10 coding errors: undercoding and overcoding.
For example, Table 2 contains three rows. The Jaccard similarity score is calculated as the average of the three per-case scores, 0.25, 0.5, and 0.25, and equals 1/3 (33.3 percent).
The first row gives an example of undercoding (the predicted codes are a subset of the correct codes), the second row is an example of overcoding, and the third case is a mixture of both.
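The per-case arithmetic above can be reproduced with a short sketch. The code sets below are hypothetical stand-ins for the rows of Table 2, chosen only to produce the same three patterns (undercoding, overcoding, and a mixture):

```python
def jaccard_score(correct: set, predicted: set) -> float:
    """Per-case score: |intersection| / |union| of correct and predicted codes."""
    if not correct and not predicted:
        return 1.0
    return len(correct & predicted) / len(correct | predicted)

# Hypothetical ICD-10 code sets mirroring the three patterns described above:
cases = [
    ({"A01", "B02", "C03", "D04"}, {"A01"}),         # undercoding -> 0.25
    ({"A01", "B02"}, {"A01", "B02", "C03", "D04"}),  # overcoding  -> 0.5
    ({"A01", "B02"}, {"A01", "C03", "D04"}),         # mixture     -> 0.25
]
scores = [jaccard_score(c, p) for c, p in cases]
overall = sum(scores) / len(scores)  # 1/3, i.e., 33.3 percent
```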
However, when we predict the primary diagnosis (a single code), we use a different accuracy measure:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{p_i = t_i\},$$

where $p_i$ is the predicted primary diagnosis for case $i$ and $t_i$ is the correct one.
For example, in Table 3, there are three rows, and, in two cases, the primary diagnosis is predicted correctly; thus, the accuracy is 2/3 (66.6 percent).
We use NNs to train two automatic coding systems. The structures of both NNs are the same and appear in Figure 1. In both cases, the neural networks predict only the first three characters of ICD-10 codes.
Both models are feedforward (FF). These models are called feedforward because information flows through the function being evaluated from x (the input), through the intermediate computations (hidden layers) used to define f, and finally to the output y, as explained by Han Jun et al.19 Our proposed NN model comprises two trainable layers. The size of the input layer equals the number of drugs in the dataset. Since the number of drugs varies between the inpatient and outpatient datasets, the sizes of the input layers differ accordingly: 4,986 for the inpatient dataset and 3,008 for the outpatient dataset. The size of the output layer equals the number of three-character ICD-10 code prefixes in each dataset: 1,941 for the inpatient dataset and 1,751 for the outpatient dataset. The input is weighted in the hidden layer by weights learned through the training process. An activation function then transforms the weighted sum of the inputs into the output of that layer. The Rectified Linear Unit (ReLU) is a piecewise linear function that outputs the input directly if it is positive and zero otherwise, as shown by Abien Fred Agarap.20 We chose this activation function because it overcomes the vanishing gradient problem (when gradients of the loss function approach zero), allowing models to learn faster and perform better. The hidden layer of our model contains 600 neurons with ReLU activation, followed by a dropout layer with a rate of 0.35.
Since the first NN is trained to predict sets of ICD-10 codes, its loss function is binary cross entropy, as explained by Shie Mannor,21 while the second NN, which predicts a single ICD-10 code, uses categorical cross entropy as its loss function. Both NNs were trained with the Adam optimizer and a batch size of 2,048.
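A minimal sketch of one inference-time forward pass through this architecture may clarify the layer arithmetic. The layer sizes here are illustrative stand-ins for the full 4,986-input inpatient model, the weights are random rather than trained, and the sigmoid output activation is an assumption consistent with the binary cross entropy loss used for the multilabel network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper's inpatient network uses 4,986 inputs,
# 600 hidden ReLU units, and 1,941 outputs.
n_drugs, n_hidden, n_codes = 10, 6, 8

W1 = rng.normal(scale=0.1, size=(n_drugs, n_hidden))  # input -> hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_codes))  # hidden -> output weights
b2 = np.zeros(n_codes)

def forward(x):
    """Inference-time forward pass. Dropout (rate 0.35) is active only
    during training, so it is omitted here."""
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU: max(0, weighted input)
    logits = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> per-code probability

# Multi-hot input vector: drugs 1, 4, and 7 prescribed in this episode.
x = np.zeros(n_drugs)
x[[1, 4, 7]] = 1.0
probs = forward(x)  # one probability per candidate ICD-10 prefix
```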
During the prediction phase, for the first network, the set of outputs with probability greater than 0.5 is taken as the predicted ICD-10 codes. When no neuron outputs a value greater than 0.5, the neuron with the maximal value is taken as the only prediction. For the second network, the neuron with the maximal value is always taken as the prediction.
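The decision rule for the first (multilabel) network can be sketched as follows; the probability vectors are hypothetical:

```python
import numpy as np

def predict_codes(probs, threshold=0.5):
    """Decision rule for the multilabel network: keep every code whose
    probability exceeds the threshold; if none does, fall back to the
    single highest-scoring code."""
    idx = np.flatnonzero(probs > threshold)
    if idx.size == 0:
        idx = np.array([int(np.argmax(probs))])
    return idx

predict_codes(np.array([0.9, 0.2, 0.7]))  # codes 0 and 2 pass the threshold
predict_codes(np.array([0.1, 0.4, 0.3]))  # none pass -> argmax, code 1
```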
Experimental Results and Discussion
The data was collected between 2006 and 2019 from the Maharaj Nakorn Chiang Mai Hospital medical record system. Python version 3.522 was used for all computations, and the Keras deep learning library23 was employed for the implementation. Pre-processing, building the model (layers and activation functions), and training were performed on a 2-GPU (GTX 1060 6GB) machine. The inpatient dataset comprises 5 million medication records, with an average of 13.47 prescriptions per patient; the outpatient dataset comprises 3 million medication records, with 3.22 prescriptions on average. The data was extracted by grouping all prescriptions related to each episode-of-care ID. The only pre-processing applied to the medications dataset was to remove canceled prescriptions and to binarize the prescriptions into multilabel sparse vectors.
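The binarization step can be illustrated with a small pure-Python sketch over hypothetical drug names (the actual pipeline operated on the hospital's drug codes and produced sparse vectors):

```python
# Hypothetical episodes of care: each entry is the set of drugs prescribed
# during one episode (canceled prescriptions already removed).
episodes = [
    {"paracetamol", "amoxicillin"},
    {"metformin"},
    {"paracetamol", "metformin", "insulin"},
]

# Build a drug vocabulary, then one multi-hot (binary) row per episode.
vocab = sorted({drug for episode in episodes for drug in episode})
X = [[1 if drug in episode else 0 for drug in vocab] for episode in episodes]

# vocab: ['amoxicillin', 'insulin', 'metformin', 'paracetamol']
# X[0]:  [1, 0, 0, 1]  (amoxicillin and paracetamol prescribed)
```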
Table 4 presents the prediction accuracy for both inpatient and outpatient datasets, respectively.
As mentioned above, for accuracy testing, we followed the well-established procedure explained in Ron Kohavi's study, which we believe gives a very conservative estimate of accuracy. We speculate that, in our case, the actual accuracy is higher by about 15 percent to 20 percent. The existence of label noise in a dataset has many potential negative consequences, such as increased model complexity and degraded prediction accuracy, as shown by Benoît Frénay et al.24 To get a rough estimate of the accuracy over a noisy test set, we assume that there is a correct test set T and a noisy test set T'. The noisy test set T' is obtained from T by randomly changing the coding of every case: some correct codes are removed, and some new random incorrect codes are added. This procedure is performed in such a way that the Jaccard similarity between T and T' is about 70 percent (based on Haefner,7 we can speculate that our datasets contain at least 30 percent errors).
Suppose a set of predictions P has a Jaccard similarity score of 60 percent with the correct test set T. Then every case t from T has, on average, 60 percent of its labels guessed correctly (out of the union of the labels of t and the corresponding prediction p). Of that 60 percent of correct labels, only about 42 percent (60 percent of 70 percent) are deemed correct against the noisy set T' (since T' contains noise). If we assume that the probability of the algorithm guessing the noise labels in T' is very low, then an algorithm with an actual accuracy of 60 percent will show an accuracy of only 42 percent on the noisy dataset T'.
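The arithmetic behind this estimate is simply multiplicative under the stated assumption that the model rarely reproduces the noise itself:

```python
# If the model's true Jaccard accuracy against the clean test set T is 60%,
# and the noisy labels in T' agree with the clean ones at about 70%, the
# observed score is roughly the product of the two.
true_accuracy = 0.60      # score against the clean test set T
label_agreement = 0.70    # Jaccard similarity between T and the noisy T'
observed = round(true_accuracy * label_agreement, 2)  # 0.42
```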
Thus, the figures in Table 4 are very conservative estimates of the real prediction performance; according to the reasoning above, they correspond to an approximate accuracy as high as 55 percent to 65 percent over clean test sets.
Limitations of the Study
ML approaches suffer from some limitations in practical applications. One of them is that new, complex, or rare cases cannot be handled directly by the system, as it has no prior familiar examples to learn from. This means that even if a much more advanced system is accepted for production use, it will replace only a limited amount of coding work, leaving the most complex, new, and borderline cases for professional clinical coders to confirm.
Other limitations concern the technical side of the approach. One is that the assigned ICD-10 codes are treated as a set, which is not always appropriate, as some ICD-10 codes can act as modifiers. Another is that only the first three characters of ICD-10 codes are used for prediction; this limitation comes from the restricted data we had available (we are considering predicting full ICD-10 codes in the future). The last limitation is that all data rows are considered without any chronological order, while, in practice, predictions are made using only past data.
Our internal evaluation of the system shows that, currently, the system allows for short-listing and automatically recommending relevant ICD-10 codes. This can improve the performance and accuracy of professional medical coders and save both time and effort in the process. Another possible application of the system is ICD-10 auditing, which is a critical procedure carried out by medical authorities to assess the quality of coding in health organizations. This can be done by the system through short-listing complex and extreme cases for human investigation.
Alexander Dokumentov, PhD, (email@example.com) is lead data scientist at Growing Data in Melbourne, Australia.
Yassien Shaalan, PhD, (firstname.lastname@example.org) is lead data scientist at Growing Data in Melbourne, Australia.
Piyapong Khumrin, PhD, MD, (email@example.com) is an assistant professor in the Biomedical Informatics Center, Faculty of Medicine, at Chiang Mai University in Chiang Mai, Thailand.
Krit Khwanngern, PhD, (firstname.lastname@example.org) is an assistant professor in the Department of Surgery, Faculty of Medicine at Chiang Mai University in Chiang Mai, Thailand.
Anawat Wisetborisut, PhD, (email@example.com) is an assistant professor in the Department of Family Medicine, Faculty of Medicine at Chiang Mai University in Chiang Mai, Thailand.
Thanakom Hatsadeang, BSc, (Thanakom.firstname.lastname@example.org) is a researcher in the Department of Computer Engineering, Faculty of Engineering at Chiang Mai University in Chiang Mai, Thailand.
Nattapat Karaket, BSc, (email@example.com) is a researcher in the Department of Electrical Engineering, Faculty of Engineering at Chiang Mai University in Chiang Mai, Thailand.
Witthawin Achariyaviriya, BSc, (firstname.lastname@example.org) is a researcher in the Biomedical Engineering Institute at Chiang Mai University in Chiang Mai, Thailand.
Sansanee Auephanwiriyakul, BSc, (email@example.com) is a researcher in the Department of Computer Engineering, Faculty of Engineering at Chiang Mai University in Chiang Mai, Thailand.
Nipon Theera-Umpon, PhD, (firstname.lastname@example.org) is an assistant professor in the IT Department of Maharaj Nakhon Chiang Mai Hospital in Chiang Mai, Thailand.
Terence Siganakis, MSc, is the CEO of Growing Data in Melbourne, Australia.
1. Nouraei, Seyed, Virk, Jagdeep, Hudovsky, Anita, Wathen, Christopher, Darzi, Ara, and Parsons, Darren. “Accuracy of clinician-clinical coder information handover following acute medical admissions: implication for using administrative datasets in clinical outcomes management.” Public Health, no. 2 (2016):38:352-62.
2. Wei, Melissa, Luster, Jamie, Chan, Chiao-Li, and Min, Lillian. “Comprehensive review of ICD-9 code accuracies to measure multimorbidity in administrative data.” BMC Health Services Research, (2020).
3. Cartwright, Donna. “ICD-9-CM to ICD-10-CM Codes: What? Why? How?.” Adv Wound Care (New Rochelle), no. 2 (2013):10:588–592.
4. Smith, Carla, Sue Bowman, and Julie A. Dooling. “Measuring and Benchmarking Coding Productivity: A Decade of AHIMA Leadership.” Journal of AHIMA, (2019).
5. Baxter, Pamela, Sarah Hewko, Kathryn Pfaff, Laura Cleghorn, BJ Cunningham, Dawn Elston, and Greta Cummings. “Leaders’ Experiences and Perceptions Implementing Activity-Based Funding and Pay-for-Performance Hospital Funding Models: A Systematic Review.” Health Policy, (2015).
6. Kim, Charland. “Measuring Coding Accuracy and Productivity in Today’s Value-Based Payment World.” Journal of AHIMA, (2017).
7. Haefner, Morgan. “Higher coding productivity linked to a 25.4% decrease in accuracy.” Becker’s Hospital Review, (2017).
8. Knight, Lani, Rachel Halech, Corrie Martin, and Lachlan Mortimer. “Impact of changes in diabetes coding on Queensland hospital principal diagnosis morbidity data.” (2011).
9. Schmidhuber, Jürgen. “Deep learning in neural networks: An overview.” Neural Networks, no. 61 (2015):85-117.
10. Waszczyszyn, Zenon. “Fundamentals of Artificial Neural Networks.” Neural Networks in the Analysis and Design of Structures, (1999):1-51.
11. Medori, Julia, and Cédrick Fairon. “Machine learning and features selection for semi-automatic ICD-9-CM encoding.” (2010).
12. Boytcheva, Svetla. “Automatic Matching of ICD-10 codes to Diagnoses in Discharge Letters.” (2011).
13. Xu, Keyang, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Piyush Mathur, Frank Papay, Ashish K. Khanna, Jacek B. Cywinski, Kamal Maheshwari, Pengtao Xie, and Eric P. Xing. “Multimodal Machine Learning for Automated ICD Coding.” (2019).
14. Yue, Shihong, Li, Ping, Hao, Peiyi. “SVM classification: Its contents and challenges.” Applied Mathematics, no. 3 (2003):(18):332–342.
15. Koopman, Bevan, Guido Zuccon Anthony, Nguyen Anton Bergheim, and Narelle Grayson. “Automatic ICD-10 classification of cancers from free-text death certificates.” (2014).
16. Kohavi, Ron. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” International Joint Conference on Artificial Intelligence, no. 2 (1995):14:1137-1143.
17. Jaccard, Paul. “Étude Comparative de la Distribution Florale Dans Une Portion des Alpes et des Jura.” Bulletin de la Société vaudoise des sciences naturelles. no. 37 (1901): 547–579.
18. Dembczyński, Krzysztof, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. “On label dependence and loss minimization in multi-label classification.” Machine Learning no. 2 (2012):88:5-45.
19. Han, Jun, Moraga, Claudio, Sinne, Stefan. “Optimization of feedforward neural networks.” Engineering Applications of Artificial Intelligence, no. 2 (1996):9:109-119.
20. Agarap, Abien Fred. “Deep learning using rectified linear units (relu).” (2018), arXiv preprint arXiv:1803.08375.
21. Mannor, Shie, Dori Peleg, and Reuven Rubinstein. “The cross entropy method for classification.” International conference on machine learning (2005):561-568.
22. Python 3.5, (2015), https://www.python.org/downloads/release/python-350/.
23. Keras 2.2.4 (2018), https://github.com/keras-team/keras/releases/tag/2.2.4.
24. Frénay, Benoît, and Michel Verleysen. “Classification in the presence of label noise: a survey.” IEEE transactions on neural networks and learning systems no. 5 (2013):5:845-869.