Abstract
Big data (BD) is of high interest for research and practice purposes because it has the potential to provide insights into the population served and healthcare practices. Much progress has been made in collecting BD and creating tools for big data analytics (BDA). However, healthcare organizations continue to experience challenges associated with BD characteristics and BDA tools. Utilization of BD impacts current decision-making, planning, and future use of artificial intelligence (AI) tools, which are trained on BD. This qualitative study focused on better understanding the reality of BD and BDA management and usage by healthcare organizations. Six structured interviews were conducted with individuals who work with healthcare BD and BDA. Findings confirmed the known challenges associated with BD/BDA and added rich insights into the structural, operational and utilization aspects, as well as future directions. Such perspectives are valuable for education and improvements in BD/BDA management and development.
Keywords: big data, big data analytics, health records, digital data, population health, artificial intelligence
Introduction
The implementation of electronic health records (EHRs) and widespread information systems and applications for providers, consumers, and other parties have led to tremendous growth of electronic health data. The current sources of data include mostly textual content, which can be structured, semi-structured or unstructured. They also include video, audio, and images that constitute multimedia. They can come from a variety of platforms such as machine-to-machine communications, social media sites, sensor networks, cyber-physical systems, and Internet of Things (IoT).1 These platforms begin to define big data (BD) because they make us think about size, volume, complexity, and heterogeneity of the data emanating every second from a variety of devices.
BD arrived sooner than the development of appropriate and efficient analytical methods for its analysis. In addition to the structured data, BD includes massive volumes of heterogeneous data in unstructured text, audio, video, and other formats, and so is not amenable to the inferences of statistical methods that are used for analyzing numerical structured data. Unstructured BD requires new tools for predictive analytics. In addition, there is a need for computationally efficient algorithms to handle the heterogeneity, noise, and massive size of structured BD. These are ways to dispel and/or avoid potential spurious correlations.
Artificial intelligence (AI) and data analytics are top technology priorities as they capitalize on sustainability through data analytics and adaptive AI.2 For over a decade, Mayer-Schönberger and Cukier encouraged datafication of BD, where essentially, virtually anything is transformed into useful data (insights) by documenting, measuring, and capturing digitally.3 Van Dijck asserted that the future of BD and big data analytics (BDA) will lie with machines, where data will be generated, shared, and communicated among data networks.4 After a decade of progress, much of the structured and unstructured data stored in EHRs can be analyzed with the use of natural language processing (NLP) and machine language processing (MLP) algorithms, which can unlock the value of the text and galvanize the extraction of the hidden insights and connectors.1 Transforming unstructured text into real patient insights holds great potential for improving health outcomes. Use of AI and BDA for clinical and non-clinical applications in healthcare has great potential; however, the majority of healthcare organizations have yet to reach the full benefits of their BD. This highlights the need to better understand the status quo of how big data is being handled and analyzed by healthcare organizations. What are some of the ways big data is being used and what are some of the challenges faced by healthcare organizations when it comes to working with big data? A deeper dive into how organizations use big data, how much they invest in big data technologies, and what challenges they experience creates an opportunity to identify and share some best practices, as well as identify potential gaps. Where the findings are translated into real patient insights and where such knowledge fosters better health outcomes, there may be opportunities for positive change in terms of improving population health, addressing health inequalities, improving operations, and reducing healthcare costs.
Background and Significance
Big Data
BD refers to data sets that are so large or complex, with high volume, high velocity, and high variety, that they cannot be processed by traditional data processing software in a reasonable amount of time, thus requiring advanced techniques and technologies for management and analytics.5,6,7,8 BD can be described by characteristics such as volume, variety, velocity, variability, veracity, and value.
BD is inherently defined by big volume.9 The quantity of generated and stored data is usually reported in multiple terabytes and petabytes. A terabyte holds roughly as much data as 1500 CDs or 220 DVDs, or approximately 16 million Facebook photographs. The volume of data in healthcare continues to grow because information is increasingly gathered not only systematically in systems used by hospitals, pharmacies, laboratories, insurance, research institutions, or genetic databases, but also by numerous information sensing IoT devices used by providers, patients, and other parties. The size of the data is believed to account for its value as well as its potential insight. Volume-related challenges concern storage and data management technologies.
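The storage comparison above can be checked with simple arithmetic. The sketch below assumes nominal media capacities that are not stated in the text (700 MB per CD, 4.7 GB per single-layer DVD) and a decimal terabyte. Under those assumptions the results are close to the rounded figures quoted, and the photograph figure implies an average of about 62.5 KB per photo.

```python
# Rough capacity arithmetic; the media capacities are nominal assumptions.
TERABYTE = 10**12          # bytes, decimal (SI) definition
CD_BYTES = 700 * 10**6     # assumed 700 MB CD
DVD_BYTES = 4.7 * 10**9    # assumed 4.7 GB single-layer DVD

cds_per_tb = TERABYTE / CD_BYTES
dvds_per_tb = TERABYTE / DVD_BYTES

# Average photo size implied by "16 million photographs per terabyte".
bytes_per_photo = TERABYTE / 16_000_000

print(round(cds_per_tb), round(dvds_per_tb), round(bytes_per_photo))
```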
The type and nature, or the structural heterogeneity, of the data describes its variety.9 Structured data, mostly tabular data found in spreadsheets and relational databases, constitute about 20 percent of healthcare data.10 Unstructured data includes mostly text, images, audio, and video. Semi-structured data may or may not conform to strict standards and include textual language for Web data exchange, called Extensible Markup Language (XML), that deploys user-defined data tags to make them machine readable. BD variety becomes even more complex given the diverse sources and formats, requiring that data from those sources be connected, matched, cleansed, and transformed.
At the heart of big data is velocity, which measures the rate of data generation and the speed at which the data is analyzed and acted upon to meet the demands and challenges that lie in the path of growth and development of organizations.9 Smart phones, digital sensors, and other devices using mobile apps produce enormous and useful information about customers (or patients) that includes geospatial location, demographics, buying and viewing patterns, and even physical activity or other health indicators tracked by mobile apps. These types of data can be analyzed in real time to harness real-time intelligence.
Another dimension of BD is variability, which implies the inconsistency or variation in the data flow (whereas velocity shows periodic peaks and troughs).9 Variability can hamper processes that manage BD.
Veracity reflects the “truthfulness” of data and was added as a BD characteristic by IBM, given their specialization in removing and replacing BD errors.11 Addressing imprecision and uncertainty becomes relevant for BD because of the inherent unreliability of certain data sources. The quality of captured data may vary tremendously, thus affecting the accuracy of the analysis and results.
Lastly, BD is generally associated with value, which means that when large volumes of BD are analyzed, it is possible to extract high value from them.8 The original form of data has low value, but the information identified through its analysis can make a difference in its value. For that to happen, data should be relevant and of high integrity.
Big Data Analytics
BDA involves the analysis of BD. It is during this process that the value of big data for decision support and business intelligence is realized. Given BD characteristics, BDA cannot be derived by simple statistical analysis.12,13 In fact, advanced BDA tools and extremely efficient, scalable, and flexible technologies are necessary to efficiently manage and analyze the substantial amounts and variety of data.1,14 Technologies such as NoSQL Databases, BigQuery, MapReduce, Hadoop, WibiData, and Skytree have been in use for more than a decade.15 AI tools such as Microsoft Power BI, Microsoft Azure Machine Learning, QlikView, RapidMiner, Google Cloud AutoML, or IBM Watson Analytics are offering greater value in BDA. For example, Microsoft Power BI was successfully used to detect specific antenatal data for babies small for gestational age (SGA) and monitor them through a dashboard, thus allowing clinicians to intervene and plan delivery as necessary.16
BD management entails both the processes and the associated technologies that allow for the acquisition, storage, and retrieval of data, which can be done in three stages: acquisition/recording; extraction, cleaning, and annotation; and integration, aggregation, and representation.17,18 Analytics involves the techniques applied in analyzing and acquiring intelligence from BD and can be completed in two stages: modeling and analysis; and interpretation. It becomes imperative that processing and management should be efficient enough to expose new knowledge in a timely manner, which is crucial for capitalizing on emerging opportunities, in providing a competitive edge, as well as rich business intelligence used to differentiate the organization, increase visibility, flexibility, and responsiveness to environmental changes.19,20,21,22,23 The allure in healthcare BDA is the ability to examine and apply the patterns that emerge from various and vast amounts of healthcare data to predict trends in population health and ways to improve it, while limiting costs. BDA benefits are already visible in reduced administrative costs, improved clinical decision support, better care coordination, reduced fraud and abuse, as well as improved patient wellness.24 Adoption of mHealth, eHealth and wearable technologies will push the increase in BD volume. Increased integration of such data with EHRs, imaging, patient generated data, or sensor data creates even greater opportunities to leverage BD in healthcare.
Much of the BD and BDA research demonstrates success in the use of BD and BDA tools, such as in monitoring SGA babies, the response to COVID in Taiwan, or the use of BD in mental health care.16,25,26 One study also highlights issues with big data privacy.27 Other studies help in understanding BD and BDA concepts through reviews, analyses, and summaries.19,28,29 In our study, we focused on the healthcare organizational structure regarding big data, the approach in integrating big data into operations, issues and challenges experienced, and the vision for BDA. Our research question was “How are healthcare organizations handling BD and BDA?” Better understanding of this reality serves not only to share best practices or challenges but also to inform decisions on resource allocation and opportunities for education of professionals to work with BD and BDA.
Methodology
The purpose of this study was to gain greater understanding of how BD and BDA are handled within healthcare organizations. To gain such perspective, the study evaluated experiences of professionals with healthcare BD and BDA. For this applied research, we followed the case study method, a qualitative research design.30 Case studies help explore an activity or process in depth and allow for detailed data collection through interviews of one or more individuals.31,32 The research was approved by the Institutional Review Board at Walden University.
The sampling strategy was purposeful and convenient. The research team focused on identifying individuals from various settings who worked with BD and BDA. Based on professional connections and LinkedIn profiles, we reached out to nine individuals in such roles (not all at once); over time, only six of them were available to participate in the study. We conducted six structured interviews with individuals whose main work was managing and/or analyzing healthcare big data. The interviews were completed virtually via Zoom and lasted between 45 and 60 minutes each. The principal investigator conducted structured interviews by following the pre-established interview protocol, which included an introduction to the study and researchers, verbal agreement to participate in the study, and questions in order, as presented below. Probes were also used at times to elaborate on some of the answers with further details and/or examples. The other two researchers were present during all interviews, recorded, and took notes. All interviewees were asked the following 11 standard open-ended questions:
- Can you please describe your role and how your organization’s big data team is structured for data collection and data analytics?
- What investments has your organization made to drive or support big data analytics?
- Can you briefly describe the types of questions your organization answers by using big data analytics?
- Can you briefly describe the types of decisions that are based on big data?
- What is your organization’s approach for integrating data analytics into operations?
- Sometimes a game changing opportunity arises, but the opportunity does not get vetted with evidence from the big data. Have you seen this happen in your organization? If so, can you give an example?
- How does your organization use big data to support population health?
- Now I’d like to focus on challenges in using big data. What are some of the frequent problems that big data analysts in your organization encounter?
- What are some solutions or approaches you have employed to overcome those challenges?
- Now, let’s talk about non-healthcare organizations that use healthcare big data.
  - What are your thoughts on how device manufacturers, pharma, and insurance companies benefit from healthcare big data?
  - What are your thoughts on how data companies such as Google, Amazon, and Microsoft benefit from healthcare big data?
- Finally, let’s talk about the future.
  - What are your thoughts on how your organization will use big data in the future?
  - Are there any new tools or resources your organization plans to use to improve the usage of big data and the experience with big data analytics?
  - Given sufficient resources, what is your vision for an effective and efficient data analytics program in your organization?
After each interview, researchers discussed the main points that came out during the interviews. After the sixth interview, it was determined that the saturation point was reached, and no further outreach was made for additional interviews.33
The transcribed interviews were analyzed by using a summative content analysis approach. The summative approach focuses on identifying the essential aspect of the text and has been used successfully in analyzing interviews from healthcare professionals to examine complex text from diverse sources, including innovation in services or technology, which is similar to our research.34 This approach is also accommodating to differences (as opposed to only similarities), which is important in our study, given the diverse roles of interviewees and their experiences with BD and BDA.
Responses were coded based on the topics addressed through questions. Codes were aggregated into concept maps to group related codes into themes and show relations. While the use of standardized open-ended questions facilitated the data organization and analysis, some portions of answers that were provided under a certain question were moved to areas where they fit the topics better. For example, responses to questions 1 through 6 were categorized into: interviewee roles; organizational structure for BD and BDA; purpose of using BD and BDA; and dynamics/processes of using BD and BDA. The rest of the themes, such as use of BD for population health, BD/BDA challenges, approaches in addressing such challenges, use of BD by non-healthcare organizations, and future directions, were consistent with the questions asked. Another important note is that due to the diversity of the interviewees and organizations they represented, response analyses are mostly broken down by the type of organization.
Responses were coded by two researchers independently and discussed. No discrepancies were found, and 100 percent consensus was reached among the research team. All researchers engaged in recording, transcribing, and discussing the text, identifying themes and key points, counting and comparing keywords and/or content, and interpreting the underlying context. Results of the interviews are organized and presented below.
Results
Six interviews were conducted with seven professionals who work with big data in different capacities and settings. To clarify the context of the results, where necessary, responses from interviewees that represented care provider organizations are discussed first, and responses from the quality management and the data platform representatives are summarized right after. Following are the findings from those interviews.
Interviewee Roles
Interviewee roles included the manager of healthcare data analytics at a large healthcare system in Pennsylvania, the chief research information officer at a university hospital in Ohio, the director of analytics and performance measurement along with a team member from a national quality organization in Virginia, a consultant and program manager at a private not-for-profit healthcare system in New Mexico, the senior director of engineering application at a large global data platform company in California, and the director of a data analytics consulting company in Missouri.
Organizational Structure for BD and BDA
Interviewees were asked about the formal organizational structure dedicated to working with BD, and they indicated that there is either a dedicated team/function, or department (such as a data analytics department) that is focused on working with health data. These teams were composed of business analysts, developers, data architects, engineers, clinicians, and occasionally health information specialists, and the size varied from a few to about 100 (the larger numbers correspond to larger health systems and the global data platform company). Additionally, staffing is done with internal employees and consultants. Consolidation of prior data analytics teams into one large function was mentioned by three of the interviewees. Despite the use of external resources, BD work is led and driven internally.
The way these teams function varies significantly, depending on the type and size of organization, as well as resources available. Two interviewees indicated that much of the BD work is conditioned by EPIC, the EHR used in the facility. In those cases, EPIC data and claims data are brought together into a common data governance platform. Physical servers are used, but cloud-based infrastructure is expanding.
How Are BD and BDA Used by Organizations – Purpose
Four interviewees shared that healthcare systems use BD and BDA to respond to regulatory requirements from the federal government, payers, or audit needs, as well as to fulfill executive and business unit requests. Requests mostly follow the industry trends and benchmarking, and a desire to stay ahead of the curve. One of the interviewees went into greater detail that BD and BDA are used to support optimal operations, shared savings, commercial contracts, Medicare shared savings, risk optimization, cost and utilization, as well as quality measures. Another interviewee shared that the organization uses BD and BDA to explore better ways of bundling services so that the facility does not lose money and possibly makes a profit to compensate for communities and services that are harder to pay for. A third interviewee shared that BD and BDA are used for predictive analytics around readmissions or to address questions pertaining to the health of surrounding communities.
How Are BD and BDA Used by Organizations – Dynamics/Processes
Interviews revealed that the way BD/BDA are used varies from one organization to another. The care provider organizations that use EPIC had more in common. They capitalize on the templates and predictive models pushed by EPIC, given they run daily, and provide users with opportunities to act on the findings. Even when templates or models are not fully understood, there is trust in the vendor who provides the idea and tool. Often, such tools are integrated without a clear plan on how the information will be used, as in the case of a model that predicts the risk of a patient dying in the next year. Yet, three interviewees shared that some units have plans, or some have ideas about what they want but have no tool to develop it. Generally, the business side drives the types of analyses by telling IT what’s needed. IT explains what’s possible with the data and tools available. Results of BDA are used as a basis for operational and senior-level decisions, justification of investments, public health, care management, patient outreach, education, vendors, and for potential restructuring of the organization.
The interview with the individuals at the national quality organization showed a different process. Given that they are an organization that creates measures, ideas for quality measures are prioritized, and once decided, a technical expert panel defines the specifications for that measure. Then, the company uses the BD and BDA to apply specifications and test the measure for reliability and validity. For example, an opioid measure is tested, and then adjusted by removing certain populations, such as hospice or cancer patients. Measures are sometimes imposed by the Centers for Medicare and Medicaid Services (CMS), as well as driven by the National Quality Forum. Measures are often risk-adjusted for age, sickness, living location, race, ethnicity, and low-income status for Medicare. BD and BDA are also used to interpret clinical guidance with the data available. Lastly, they are used to maintain measurements; as clinical guidelines or literature review change, measures are re-tested.
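The exclusion step described above (testing a measure, then removing populations such as hospice or cancer patients) can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Patient` fields, the `high_dose_opioid` numerator criterion, and the six made-up records are illustrative assumptions, not the organization's actual measure specification.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    in_hospice: bool
    has_cancer: bool
    high_dose_opioid: bool  # hypothetical numerator criterion


def measure_rate(patients):
    """Return (numerator, denominator, rate) after applying exclusions."""
    # Denominator exclusions: drop hospice and cancer patients.
    eligible = [p for p in patients if not (p.in_hospice or p.has_cancer)]
    numerator = sum(p.high_dose_opioid for p in eligible)
    denominator = len(eligible)
    rate = numerator / denominator if denominator else None
    return numerator, denominator, rate


# Made-up records for illustration only.
patients = [
    Patient("A", False, False, True),
    Patient("B", False, False, False),
    Patient("C", True, False, True),   # excluded: hospice
    Patient("D", False, True, True),   # excluded: cancer
    Patient("E", False, False, False),
    Patient("F", False, False, True),
]

print(measure_rate(patients))
```

Without the exclusions the rate would be computed over all six records, so the choice of denominator directly changes the reported value.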
The other distinct organization, the data platform company, uses BD and BDA to assess how well the client company is using the data. They are able to trace and identify user-errors (as per regulations pertaining to data hosting services), identify faulty software, and use BDA to decide on how to prevent similar errors in the future. Such insight helps build better technologies to manage an organization’s data and test software as needed. Additionally, the company uses BD to understand product features, identify whether the product is working as it should, and proactively check quality of operations in the cloud platform and SAS platform.
How Are BD and BDA Used to Support Population Health?
When asked about how the organizations use BD and BDA to support population health initiatives, responses pertaining to care provider organizations had three areas in common: claims analytics; risk optimization; and quality measures. Claims data are heavily analyzed to identify opportunities for reducing costs and clinical variation, comparing utilization indicators with peers, improving utilization and efficiency, as well as informing and supporting value-based contracts. One of the interviewees shared that geospatial analytics is also used to identify heat map areas in terms of cost-utilization for primary care facilities. Discussion on risk optimization was focused on better documentation of the level of risk, rather than BDA. Quality measures pertaining to the internal patient population are collected and reported. Additionally, there are efforts to understand the populations outside internal data sources. Depending on the request, the organization may include state or national level data that is publicly available. Two interviewees have community partnerships to address issues like health equity and social determinants of health. One organization uses the internal data available to make broad assumptions about the population (although access to the clinical data of that larger population may be limited or not available). Another organization is actively engaged with tribal leaders for outreach to minority communities and better population health management. The latter organization also performs spatial analysis and uses a geographic information system (GIS) and Microsoft platform, QlikView. Additionally, one interviewee shared progress in customizing a wellness program and diabetes predictive model for employees.
The interviewees from the national quality organization shared that they support population health tangentially by creating measures that drive incentives in the marketplace, which then drive health plans to manage population health and intervene as necessary. GIS or mapping algorithms are not used currently, but a machine learning algorithm would help identify the highest risk patients, or those most likely to be impacted.
The data platform company is mostly engaged in data collection exercises to understand people's behaviors and trends in relation to data. For example, spatial analysis is used to monitor air quality during California fires, and decisions can be made accordingly. There is potential to build use-case software that helps healthcare organizations not only monitor health data but also recognize patterns. Additionally, it was pointed out that there is potential to capitalize on data derived from sensors and IoT devices for better management of population health.
Challenges Pertaining to BD and BDA
When asked about the challenges observed in relation to BD and BDA, interviewees identified various aspects that are grouped into four categories: leadership; data literacy; system integration; and data characteristics. Challenges related to data characteristics are organized by volume, variety, velocity, veracity, value, and integrity.
Leadership-Related Challenges
All interviewees shared that organizational leadership is focusing on BD and BDA and dedicated teams (large or small) are in place. However, aside from the data platform company, others have yet to establish clear strategies, alignment of strategy with BD and BDA, and pathways for optimal BD use and collaboration within the various units and external parties. One interviewee said that there is lack of ownership of all required data sources to perform desired analytics, as well as lack of foundational infrastructure to support business needs. Another interviewee pointed out the leadership vacuum in certain areas. For example, in a university hospital, there are three important parties: clinicians; researchers; and administration. Clinicians are data generators, while researchers are data consumers. The administration follows the legal requirements: Family Educational Rights and Privacy Act controls teaching data, Health Insurance Portability and Accountability Act controls patient data, and Institutional Review Boards control research data. Tensions exist over the data management, and trusted relationships need to be developed among the three parties.
Data Literacy Challenges
All interviewees noted a need to improve data literacy across business operations. There are misinterpretations of graphs, and often, decisions are based on assumptions. There is a gap in translating business needs into what is possible to do with existing BD/BDA or how it could be possible. One interviewee mentioned that BD and BDA "is not something you can learn in a book. It is understanding what the data is telling you."
The data platform company shared that most users don’t use proper search terms and cannot do data analysis or build a dashboard by using the SAS platform because they do not know the language to engage the platform. However, the company is working on training users, as well as making the platform easier to use.
System Integration-Related Challenges
Five interviewees brought up that information is siloed. Integrating hospital clinical data with billing data, or claims data, or data from various practices is a challenge. People working with data question practices around patient duplication across the system or even proper physician identification in the multiple databases, given lack of proper integration. There are questions on how to index the data. As per one interviewee, “System standards exist but EHRs are customizable. For example, heparin control could be recorded in four different EHR locations in different organizations depending on the system customization. In the absence of guardrails, interoperability means relatively little; theoretically possible but it's pragmatically difficult because of choice.” There is concern that vendor competition and the market system in the US add to the challenge of integration.
Data Characteristics-Related Challenges
Five interviewees shared that there are challenges associated with handling large amounts of data. Not all organizations are provided with the equipment needed to analyze such volume. As per one interviewee, “Using SAS in our computers or Optum landsite on a remote desktop can be limiting. We don’t use Hadoop or anything like that, where the processing resources are distributed across multiple machines.”
In terms of data variety, interviewees from the healthcare organizations and consulting companies indicated that unstructured data was not yet being used for BDA. The data platform company, on the other hand, indexes unstructured data and makes it structured by following certain schemas. After that, data cannot be changed. There is a risk of corrupting the data, so it is important to understand the data well prior to indexing it.
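The idea of indexing unstructured text against a schema, after which the record cannot be changed, can be sketched as follows. This is an illustration only, not the platform's actual mechanism: the log-line format, the `LINE_PATTERN` schema, and the `IndexedEvent` record are all hypothetical, and immutability is approximated with a frozen dataclass.

```python
import re
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical schema for a free-text device log line.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) device=(?P<device>\w+) hr=(?P<heart_rate>\d+)"
)


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after indexing
class IndexedEvent:
    timestamp: str
    device: str
    heart_rate: int


def index_line(raw: str):
    """Parse one raw line into a structured record, or None if it does not fit the schema."""
    match = LINE_PATTERN.search(raw)
    if match is None:
        return None
    return IndexedEvent(
        timestamp=match["timestamp"],
        device=match["device"],
        heart_rate=int(match["heart_rate"]),
    )


# Made-up input line for illustration.
event = index_line("2024-01-05T10:15:00Z device=watch42 hr=88 note=resting")
print(event)
```

Because the record is fixed once created, a schema that misreads a field (for example, treating a text value as a number) would be baked into the indexed data, which is why understanding the data before indexing matters.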
From the perspective of the data platform company, data velocity presents a challenge. One interviewee says that “based on budgets, there are limits on the daily ingestion rate, and we need to make the incoming data fit into those limits. Data is bursty.” Data flow and need for data varies throughout the day. So, from an operational perspective, decisions need to be made on how much of the system is needed throughout the day for a particular client. At the same time, the infrastructure must be provisioned to handle peak times, and adapted for scaling up and down.
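A daily ingestion limit applied to bursty input can be sketched as a simple budget check. The class name `DailyIngestBudget`, the byte-based limit, and the batch sizes below are assumptions made for illustration; the interviewee did not describe how the limit is enforced.

```python
class DailyIngestBudget:
    """Track bytes ingested against a fixed daily cap (hypothetical policy)."""

    def __init__(self, daily_limit_bytes: int):
        self.daily_limit_bytes = daily_limit_bytes
        self.used_bytes = 0

    def admit(self, batch_bytes: int) -> bool:
        """Accept the batch if it fits in what remains of today's budget."""
        if self.used_bytes + batch_bytes > self.daily_limit_bytes:
            return False  # caller may queue, sample, or drop the batch
        self.used_bytes += batch_bytes
        return True

    def reset(self) -> None:
        """Call at the start of each day to restore the full budget."""
        self.used_bytes = 0


budget = DailyIngestBudget(daily_limit_bytes=100)
bursts = [30, 50, 40, 20]  # made-up batch sizes arriving in bursts
decisions = [budget.admit(b) for b in bursts]
print(decisions, budget.used_bytes)
```

A fixed cap like this handles the budget side only; provisioning for peak times and scaling up and down, as described above, would need to be handled separately.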
Another data challenge aspect that was brought up by two interviewees was incorrect matching of data elements from old legacy systems with new systems. This process is not always accurate, and as per one interviewee, “variability of denominators should be questioned.” There are no tools or sufficient resources to assure data integrity for such issues, and most rely on manual reviews by users and analysts. This brings up data veracity challenges.
One interviewee shared that “a potential problem exists when buying a clinical or administrative dataset or billing dataset, like a market scan. Such data is used to determine that the most cost-effective treatment for an individual who has a heart attack, in general, is to run a drug-eluting stent. However, our population-level studies are subject to our specific population. Given the difference in population heterogeneity, how are we sure that the general treatment works for our group?” This data selection bias results in solutions or recommendations that do not work for certain populations, which diminishes its value.
When asked about overall data integrity, all interviewees addressed security. All comply with HIPAA regulations, but data security is a big challenge and a barrier that still prevents organizations from trusting cloud services. Data security was also mentioned as a constraint to patient matching by one interviewee, who said, “Despite having Medicare and Medicaid datasets, we can match some variables at a level of 30 percent but not the rest because of privacy.” Furthermore, on data security, the interviewee from the data platform company explained that their software can monitor whether a healthcare system is being hacked. Certifications guide them to features that are used and pushed as well as who can access the data. The amount and type of data coming in is a moving target and users need to understand where the data is going. If it goes beyond firewalls, it is inherently vulnerable. Questions about ownership are also discussed as part of agreements: “At what point do we own the data and at what point does the customer own the data? When does the exchange happen? That moment needs to be heavily secured.”
Another issue related to data integrity was data definition, discussed by four interviewees. There are inconsistent data structures and a lack of standardization (mostly due to system customization, as pointed out in the example above). For example, there is no clear definition of a hospital admission in the system. Metadata standards and data dictionaries are also absent. As per one of the interviewees, “one data set has over 10,000 tables. How do you navigate 10,000 tables? How do you find the variables you're looking for? It's there, and analysts have to go digging around to find them.” Clear definitions would also help with query inclusion and exclusion criteria.
Data completeness challenges (another aspect of data integrity) were also identified by four interviewees. As per one interviewee, “health equity relies on race, ethnicity, language data, and that data is not captured well.” Claims data also does not give the whole picture. As per another interviewee, “claims data is not perfect, as it does not include encounters paid in cash.” These comments relate to data capture, availability, and comprehensiveness of a data governance program needed to properly address various healthcare initiatives, such as population health and health equity.
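As an illustration only, the completeness concern raised above can be made measurable with a simple field-level audit. The sketch below is not a method described by any interviewee; the field names (race, ethnicity, language), the list of placeholder values treated as missing, and the sample records are all hypothetical assumptions chosen for the example.

```python
# Minimal sketch of a field-level completeness audit.
# Field names and "missing" markers below are illustrative assumptions,
# not definitions taken from any interviewee's system.
EQUITY_FIELDS = ("race", "ethnicity", "language")
MISSING_MARKERS = {"", "unknown", "declined", "n/a"}


def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_MARKERS


def completeness(records, fields=EQUITY_FIELDS):
    """Return, for each field, the share of records with a usable value."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(not is_missing(r.get(field)) for r in records) / total
        for field in fields
    }


# Hypothetical patient records used only to exercise the function.
patients = [
    {"race": "Black", "ethnicity": "Not Hispanic", "language": "English"},
    {"race": "Unknown", "ethnicity": "", "language": "Spanish"},
    {"race": None, "ethnicity": "Hispanic", "language": None},
    {"race": "White", "ethnicity": "Not Hispanic", "language": "English"},
]
rates = completeness(patients)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} complete")
```

A report of this kind would only flag where capture is weak; deciding which placeholder values count as missing is itself a data definition question that a data governance program would need to settle.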
How Are BD and BDA Challenges Being Addressed?
All interviewees indicated that their organizations have recognized the BD and BDA challenges, and they discussed the work in progress to address them. All plan to add data analyst positions, and some plan to restructure data teams under one leadership. One organization plans to create an analyst class position acting as an intermediary between the “research or operations question” and the data. Such a position would help with better understanding of the business needs and the data needed to support those needs, as well as simplify and create abstractions in the data that could be analyzed.
Five interviewees brought up the need for robust data governance programs, documenting the true sources of data, monitoring data movement, potentially bringing some data in house or potentially using third-party payers to augment data, and using an agile methodology regarding the Master Patient Index project. Four interviewees were evaluating current tools and options offered by the EHR, working to improve the matching of data elements, and bringing solutions to the warehouse or the data layer. This would address existing problems of the visualization layer, which is currently suffering because of the siloed data. From a process perspective, one interviewee said, “more frequent touch points should be added with the management to clarify technical aspects and how to tailor them to meet business needs.” Organizations are also supporting data standardization and integration projects. As per the interviewee from the consulting company, consultants are also helping organizations with data governance and integration issues.
Thoughts on Non-Healthcare Organizations Using BD and BDA
All interviewees were asked about what they think about non-healthcare organizations that use healthcare big data, how device manufacturers, pharma, and insurance companies benefit from healthcare big data, and how data companies such as Google, Amazon, and Microsoft benefit from healthcare big data. Responses were interesting, as they showed a variety of perspectives from the interviewees.
The main emerging theme was an overall positive view of tech companies, including Google, Apple, Amazon, and Microsoft. They are viewed as building value in general and for healthcare. It was pointed out that healthcare can learn from such industries, or even the manufacturing industry, when it comes to BD and BDA. Additionally, industries trying to enter the healthcare space should be collaborating better with the healthcare providers. The interviewee from the data platform company mentioned that sometimes tech companies engage in “fun” activities that produce incidental findings that provide great insight into certain healthcare behaviors. Such capabilities are not tapped but could be with greater collaboration.
The second theme was the need for non-healthcare organizations to be more responsible in working with health data. All interviewees mentioned that there is a risk of misinterpretation of health data. While in the business world that may affect sales of a product, in healthcare it affects patient safety and “it puts patients at risk.” There is a belief (brought up by three interviewees) that when non-healthcare organizations use machine learning to support certain programs, the primary goals are financial, which makes them less trusted in their motives for creating patient programs. One interviewee said that non-healthcare organizations tend to hoard data rather than share it, in order to make more money.
One of the interviewees brought up the idea of distinguishing between clinical and health data. For example, clinical data is generated by clinicians, while health data is generated by medical devices such as Fitbit watches or wearable devices used for continuous glucose monitoring or other medical issues. This same interviewee discussed the economic perspective by saying, “you could not have built these devices if not for the healthcare and technological infrastructure that exists. They don’t pay for data collection, there is no cost of input, and they extract value without contributing to the overall healthcare system. This is a distortion of market factors.” It was believed that players such as these add value, but they are working in an environment without principles. “They should be building value for the public good. As they increase their financial earnings, they have greater potential to make their ideas more tangible. I would like to see some of these successful companies sponsor people who do not have access to money, power, and resources; so, they can also bring their ideas to life and possibly change the healthcare system for the better. That additional level of connectedness that really separates; that creates social disparities should be addressed.”
Future Directions in Relation to BD and BDA
The five interviewees from healthcare organizations indicated that the vision is to invest in BD and BDA infrastructure and technology and expand use of BD while aligning better with strategy. Ideas for investment included: (1) integration of external data sources with internal analytical capabilities; (2) creation of a data lake with potential to coordinate data integration across multiple organizations and allow for data abstraction; (3) development of better approaches for software acquisition, with the goal of continuing to use what’s available in the current data warehouse, despite the new software; (4) formation of teams that specialize in analytics and decision support across the various business units that would guide operational leadership in making informed decisions with data; (5) support for greater patient access to their own data; (6) increased interoperability through data standards; (7) cloud-based expansion; and (8) better data standards for population health needs.
Ideas for expanding BD uses included leveraging data better to monitor population trends, supporting population health initiatives, and reusing the data for multiple types of research. Coordinating the brain power of different and more experts is very important, and “work should be done on signing agreements with other healthcare organizations (including optometry and dental services) to share identifiable data in a HIPAA compliant manner, with proper security layers to protect the data from misuse.” Additionally, there was a suggestion to consider a centralized versus a federated approach, where every clinical area does their own analytical work. With the right tools and knowledge, as well as proper data governance, people can engage in self-service discovery more effectively. Patient engagement was also identified as an important future direction, from the perspective of allowing data sharing for research, as well as finding ways to engage patients when not in clinical care. Furthermore, one interviewee suggested shifting focus from administrative-facing to public-facing conversations, meaning greater engagement with patients and the public regarding health data.
Four interviewees indicated that their organizations are discussing the use of AI but did not have details for specific AI tools. One of the healthcare organizations is using machine learning to predict the likelihood that a patient may develop sepsis. The data platform company already uses machine learning heavily during data acquisition. They follow user requests, such as identifying and acquiring only the data that matches a particular model and discarding the rest. The company plans to continue to evaluate the amount of data being collected and the data filters. Filters are intended to clean the noise but may remove valuable data in the process, so fine-tuning of the algorithms will be ongoing. The balance between too much and not enough data will continue to be monitored, while acknowledging that certain data need to be kept for compliance reasons, as evidence. Additionally, the vision is to elevate the search process and make it easier for the user to run queries and use the data.
Discussion
Our findings add rich context and details in understanding how healthcare organizations are managing BD and BDA. While formal structures for BD and BDA exist in many organizations, approaches to implementing them vary, and organizations are seeking improvement in that area. The variety of structures, tools, and practices in managing BD and BDA can generate questions for further research that could dive deeper into specific operational aspects. As found in the literature review, it is important to develop evaluation tools, comparative studies, best practices, or structural models that will help providers or other users of healthcare BD and BDA to make more informed decisions when they are purchasing BD sets and BDA technologies, hiring staff, or structuring/restructuring their data analytics functions.20,35,36 Decision-makers should be able to evaluate such investment proposals based on well-established criteria and resources (and not solely those served by the vendors). This also highlights opportunities for building trust and improving collaboration with non-healthcare organizations whose strength is working efficiently with BD and BDA.37
Challenges associated with BD and BDA are also consistent with those in the literature review.1,9,11,38,39 Data definition, accuracy, and completeness become even more important when data is aggregated and analyzed at larger scales. Not only does data quality affect current decisions about patient populations, but it also affects future decisions, given that machines learn based on those inaccurate, incomplete, poorly defined data pools. This has multiple implications. First, healthcare organizations need to invest resources in improving the integrity of their existing data pools (and this should go beyond patient matching). Second, everyone who touches and uses health data should be educated and/or trained in managing health data with integrity during all stages, including acquisition, extraction, cleaning, integration, and aggregation. Third, any tools that are used for automated data collection should be carefully evaluated at the beginning and monitored throughout to ensure that data integrity remains intact.
Health data literacy came up as an important aspect of BD/BDA challenges. Lack of skilled staff was pointed out early on and continues to come up in research as the need for upskilling of the workforce is ongoing.40,41,42,43 There is an opportunity for health information professionals to contribute to the process of elevating health data literacy among healthcare professionals, as well as non-healthcare professionals who work with health data. This will also contribute positively to the trust building and collaboration efforts mentioned above.
Closely related to health data literacy is another important finding, the potential to improve data governance programs. Data governance is one of the most important domains for health information professionals. As organizations use their home-grown BD or acquire BD from other organizations, it is imperative to accompany such activities with data dictionaries and relevant terminologies. Other data governance aspects are data lifecycle, data architecture, metadata, data quality, and security, all of which present opportunities for health information professionals’ leadership and sharing of expertise.44
Findings show great efforts in addressing population health issues and health equity, along with the need for more complete and accurate social determinants of health data. AHIMA has already recognized this opportunity and is leading the way in creating relevant standards and identifying specific training needs for healthcare workers.45
In addition to the significance for further research or potential work areas for health information professionals, findings from this study may be useful for educational purposes. Current literature and textbooks used in health information, healthcare management, and healthcare services provide general overviews, discuss the importance and potential of BD and BDA, as well as specific tools used successfully in particular settings.1,37,46 Knowledge about BD and BDA operational and technical aspects is usually part of information technology programs and is currently lacking in the space of health information and other healthcare studies. As per findings, IT professionals are working very closely with health information professionals, clinicians, and administration, but they currently experience barriers in communication, i.e., they do not fully understand each other’s language. It is imperative to bridge this gap and improve data literacy among all clinical and non-clinical participants in healthcare who touch health data or make decisions related to BD/BDA. Details of this study contribute to such a body of knowledge.
Limitations
This study focused on better understanding of BD and BDA operations and practices in healthcare. The open-ended structured interview protocol enabled collection of rich answers filled with details and examples. It also allowed for greater comparability of responses and getting a complete data set for each question or subtopic.30 The questions asked required mostly facts and objective information that reflected the interviewee’s knowledge and experiences, which are recent (since BD and BDA are a recent reality in healthcare). This is a strength as findings rely less on perceptions or subjective data. Findings from this study may also serve as stimulation for new research pertaining to BD and BDA. As with most qualitative studies, findings are not highly generalizable; however, there are opportunities for similar research by expanding the interviewee pool and the types of organizations they represent. While details shared are mostly related to the interviewee’s recent roles with BD and BDA, it is possible that there may have been errors of memory and/or judgment. Certain details may not have come up, which creates a less than full picture of BD and BDA reality. Additionally, the authors recognize the fast-paced technological environment and growth of AI tools between 2023 and 2024 (which is after the interviews were conducted). Such progress has yet to be realized in all settings of the healthcare industry, and the findings from the study are still relevant as pointed out in the Discussion section.
Conclusions
This study provides greater insight into how BD and BDA are being managed and used in various healthcare organizations as well as by vendors servicing healthcare providers. Given the very complex and diverse healthcare landscape in the US, our attempt was not to obtain a full picture of such reality but to better recognize some of the BD/BDA realities. Such knowledge, details, and examples help all who work with health data to better understand their role and potential contribution to the management and use of BD. They also contribute to greater effectiveness and efficiency in processing and using BD and BDA meaningfully in today’s digital healthcare environment. Lastly, some of the findings validate the work and role of health information professionals when it comes to BD and BDA in healthcare.
References
1. Shilo, S., Rossman, H. and Segal, E. “Axes of a revolution: challenges and promises of big data in healthcare.” Nat Med 26, 29–38 (2020). https://doi.org/10.1038/s41591-019-0727-5.
2. Gartner. “Gartner Identifies the Top 10 Strategic Technology Trends for 2023.” (October 2022). Accessed September 2023. https://www.gartner.com/en/newsroom/press-releases/2022-10-17-gartner-identifies-the-top-10-strategic-technology-trends-for-2023.
3. Mayer-Schönberger, Viktor and Cukier, Kenneth. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt. (2013). Accessed October 2022. https://psycnet.apa.org/record/2013-17650-000.
4. Van Dijck. “Datafication, dataism and dataveillance: Big Data between scientific paradigm and ideology.” Surveillance and Society 12 No. 2 (2014). https://doi.org/10.24908/ss.v12i2.4776.
5. Iyamu, Tiko. “Advancing Big Data Analytics for Healthcare Service Delivery.” London; Routledge, 2023.
6. Sivarajah, Uthayasankar, Kamal, Muhammad Mustafa, Irani, Zahir and Weerakkody, Vishanth. “Critical analysis of Big Data challenges and analytical methods.” Journal of Business Research 70, no. C (2017): 263-286. https://doi.org/10.1016/j.jbusres.2016.08.001.
7. Reichman OJ, Jones Matthew B, and Schildhauer Mark P. “Challenges and opportunities of open data in ecology.” Science 331, no. 6018 (February 2011):703-5 https://pubmed.ncbi.nlm.nih.gov/21311007/.
8. Segaran, Toby & Hammerbacher, Jeff. Beautiful data: the stories behind elegant data solutions. O’Reilly, 2009.
9. Schintler, Laurie A. (Laurie Anne), and Connie L. McNeely, eds. 2022. Encyclopedia of Big Data. 1st ed. Cham: Springer International Publishing, 2022. https://doi.org/10.1007/978-3-319-32010-6.
10. Kong, Hyoun-Joong. “Managing Unstructured Big Data in Healthcare System.” Healthcare Informatics Research 25, no.1 (January 2019): 1-2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372467/.
11. Mønsted T. “Achieving veracity: A study of the development and use of an information system for data analysis in preventive healthcare.” Health Informatics Journal 25(3) (2019):491-499. https://pubmed.ncbi.nlm.nih.gov/30198372/.
12. Liu, Yong-Chuan, Atefeh Farzindar, and Mingbo Gong. “Transforming Healthcare with Big Data and AI.” Charlotte, North Carolina: Information Age Publishing, Inc., 2020.
13. Sandhu, R and Sood, SK. “Scheduling of big data applications on distributed cloud based on QoS parameters.” Cluster Computing, 18 no. 2 (December 2014) doi:10.1007/s10586-014-0416-6.
14. Zhang, Feng, Min Liu, Feng Gui, Weiming Shen, Abdallah Shami, and Yunlong Ma. "A distributed frequent itemset mining algorithm using Spark for Big Data analytics." Cluster Computing 18, no. 4 (2015): 1493-1501.
15. Yi, Xiaomeng, Fangming Liu, Jiangchuan Liu, and Hai Jin. "Building a network highway for big data: architecture and challenges." IEEE Network 28, no. 4 (2014): 5-13.
16. Hugh, Oliver, and Jason Gardosi. “Use of Microsoft Power BI to Display Pregnancy Related Performance Statistics within NHS Trusts.” International Journal of Population Data Science 8 (2) (2023). https://doi.org/10.23889/ijpds.v8i2.2342.
17. Iyamu, Tiko. Advancing Big Data Analytics for Healthcare Service Delivery, 1st ed. Routledge. New York, 2023.
18. Jagadish, Hosagrahar V, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. "Big data and its technical challenges." Communications of the ACM 57, no. 7 (2014): 86-94.
19. Dash, Sabyasachi, Sushil Kumar Shakyawar, Mohit Sharma, and Sandeep Kaushik. “Big Data in Healthcare: Management, Analysis and Future Prospects.” Journal of Big Data 6 (1) (2019): 1–25. https://doi.org/10.1186/s40537-019-0217-0.
20. Senthilkumar, S. A., Bharatendara K. Rai, Amruta A. Meshram, Angappa Gunasekaran, and S. Chandrakumarmangalam. "Big data in healthcare management: a review of literature." American Journal of Theoretical and Applied Business 4, no. 2 (2018): 57-69.
21. Murdoch, Travis B, and Detsky, Allan S. “The inevitable application of big data to health care.” JAMA, 309, no. 13 (2013): 1351–1352. https://doi.org/10.1001/jama.2013.393.
22. Zeng, Jing, and Glaister, Keith W. “Value creation from big data: Looking inside the black box.” Strategic Organization 16, no. 2 (2018): 105–140. https://doi.org/10.1177/1476127017697510.
23. Chen, Jinchuan, Yueguo Chen, Xiaoyong Du, Cuiping Li, Jiaheng Lu, Suyun Zhao, and Xuan Zhou. "Big data challenge: a data management perspective." Frontiers of Computer Science 7, no. 2 (2013): 157-164.
24. Agrawal, R. and Prabakaran, S. “Big data in digital healthcare: lessons learnt and recommendations for general practice.” Heredity 124 (2020): 525–534. https://doi.org/10.1038/s41437-020-0303-2.
25. Wang, C. Jason, Chun Y Ng, and Robert H Brook. 2020. “Response to COVID-19 in Taiwan: Big Data Analytics, New Technology, and Proactive Testing.” The Journal of the American Medical Association 323 (14): 1341–42. https://doi.org/10.1001/jama.2020.3151.
26. Simon, Gregory E. 2019. “Big Data from Health Records in Mental Health Care: Hardly Clairvoyant But Already Useful.” JAMA Psychiatry (Chicago, Ill.) 76 (4): 349–50. https://doi.org/10.1001/jamapsychiatry.2018.4510.
27. Golbus, Jessica R, W Nicholson Price, and Brahmajee K Nallamothu. “Privacy Gaps for Digital Cardiology Data: Big Problems with Big Data.” Circulation (New York, N.Y.) 141 (8) (2020): 613–15. https://doi.org/10.1161/CIRCULATIONAHA.119.044966.
28. Frakt, AB, and Pizer, SD. “The promise and perils of big data in healthcare.” The American Journal of Managed Care 22, no. 2 (2016): 98–99.
29. Sivarajah, U., Kamal, MM, Irani, Z., and Weerakkody, V. “Critical analysis of Big Data challenges and analytical methods.” Journal of Business Research. 70 (2017); 263-286.
30. Patton, Michael Quinn. Qualitative Research and Evaluation Methods, 3rd ed. Sage Publications, Inc. 2002, p. 339-427.
31. Creswell, John W. Research Design, 3rd ed. Sage Publications, Inc. 2009, p. 173-200.
32. Ritchie, Jane and Lewis, Jane. Qualitative Research Practice: A Guide for Social Science Students and Researchers, 1st ed. Sage Publications, Inc. 2003, p. 173-200.
33. Ishak, NM and Bakar, AYA. “Qualitative data management and analysis using NVivo: An approach used to examine leadership qualities among student leaders.” Education Research Journal. Vol 2.(3) (March 2012): 94-103.
34. Rapport, Frances. “Summative Analysis: A Qualitative Method for Social Science and Health Research.” International Journal of Qualitative Methods. (September 2010). https://doi.org/10.1177/160940691000900303.
35. Riaz Ahmed, Sumayya Shaheen, and Simon P. Philbin. “The role of big data analytics and decision-making in achieving project success.” Journal of Engineering and Technology Management, 65, (July–September 2022): 101697. https://doi.org/10.1016/j.jengtecman.2022.101697.
36. Dobre, Ciprian and Xhafa, Fatos. “Intelligent services for Big Data science.” Future Generation Computer Systems. 137 (July 2014): 267-281. https://www.sciencedirect.com/science/article/abs/pii/S0167739X13001593.
37. Batko, K., & Ślęzak, A. “The use of Big Data Analytics in healthcare.” Journal of big data, 9(1), (2022): 3. https://doi.org/10.1186/s40537-021-00553-4.
38. Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35, no. 2 (2015): 137-144.
39. Frost and Sullivan White Paper. “Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations.” Accessed October 3, 2023. https://www.academia.edu/6563567/A_Frost_and_Sullivan_White_Paper_Drowning_in_Big_Data_Reducing_Information_Technology_Complexities_and_Costs_For_Healthcare_Organizations_CONTENTS.
40. Kim, Gang-Hoon, Trimi, Silvana, and Chung, Ji-Hyong. “Big-Data Applications in the Government Sector.” Communications of the ACM 57 (2014): 78-85. https://dl.acm.org/doi/10.1145/2500873.
41. NORC at the University of Chicago & AHIMA. “Health Information Workforce: Survey Results on Workforce Challenges and the Role of Emerging Technologies.” October 2023. https://www.norc.org/research/projects/workforce-challenges-technology-adoption-health-information-professionals.html.
42. Sørensen K. “From Project-Based Health Literacy Data and Measurement to an Integrated System of Analytics and Insights: Enhancing Data-Driven Value Creation in Health-Literate Organizations.” Int J Environ Res Public Health. 2022 Oct 14;19(20):13210. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9603602/.
43. Lubasch JS, Voigt-Barbarowicz M, Lippke S, et al. “Improving professional health literacy in hospitals: study protocol of a participatory codesign and implementation study.” BMJ Open 11 (2021): e045835. https://pubmed.ncbi.nlm.nih.gov/34400444/.
44. Oachs P & Watters AL. Health Information Concepts, Principles and Practice, 6th ed. AHIMA Press, 2020.
45. NORC at the University of Chicago & AHIMA. “Social Determinants of Health Data: Survey Results on the Collection, Integration, and Use.” February 2023. https://www.norc.org/content/dam/norc-org/pdf2023/AHIMA-Workforce-Survey-Report-Final-2023.pdf.
46. Chan, Chien-Lung, and Chi-Chang Chang. "Big Data, Decision Models, and Public Health" International Journal of Environmental Research and Public Health 19(14) (2022): 8543. https://doi.org/10.3390/ijerph19148543.
Author Biographies
Egondu R. Onyejekwe, PhD, MSc., MA., MA., MS. Engr., is a director of the Serve, Educate, Elevate (S.E.E.) program. She currently works for the National Council of Negro Women Inc., Columbus, Ohio Section (NCNWCSO). Prior to this, Onyejekwe was a faculty member at Walden University.
Dasantila Sherifi, PhD, MBA, RHIA, is an assistant professor and program director for Health Information Management at Rutgers University. She received her Doctor of Philosophy degree in Health Services with specialization in Public Health Policy from Walden University and her Master’s in Business Administration from Southern Illinois University.
Hung Ching, PhD, DABR, is a senior medical physicist at Memorial Sloan Kettering Cancer Center. He received his Doctor of Philosophy degree in public health from Walden University and his master’s and bachelor’s degrees in physics from the State University of New York at Stony Brook. He is licensed to practice medical physics in New York and New Jersey. He is board-certified by the American Board of Radiology in the field of diagnostic medical physics.