Design and Development of a Summarization Service Prototype of Personal Health Record Medical Data with Fast Healthcare Interoperability Resources Structure Using Large Language Models

Nam-Gyu Lee and Seung-Hee Kim

Abstract: As the scope of healthcare data expands beyond hospital-generated electronic medical records (EMRs) to include personal health records (PHRs), there is a growing need for automated methods to efficiently summarize large volumes of patient-specific information. In this study, we propose a summarization approach that leverages large language models (LLMs) and standardized data formats to improve accessibility and usability of patient data. Specifically, we developed a prototype system that summarizes patient Bundles formatted in accordance with the Fast Healthcare Interoperability Resources (FHIR) standard. Using the ChatGPT API and document processing techniques, we generated summaries and evaluated their accuracy using a checklist based on clinical criteria. The summarization model achieved an accuracy of 81.6%, suggesting its potential for real-world application. Our findings indicate that healthcare professionals can more quickly and effectively review a patient's primary conditions using summarized PHR data, particularly as FHIR adoption increases. However, the results also highlight certain limitations, including the generalization of summaries and the absence of domain-specific fine-tuning. These findings underscore the importance of future research involving multidisciplinary clinical evaluations, targeted fine-tuning strategies, and question-driven summarization to enhance accuracy and clinical relevance. Overall, this study demonstrates the feasibility of integrating LLM-based summarization into healthcare workflows, contributing to improved interoperability and decision-making in clinical settings.
Keywords: Electronic Medical Record, Fast Healthcare Interoperability Resources, Generative Artificial Intelligence, Large Language Model, Personal Health Record, ChatGPT

1. Introduction

In most countries, healthcare environments have steadily progressed with the development of electronic medical record (EMR) systems [1], the establishment of data exchange standards, and improvements in data quality. Consequently, technologies that enable rapid comprehension and processing of vast amounts of data have become increasingly important in the healthcare field. In particular, data recorded by patients using smartwatches or digital therapeutics, as well as linear data from continuous glucose monitors, are accumulating to form a foundation for precision medicine based on personal health record (PHR) data. However, the growing task of individually reviewing and verifying such large volumes of data is an additional burden on healthcare professionals. Compounding this issue, the average consultation time for doctors in South Korea is only 5 minutes, which is notably short [2]. Furthermore, disparities in the utilization of medical data may arise depending on the size of the healthcare institution, potentially leading to various side effects. The South Korean government has implemented a plan to facilitate data linkage between hospitals in order to curb rising national health insurance premiums and reduce wasteful health insurance spending, which was projected to increase by 9.3% through 2023 owing to re-examinations [3] performed at each hospital and duplicate examinations of patients visiting different hospitals. This data linkage is based on the Fast Healthcare Interoperability Resources (FHIR) data structure.
FHIR is a standard framework for exchanging medical data structured in JavaScript Object Notation (JSON), Extensible Markup Language (XML), Unified Modeling Language (UML), newline-delimited JSON (ND-JSON), or Resource Description Framework (RDF) formats. A primary challenge with the FHIR data structure is the significant increase in the volume of data that healthcare professionals must review, depending on the patient's condition. For instance, if a patient has been in an intensive care unit for 2 weeks, over 1,200 test results, vital signs, and medication records are compiled into FHIR resources and converted into JSON format. As a result, the resulting file can be approximately 200 Mbytes in size, representing a substantial amount of data to manage. This could be considered a medical risk from the perspective of healthcare institutions and professionals who need to provide faster and more accurate medical services. In particular, in the case of tertiary hospitals, healthcare professionals must treat many patients in a short period. Consequently, using the FHIR data structure can pose a significant burden on them; hence, resistance from healthcare professionals can also be anticipated. Therefore, developing technology that can simplify medical data information by interpreting the FHIR data structure is necessary. In this study, we designed a technology that summarizes patients' health information based on FHIR data and developed a prototype to validate its feasibility. To achieve this, we proposed a method that transforms FHIR data into single-row data with specific patterns and then converts the data into user-readable text using a large language model (LLM). The proposed technology can generate briefing materials that offer insights into patients' conditions and characteristics from vast amounts of FHIR data.
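To make the nesting that drives these file sizes concrete, the following minimal sketch shows a heavily simplified, hypothetical FHIR Bundle as a Python dictionary serialized to JSON. The resource values are invented for illustration; real US-Core Bundles carry far more metadata, extensions, and references per resource.

```python
import json

# A heavily simplified, hypothetical FHIR Bundle with a single Patient entry.
# Real Bundles repeat this entry structure for every Condition, Observation,
# MedicationRequest, etc., which is what inflates file sizes.
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {
            "resource": {
                "resourceType": "Patient",
                "gender": "female",
                "birthDate": "1958-03-12",
            }
        }
    ],
}

serialized = json.dumps(bundle)
print(len(serialized))  # serialized size in characters
```

Even this toy Bundle spends most of its bytes on structural keys rather than clinical values, which previews the refining problem addressed in Section 3.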
In addition to supporting the government's policy to promote and facilitate data linkage among healthcare institutions, it can proactively eliminate potential medical risks and major obstacles that healthcare professionals may encounter in the field. Consequently, this technology may contribute to the stable implementation of the policy and enhance the quality of healthcare services.

2. Preliminary Research

2.1 Background Research

For this study, we investigated recent domestic and international research on the clinical applications and standardization of FHIR-formatted data. In addition, we reviewed studies on the significance of FHIR data standardization in medical data analysis and utilization. In [4], a data analysis framework supporting clinical statistics through an FHIR data model was developed, and various workflows using FHIR APIs were designed to facilitate patient-centered and cohort-based interactive analysis across multiple hospital information systems. In [5], a pipeline (FHIR-DHP) for querying standardized clinical data was established. This pipeline involved transforming hospital records into artificial intelligence (AI)-friendly data through a process of data querying, FHIR mapping, syntax validation, transferring data to a patient model database, and exporting data for AI applications, thereby providing a standard for modeling clinical datasets. In [6] and [7], research was conducted on how the FHIR standard enhanced the large-scale deployment, security, and interoperability of medical data—particularly how FHIR APIs could access, analyze, and apply medical data using AI. This study utilized a structure similar to the Hooks method defined in Scope 2 of the aforementioned study. The studies proposed various standardization methods necessary for utilizing FHIR data. Moreover, they explored how AI-based personalization could innovate the analysis and utilization of medical data through international research related to general medical data.
Subsequently, the following studies utilized FHIR-formatted data. In [7], research was conducted to improve interoperability by converting three-dimensional medical images into JSON format and incorporating clinical information using an HL7 FHIR server and client information system. In [8], an FHIR-based PHR system was developed, enabling users to access the system via smartphones. The system was implemented and verified using dedicated FHIR Open APIs. In [9], an FHIR-based depression scale questionnaire was implemented using a chatbot, and its interoperability and usability were validated. These studies proposed methods to maximize the convenient utilization of FHIR data. Next, the following studies serve as references for proposing policies and directions for the application of AI-based medical services. In [10], an analysis was conducted on how technological, ethical, and regulatory concerns influence the use of AI tools, focusing on expectations for AI tools and consumer perceptions and acceptance of these technologies. In [11], a range of topics, including bioinformatics, clinical informatics, and medical imaging informatics, were covered; the study examined how AI could enhance patient-centered diagnosis and treatment, providing a systematic approach to how generative AI could bring innovative changes to personalized healthcare. In [12], the study reviewed AI’s contribution to clinical decision-making, hospital operations optimization, medical image analysis, and patient management and monitoring in healthcare settings. Additionally, the ethical issues and data privacy concerns associated with AI integration, as well as the importance of addressing data bias, were discussed. In [13], topic modeling and bibliometric analysis were performed on multimodal data analysis and AI applications for smart healthcare, identifying various research topics related to AI and multimodal data convergence in medical data analysis. 
These studies provide insights into how the combination of AI and medical services may drive societal changes in the future and shape the development of relevant policies. Finally, in [14], a model utilizing deep learning technology was developed to predict medical events such as in-hospital mortality, 30-day readmission, and duration of stay. The predictions were based on data collected from hospitalized patients within the first 24 hours of admission. Comparing the themes and research methods of the aforementioned studies with those of this study, references [4] and [5] focus on dataset construction and standardization techniques for data analysis. By contrast, this study introduces a method for data preprocessing by applying the FHIR standard and document embedding technology. Additionally, studies [6-9] emphasize approaches to enhance signature processing, interoperability, and usability in large-scale data processing systems that require security. However, this study serves as a tool to verify that the LLM summary model meets these criteria through the ChatGPT application programming interface (API). Furthermore, the focus of this study on improving consumer-centered medical service systems aligns with the important themes commonly recognized in studies [10-12], which address policy direction and the ethical use of data, thereby underscoring the significance of this research. Moreover, studies [13,14] highlight limitations in applying technology to real-world practice, particularly concerning liability issues arising from prediction uncertainties. These previous studies provided a foundation for analyzing the ethical use of medical data, technical transmission structures, and trends in standardization technologies. The key distinction of this study from existing research is its implementation of personalized services using LLM technology to practically leverage standardized medical data in FHIR format while evaluating its practical applicability.
In contrast to previous studies that primarily focused on predictive models based on medical data, this study focuses on summarizing individual PHR data for each patient. To date, no solution has been developed to summarize PHR data based on FHIR standards; by focusing on summarization rather than prediction, this study avoids the liability issues related to medical decision-making present in previous research and offers greater flexibility for practical application.

2.2 Open Data Structure of Substitutable Medical Applications Reusable Technology (SMART)

Substitutable medical applications reusable technology (SMART) is a framework designed to enhance the interoperability and reusability of health information and healthcare technologies related to EMR systems, facilitating the seamless integration of various applications and technologies within healthcare systems. The development of SMART was led by Harvard University in the United States, in collaboration with the US Department of Health and Human Services (HHS)—which manages healthcare, welfare, and public health policies, similar to the role of the Ministry of Health and Welfare in South Korea—and for-profit organizations such as EPIC HealthCare [15]. Key features of SMART include interoperability, which allows smooth communication and data exchange between different healthcare applications and systems; reusability, which ensures that healthcare applications and technologies can be deployed across multiple environments; modularity, which permits different components of a healthcare system to function independently while enabling them to be combined or replaced as needed; and standardization, which maintains compatibility and data consistency by adhering to standardized protocols and data formats [16]. An FHIR resource refers to data from each medical domain, and among these, the Bundle resource is one of the most critical elements.
The Bundle resource is a container that aggregates multiple resources [17], enabling the grouping of various resources into a single instance.

Table 1. Structure of an FHIR Bundle
For example, Document Bundles are used for clinical documents such as discharge summaries or progress notes; Message Bundles are employed for system messaging, including alerts or updates; and Transaction Bundles facilitate atomic operations, where multiple resources are created, modified, or deleted in a single transaction. Therefore, understanding the structure of Bundles is essential for the effective implementation and utilization of FHIR in medical systems. The structure of an FHIR Bundle includes elements such as Bundle type, Identifier, Timestamp, Total, Link, and Entry. Table 1 lists details of each element's concept and role [18]. Fig. 1 provides a partial example of a patient information dataset defined according to the FHIR Bundle structure.

3. Research Procedure and Questionnaire

As shown in Fig. 2, the algorithm of the proposed medical information summarization service operates as follows: when data are input in an FHIR Bundle-based data structure, they first pass through a filter and undergo preprocessing. Subsequently, the Embedding Document is utilized to create a summarization request to the ChatGPT API.

3.1 Data Refining

Medical data are generally imbalanced, with a long-tail distribution (LTD) structure. Additionally, most FHIR Bundle resources for a patient include overhead because they define interoperable protocols and standards, in addition to the values that are readily usable in summarization services. For example, the content shown in Fig. 3 is a part of the Condition resource of a patient, structured according to the criteria defined in US-Core 6.1.0, which is the FHIR standard used in the United States. The diagnostic information that healthcare professionals can utilize is highlighted in yellow, whereas the remaining data primarily consist of resource framework-related information, code-based references, and system-generated identifiers.
Because the non-highlighted information is not meaningful in the clinical decision-making process, a functionality to filter such data during the preprocessing stage is required.

3.2 Data Preprocessing

3.2.1 Data wrangling, cleaning, and transformation

Various FHIR-based resources can be included in an FHIR Bundle Resource. These resources may encompass not only data converted from EMRs generated within medical institutions but also PHR data from users' assistive health devices. Therefore, in datasets consisting of refined FHIR resources, preprocessing is required for specific missing data, data unlikely to be associated with medical information based on the FHIR structure (such as billing data), and data with missing reference values. Furthermore, a process is conducted to consolidate multiple resources that are scattered across the patient's treatment dates, transform them into a single text format, and return the result.

3.2.2 Ethical considerations

In adherence to ethical, legal, and technical regulations and principles pertinent to FHIR-based public data research within the medical domain, data collection was limited to the minimum necessary, explicitly excluding any personally identifiable information. Furthermore, an additional authentication procedure was implemented during the data collection process from the FHIR server via the API, ensuring that the data were not utilized for purposes beyond this study. Consequently, the data collected in this study were exempt from ethical review, as there was no subject for privacy concerns or de-identification, thereby eliminating the risk of re-identification.

3.3 Composition of Learning Embedding Information

The purpose of summarizing medical data is to condense vast amounts of data. In this study, we utilized the LLM summarization method using the ChatGPT API.
However, because the ChatGPT model has primarily been trained on general data, important information might be omitted, or an unreliable summary or hallucination of data might be produced during the medical information summarization process. Therefore, in this study, we employed the model in its default configuration and concentrated on constructing Embedding Documents for the data intended for summarization. We designed the summarization service to enable the querying of the LLM with specific data for each segment, if necessary, prior to the summarization process.

3.4 Request for Information Summarization based on LLM

We developed a service utilizing the ChatGPT APIs for LLM-based transformation of medical data. Furthermore, the service was designed to adjust various hyperparameters, such as temperature settings, to optimize the API-based summarization model.

3.5 Design of Experimental Data

Article 21 of the Medical Service Act in South Korea stipulates that de-identified medical data must be exported with hospital approval and in accordance with proper procedures. In this study, we used publicly available FHIR resources [19], provided by SMART on FHIR, to bypass the need for these procedures. The publicly available data adhered to the FHIR R4 standard and were based on US-Core 6.1.0. Each Bundle was composed of randomly constructed patient EMR data. In this study, published FHIR data were employed to ensure data privacy, address bias, and meet the quality requirements of experimental data. During the data preprocessing phase in FHIR's Patient, all structurally personal information, except for gender and age, was removed.

3.6 Accuracy Verification

To verify accuracy, we created a checklist to quantitatively score the summarization results, as presented in Table 2. This checklist was constructed by referencing the Past Medical History guidelines published by the US Centers for Medicare and Medicaid Services [20].
We separated the Resource Reference into Encounter units within each participant's Bundle and established key accuracy verification criteria for resources such as Encounter, MedicationRequest, Observation, Condition, and AllergyIntolerance. These criteria were then organized into a checklist (Table 3). The grading criteria were determined based on the importance criteria from [20] and the Step 2 exam criteria of the United States Medical Licensing Examination (USMLE) [21].

4. Implementation of the Prototype and Validation

The prototype was developed according to the designed logic, and its feasibility was validated. Fig. 4 illustrates the operation sequence for the prototype.

Table 2. Summary grading checklist criteria by each resource
Table 3. Evaluation score sheet based on summary grading checklist criteria
Fig. 4. System architecture for the prototype of the proposed medical information summarizing service

Upon examining the detailed procedure, user requests were received via the web server and processed by the web application server (WAS). The WAS received FHIR Bundle data through the FHIR server and executed the designed data refinement and preprocessing processes. The processed request was then forwarded to the AI Request Server, where the results were generated. Finally, the results were returned to the user. The implementation environment for the prototype is summarized in Table 4.

Table 4. Implementation environment for the prototype of the medical information summarization service
4.1 Data Refining

To extract key elements from US-Core 6.1.0-based FHIR resources consisting of bundles, an entity list was created that defined the key extraction elements and their locations for each resource. Table 5 provides an example of the list of key elements and their locations among the entities in the Condition resource structure. When the key elements and their locations were defined for each resource, the library was converted through a logic that transformed the resources into Pandas-based DataFrames, enabling them to be divided into chunks by entity. To accomplish this, JMESPath was employed to construct ontology-based transformation expressions that converted the data into a single row for each resource, as listed in the example in Fig. 5.

Table 5. Core entity names and locations of the medical information of condition resource
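As a rough illustration of this refining step, the sketch below flattens one Condition resource into a single-row dictionary. The prototype used JMESPath expressions feeding Pandas DataFrames, so the plain-Python path traversal, column names, and sample resource values here are simplifying assumptions rather than the actual entity list.

```python
# Flatten a Condition resource into one row (a flat dict), keeping only the
# clinically meaningful fields and discarding framework overhead such as ids
# and meta. The chosen columns are illustrative, not the prototype's list.
def condition_to_row(resource: dict) -> dict:
    coding = resource.get("code", {}).get("coding", [{}])[0]
    status = resource.get("clinicalStatus", {}).get("coding", [{}])[0]
    return {
        "code": coding.get("code"),
        "display": coding.get("display"),
        "onset": resource.get("onsetDateTime"),
        "status": status.get("code"),
    }

# Hypothetical sample resource with invented values.
resource = {
    "resourceType": "Condition",
    "id": "abc-123",  # system-generated identifier, dropped by the flattener
    "clinicalStatus": {"coding": [{"code": "active"}]},
    "code": {"coding": [{"code": "38341003", "display": "Hypertension"}]},
    "onsetDateTime": "2019-06-01",
}

row = condition_to_row(resource)
print(row["display"])  # Hypertension
```

Rows produced this way can be collected into a DataFrame, one row per resource, which is the shape the subsequent preprocessing step consumes.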
4.2 Data Preprocessing

To utilize the DataFrames transformed during the data refining process as data summaries for LLMs, verifying the suitability of the data and adding any missing elements, such as reference values, is necessary. Additionally, a process that converts the DataFrame into a single string is required. After receiving the refined data, verification and supplementation are performed first. In this study, publicly available FHIR resources [19], provided by SMART on FHIR, were used as experimental data. The dataset consisted of 631 files with a total size of 764 Mbytes and an average file size of 1.21 Mbytes. The size of each file ranged from 184 kbytes to 7.5 Mbytes, containing medical data for each patient, which formed the LTD dataset. For the prototype verification of this study, the following processes were performed: verification of the FHIR-based elements, merging of resources that utilized alternative elements owing to the absence of code-type data, and elimination of resources without references between them. Additionally, preprocessing was performed on Observation Resource data where reference values were missing during medical tests and on MedicationRequest Resource data where daily dosage was not defined. After preprocessing, the DataFrame was divided into individual medical units. Key patient information, such as medication prescription, medical test, and surgery information, was then merged to generate a single string for each medical unit.

4.3 Composition of Learning Embedding Document Repository

The Embedding Document repository stores Embedding Documents. The AI Request API or framework utilizing this repository can call the corresponding database. For the development of the prototype, LangChain was used as the framework to call the AI API. The Embedding Document repository was constructed using ElasticSearch 8.1.5, and HayStack 2.2.4 was applied to utilize the configured Embedding Documents.
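The merging of refined rows into one string per medical unit, described in Section 4.2, might look roughly like the following. The column names, sample values, and separator are assumptions for illustration; only the grouping-and-joining pattern reflects the described process.

```python
import pandas as pd

# Toy refined DataFrame: one row per resource, tagged with its encounter.
# Values are invented for illustration.
df = pd.DataFrame([
    {"encounter": "enc-1", "resource": "Condition", "text": "Hypertension"},
    {"encounter": "enc-1", "resource": "MedicationRequest", "text": "Lisinopril 10 mg daily"},
    {"encounter": "enc-2", "resource": "Observation", "text": "HbA1c 6.9 %"},
])

def merge_encounter(group: pd.DataFrame) -> str:
    # Join every resource row of one medical unit into a single string.
    return " | ".join(f"{r.resource}: {r.text}" for r in group.itertuples())

# One SingleString per encounter, ready to be sent for summarization.
single_strings = {enc: merge_encounter(g) for enc, g in df.groupby("encounter")}
print(single_strings["enc-1"])
```

Each resulting string is what the summarization request in Section 4.4 receives as its input.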
Because the Embedding Document was designed to develop a prototype and summarize information across various patient groups rather than focusing on specific treatments, multiple data sources were utilized. These included the evaluation table in Table 2, a list of severe diseases (such as cancer and burns), rare and intractable disease data, the evaluation criteria for Korea's general health checkup results, Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) codes, Logical Observation Identifiers Names and Codes (LOINC) information, International Classification of Diseases (ICD-9) codes, and USMLE disease classification data. Fig. 6 illustrates the integration flowchart for Embedding Documents. A comprehensive retrieval of data is essential for various Embedding Documents, encompassing the level of detail and criteria of importance for each encounter. Consequently, the sub-resources of each encounter pertaining to the targeted patient were segmented into chunks, as depicted in Fig. 6, and the Embedding Framework was employed to search for each chunk.

4.4 Implementation of LLM-based Summarization Service

To implement the LLM-based summarization service, LangChain 0.1.0 was utilized, and the APIs were configured using FastAPI 0.112.0 to function as a server. The API request was designed such that the temperature parameter can be passed as a separate parameter in the API request. This was used to adjust the hyperparameter value of the probability distribution when performing the validation. For cases where the input data contained an excessive amount of medical data for the API, the sliding window algorithm was utilized to handle these situations. Additionally, to ensure proper operation of retrieval-augmented generation (RAG), the Embedding Document server was configured for installation.
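A minimal sketch of how such a summarization request might be assembled, with temperature exposed as a separate parameter as in the prototype's API design. The model name and prompt wording are assumptions, and only the request payload is constructed here; no network call is made.

```python
# Assemble a ChatGPT-style chat-completion payload for one SingleString.
# Model name and prompt text are hypothetical placeholders.
def build_summary_request(single_string: str, temperature: float = 0.0) -> dict:
    return {
        "model": "gpt-4o-mini",  # hypothetical model choice
        "temperature": temperature,
        "messages": [
            {
                "role": "system",
                "content": "Summarize the patient's key conditions, "
                           "medications, and test results.",
            },
            {"role": "user", "content": single_string},
        ],
    }

payload = build_summary_request(
    "Condition: Hypertension | Observation: BP 150/95 mmHg"
)
print(payload["temperature"])
```

Exposing temperature as a plain parameter is what allows the sweep between 0 and 1.0 described in the validation step that follows.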
In this study, accuracy was prioritized when optimizing the preprocessed single string (SingleString) transmitted via JMESPath, ensuring that it could be summarized according to the criteria and content of the Embedding Document. To achieve this, the temperature was set to 0 during the API request process. This setting was chosen because the summarization involved inferring key information from predefined data—such as Condition Codes, Observation Codesystems, and Results—rather than requiring any creative interpretation. To confirm this setting, the temperature value was varied between 0 and 1.0 across 275 patient encounters from the same 10 patients. Fig. 7 presents a graph that summarizes the verification results based on Table 3 and compares them with the scores.

4.5 Operation and Execution of FHIR Data Summarization Function

The Bundles, which contained medical summaries for each of the 631 participants, were divided into separate Encounter units, and the summary results were stored as individual data points. To verify the data, each Bundle was input into the FHIR server and configured for retrieval. For compiling the summaries of each participant, the Bundles were first divided into Encounters for each participant, resulting in 19,990 Encounters. Medical information was then summarized for each Encounter. Subsequently, the participants' main symptoms were classified by reviewing the 19,990 summaries to determine the severity of their condition and the number of Encounters. The conditions (disease information) linked to the 4,759 Encounters (medical records) were categorized as severe, moderate, or unspecified, based on the domestic insurance classification standards [22]. In this context, “unspecified” refers to codes for patient symptoms, such as Appendicitis (uncertain appendicitis), rather than detailed medical diagnoses.
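The participant-level severity grouping used in the evaluation (Section 4.6) can be sketched as follows. The rule that a severe-plus-moderate history maps to severe while a moderate-plus-unspecified history maps to unspecified is taken from the text; the function and label names are illustrative.

```python
# Roll a participant's encounter-level condition labels up to one category,
# mirroring the grouping rule described in the paper: any severe condition
# makes the participant "severe"; otherwise any unspecified code makes the
# participant "unspecified"; purely moderate histories remain "moderate".
def classify_participant(labels: list[str]) -> str:
    if "severe" in labels:
        return "severe"
    if "unspecified" in labels:
        return "unspecified"
    return "moderate"

print(classify_participant(["moderate", "severe"]))       # severe
print(classify_participant(["moderate", "unspecified"]))  # unspecified
```

This roll-up is what produces the three participant groups whose summary scores are compared in Table 6.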
All instances in which the same patient underwent multiple treatments were included, and the reliability of these results was corroborated at the practical level through the application of the t-distribution (Table 6).

4.6 Accuracy Verification of FHIR Data Summarization Function

The summary results for each treatment were quantified into scores based on the criteria outlined in Table 3. These treatment summary scores were then merged according to the participants' disease types and averaged (Table 6). The classification categories represent the groupings for each participant. Participants with both severe and moderate diseases were classified as severe, whereas those with both moderate and unspecified diseases were classified as unspecified. Fig. 8 presents the density plot of these data.

Table 6. Scores for summary results by disease category of the participants
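The t-distribution check mentioned above can be sketched with a pooled two-sample t-statistic. Whether the study used the pooled or Welch's variant is not stated, and the toy samples below are illustrative; on the real score data the reported result was t = 55.28, p < 0.001.

```python
from math import sqrt

# Pooled (equal-variance) two-sample t-statistic, computed from scratch so
# the formula behind the reported comparison of group score means is explicit.
def t_statistic(a: list[float], b: list[float]) -> float:
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))

# Toy samples, not the study's scores.
t = t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t, 4))  # -1.2247
```

The p-value then comes from the t-distribution with na + nb - 2 degrees of freedom, which is the "application of the t-distribution" referenced in the text.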
Based on the evaluation of summary results using the prototype, as presented in Table 6, participants with a history of moderate diseases received a relatively high evaluation score of 85.4 points. Conversely, participants with a more extensive and severe medical history received a lower evaluation score of 71.2 points. Additionally, participants with unspecified diseases, who had considerably less data, achieved the highest evaluation score of 88.2 points. Despite achieving a total accuracy of 81.6%, participants with severe diseases and substantial medical records received relatively lower scores. The confidence interval anticipated to encompass the actual parameters for each category of severe disease, moderate disease, and unspecified conditions at a 95% confidence level corresponded to the value of the confidence interval item presented in Table 6. The mean difference between the Severe and Moderate Diseases groups was substantial. An independent t-test was conducted to examine the statistical significance of this difference, yielding a t-statistic of 55.28 and a p-value of <0.001, which is below the significance threshold of α = 0.05. This indicates that the disparity in summary accuracy between the two groups was highly significant. Additionally, in the qualitative evaluation, certain results were overly generalized. However, the results validate that despite the heavy volume of data owing to the standardized FHIR structure, healthcare professionals can easily and efficiently review the primary diseases and conditions using the PHR data of each patient. A psychiatry professor suggested that better results could be obtained if the comprehensive clinical data were organized by specific departments rather than summarized, and the Embedding Documents were defined according to these criteria.

5. Conclusion

In this study, we proposed a summarization technique for EMRs based on FHIR patient Bundles.
The summarization process was designed to be practically applicable in clinical environments and requires no user interface, enabling healthcare professionals to incorporate patient-generated health data directly into existing EMR systems. We constructed a checklist based on clinical criteria to evaluate summary accuracy and confirmed the technical validity of our method through quantitative assessments. Our findings suggest that the proposed approach can improve both the interoperability of medical data systems and the efficiency of clinical decision-making, thus contributing to enhanced healthcare delivery. Despite these promising outcomes, several limitations must be addressed. First, although the summarization model handled FHIR-related overhead effectively, participants with complex medical histories still generated overly lengthy summaries, even after embedding key data criteria. This limitation highlights the constraints of using general-purpose language models, which are not fine-tuned for clinical context and may struggle with specialized medical terminology or decision relevance. Furthermore, our qualitative evaluation revealed that the summaries were sometimes overly generalized, lacking key clinical department-specific insights, which limits their applicability in domain-sensitive decision-making. These issues emphasize the need for fine-tuned, domain-adapted models and department-specific evaluation frameworks. To overcome these limitations, future research will explore fine-tuning strategies using diverse, domain-specific medical datasets. We plan to focus on selected clinical departments to enhance the relevance of embedded content and reduce summary redundancy. Additionally, rather than summarizing entire medical histories at once, we will investigate a question-driven summarization method that reflects how clinicians seek information during the care process. 
Building on this pilot study, future work will involve real-world clinical trials and usability assessments in collaboration with multidisciplinary medical teams. This will help establish a robust framework for evaluating summarization models in practical, diverse clinical settings and ultimately improve their reliability and adoption in healthcare.

Biography

Nam-Gyu Lee
https://orcid.org/0009-0006-5493-8039
He received a bachelor's degree from Hanbat National University, Korea. He studied in the Department of IT Convergence SW Engineering, Korea University of Technology and Education. Currently, he works in the Healthcare Platform Department at PHI Digital Healthcare. His research primarily focuses on software engineering, medical IT systems, and the integration of generative AI into real-world clinical applications.

Biography

Seung-Hee Kim
https://orcid.org/0000-0001-6312-9486
She received her bachelor's degree from Dongguk University, her master's degree from Yonsei University, and her doctoral degree in Industrial Information Systems from the Graduate School of IT Policy at Seoul National University of Science and Technology, Korea. She is currently an associate professor at Korea University of Technology and Education. Her primary research interests include software engineering, AI quality engineering, blockchain technologies, and project management based on generative AI.

References