Mapping interconnectivity of digital twin healthcare research themes through structural topic modeling

Topic selection
Figure 2 shows how the model-fit metrics changed with the number of topics. As the number of topics increased, the held-out likelihood and lower bound improved while the residuals decreased, indicating better model fit. However, semantic coherence declined, suggesting that a larger number of topics reduces interpretability. A balance was observed at seven or eight topics, where coherence remained stable and the other metrics continued to improve. The seven-topic model reflected broad healthcare challenges, whereas the eight-topic model revealed more specific themes, such as ethics and stability, with greater semantic independence, and was therefore preferred.
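As an illustration of the coherence metric discussed above, the following minimal sketch computes semantic coherence for one topic, assuming the document co-occurrence definition of Mimno et al. The function name and the toy corpus are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def semantic_coherence(top_words, documents):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over pairs of top words.

    top_words is ordered from most to least probable; D counts the
    documents containing all the given words. Every top word is assumed
    to occur in at least one document, so D(w_l) is never zero.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # l is ranked above m
        w_l, w_m = top_words[l], top_words[m]
        score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


# Hypothetical toy corpus of tokenized abstracts.
docs = [
    ["patient", "clinical", "model"],
    ["patient", "clinical"],
    ["network", "cloud"],
    ["network", "cloud", "patient"],
]
coherent = semantic_coherence(["patient", "clinical"], docs)
mixed = semantic_coherence(["patient", "cloud"], docs)
```

Top words that tend to appear in the same documents score closer to zero, and topics whose top words rarely co-occur score more negatively. This is why coherence tends to fall as topics are split more finely.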

Metric values according to the number of topics.
Topic labels
Table 1 presents the high-probability words and the assigned name for each topic, based on term coherence and literature review. Topic 1 focused on the architecture of the information technology infrastructure, encompassing data networks and cloud computing frameworks. Terms such as “network,” “device,” and “architecture” were related to data management systems and cloud computing infrastructure components. Topic 2 emphasized ML paradigms, algorithmic methodologies, and model application purposes. Terms including “learn,” “model,” and “algorithm” reflected advancements in learning frameworks and analytical techniques. Topic 3 centered on metaverse and virtual reality (VR) technologies, with lexical items such as “metaverse,” “virtual,” and “immersive” highlighting experiential dimensions in digital healthcare environments. Topics 4 and 5 addressed personalized medicine in digital healthcare and security solutions in healthcare workflows, respectively. Topic 6 focused on digital transformation in manufacturing sectors, with terminology including “manufacture,” “industrial,” and “transformation” signifying technological advancements related to smart manufacturing and Industry 4.0 initiatives. Topic 7 addressed robotic systems predicated on human-centered design principles, while Topic 8 was related to ethical considerations and future trajectories in healthcare research.

Tree maps of the topics showing those with the highest FREX scores.
We employed FREX scores for comprehensive topic interpretations, visualized in Fig. 3 as a tree map of high-scoring terms for each topic. The FREX methodology balances word frequency and exclusivity parameters to identify terminology that optimally represents thematic content through both its prominence and distinctiveness. The analysis of Topic 1 revealed distinctive terms including “compute,” “cloud,” and “infrastructure,” emphasizing characteristics of cloud computing. Topic 2 featured terms such as “accuracy,” “algorithm,” “detection,” and “prediction,” highlighting concepts used to evaluate ML models. Topic 3 incorporated “reality,” “education,” and “experience,” reflecting research in immersive metaverse applications. Topic 4 included “patient,” “clinical,” and “personalized,” indicating a focus on personalized medicine. Topic 5 emphasized terminology related to secure medical data transmission, focusing on healthcare communications security. Topics 6 and 7 illustrated DT implementation scenarios in industrial contexts, while Topic 8 incorporated “literature,” “ethical,” and “review,” demonstrating a focus on ethical considerations in healthcare research. The thematic content was initially derived from the highest-probability words, with refinement through the FREX analysis. The topic classification and labeling were validated in consultation with two domain experts in healthcare, DTs, and informatics.
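The balance between frequency and exclusivity can be sketched as follows. This is an illustrative approximation, assuming FREX is the weighted harmonic mean of a word's within-topic empirical CDF rank for exclusivity and for probability. The weight `w` (applied here to exclusivity), the function name, and the toy matrix are assumptions and are not taken from the study.

```python
def frex_scores(beta, w=0.5):
    """Return FREX[k][v] for a topic-word matrix beta[k][v] = P(word v | topic k).

    w is the weight placed on exclusivity (an assumption of this sketch).
    """
    n_topics, n_words = len(beta), len(beta[0])
    col_sums = [sum(beta[k][v] for k in range(n_topics)) for v in range(n_words)]

    def ecdf(values):
        # Fraction of entries less than or equal to each value.
        return [sum(u <= x for u in values) / len(values) for x in values]

    scores = []
    for k in range(n_topics):
        # Exclusivity: share of a word's total probability held by topic k.
        exclusivity = [beta[k][v] / col_sums[v] for v in range(n_words)]
        e_rank = ecdf(exclusivity)
        f_rank = ecdf(beta[k])
        scores.append(
            [1.0 / (w / e + (1.0 - w) / f) for e, f in zip(e_rank, f_rank)]
        )
    return scores


# Hypothetical two-topic example: "data" is frequent in both topics.
vocab = ["patient", "clinical", "cloud", "data"]
beta = [
    [0.30, 0.20, 0.05, 0.45],
    [0.05, 0.05, 0.45, 0.45],
]
frex = frex_scores(beta)
top_by_probability = vocab[max(range(len(vocab)), key=lambda v: beta[0][v])]
top_by_frex = vocab[max(range(len(vocab)), key=lambda v: frex[0][v])]
```

In this toy example, the most probable word in the first topic is the shared term "data", whereas the highest FREX score goes to "patient", which is both frequent in and largely exclusive to that topic.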
Topic proportion
Figure 4 visualizes the distribution of topic proportions as a bubble chart. Topic 4, which focused on personalized medicine in digital healthcare, constituted 0.17 of the corpus, making it the most prominent topic in the data set. Topic 8, which explored ethical considerations and future research trajectories, represented 0.15 of the corpus and was the second most common topic in the analyzed literature. Conversely, Topics 1 and 5 had the smallest proportions, indicating that these domains have received less attention in previous healthcare DT research.
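A corpus-level topic proportion of this kind is commonly obtained by averaging the per-document topic proportions. The sketch below assumes that definition; the matrix `theta` and its values are hypothetical and do not reproduce the study's data.

```python
def expected_topic_proportions(theta):
    """Average each topic's share over documents (theta[d][k], rows sum to 1)."""
    n_docs, n_topics = len(theta), len(theta[0])
    return [sum(doc[k] for doc in theta) / n_docs for k in range(n_topics)]


# Hypothetical document-topic matrix: three documents, three topics.
theta = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
]
proportions = expected_topic_proportions(theta)
```

Because every row of `theta` sums to one, the averaged proportions also sum to one across topics.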

Bubble chart of the topic proportions.

Topic proportions according to year.

Radar chart of the relationships between topics.
Figure 5 visualizes proportional temporal shifts in the topic distribution. In each panel, the solid red line represents the estimated temporal trajectory of the topic prevalence across publication years, derived from the STM incorporating year as a covariate. The dashed red lines denote the pointwise 95% confidence interval bounds associated with this estimate, reflecting the uncertainty at each time point. Topics 2, 3, and 8 demonstrated consistent upward trajectories, indicating progressively increasing scholarly attention in the healthcare DT literature. Conversely, Topics 4 and 6 exhibited discernible declining trends in their proportions over the analyzed period, which suggests a redirection of research focus away from previously dominant domains. Topics 1, 5, and 7 maintained relatively constant proportions throughout the analyzed period, indicating sustained scholarly interest that neither increased nor decreased. These temporal shifts in the topic proportions provide insight into how research priorities in the healthcare DT domain have evolved, highlighting shifting paradigms and emerging areas of scholarly focus.
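The idea of a fitted trajectory with pointwise 95% bounds can be illustrated with a simplified stand-in. The sketch below fits an ordinary least-squares line of topic proportion on publication year and uses a normal-approximation interval; it is not the STM covariate estimation used in the study, and all names and values are hypothetical.

```python
import math


def linear_trend_with_ci(years, proportions, z=1.96):
    """Fit proportion = a + b * year and return (slope, bands).

    bands holds (year, fit, lower, upper) with a pointwise interval for the
    fitted mean. z = 1.96 is a normal approximation; with few points a
    t quantile would be more appropriate.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(proportions) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, proportions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(years, proportions)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    bands = []
    for x in sorted(set(years)):
        fit = intercept + slope * x
        se = math.sqrt(sigma2 * (1.0 / n + (x - mean_x) ** 2 / sxx))
        bands.append((x, fit, fit - z * se, fit + z * se))
    return slope, bands


# Hypothetical yearly proportions for a single topic.
years = [2019, 2020, 2021, 2022, 2023]
proportions = [0.10, 0.12, 0.13, 0.15, 0.17]
slope, bands = linear_trend_with_ci(years, proportions)
```

A positive slope corresponds to an upward trajectory and a negative slope to a declining one, while a slope close to zero matches the stable pattern. The interval is narrowest near the mean publication year and widens toward the ends of the period.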
Topic comparisons
Eight keywords were selected from the comprehensive topic set to analyze intertopic relationships, with associated weights calculated using the FREX and appearance-probability metrics. The polygons in the radar chart in Fig. 6 reflect both the magnitude and nature of the word-topic relationships.
The analyses demonstrated that the identified topics formed a complex network of interrelationships rather than existing as isolated entities. The data (network, device, and datum) and learning (model, learn, and algorithm) topics shared fundamental technical foundations, with infrastructure supporting data acquisition enabling the development of ML models. This foundation connects to the medicine (personalized medicine and treatment) topic, highlighting the role of data-driven analytics in healthcare contexts, exemplified by patient-specific data facilitating customized therapeutic interventions. The metaverse (metaverse and reality) and industry (smart and industry) topics exhibited strong interconnectivity regarding technological implementation. Immersive platforms in metaverse environments supported industrial smart manufacturing capabilities and digital transformation initiatives. These domains were connected to the systems (robot and sensor) topic, particularly regarding technologies that simultaneously support metaverse physical interaction and industrial automation. The security (security and blockchain) topic includes critical elements connected across all domains that protect the integrity of medical information, ensure automation reliability, and support virtual-environment functionality. Consequently, security considerations span both data-centric platforms and their application domains. The research (research and future) topic constitutes foundational elements underlying all thematic areas. Technological advancements are driven by continuous research initiatives, while ethical considerations provide directional guidance for development trajectories, exemplified by ethical medical data utilization emerging as significant in both personalized medicine and security research frameworks.
Topic correlations
We visualized topic correlations based on the topic modeling results to examine thematic interrelationships. The network graph in Fig. 7 represents topics as nodes, with connecting edges indicating positive correlations between them. The edges are annotated with correlation coefficients quantifying the thematic associations, while the size and color of each node reflect the topic centrality and significance. The “security solutions to improve data processes and communication in healthcare” topic occupies the central network position with multiple connections, highlighting its fundamental importance. This security-focused topic was strongly correlated (coefficient = 0.41) with the “cloud computing and data network architecture” topic, indicating substantial technical interdependence. The “cloud computing and data network architecture” and “machine-learning algorithms for accurate detection and prediction” topics function as bridging elements between technical domains and medical applications, suggesting significant cross-domain utility potential and facilitating technological implementation in healthcare contexts.
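A network of this kind can be sketched by correlating the per-document proportions of each topic pair and keeping only the positive correlations as edges. The example below assumes Pearson correlation and a zero cutoff; the function names, the cutoff, and the toy matrix are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def positive_correlation_edges(theta, cutoff=0.0):
    """Return {(i, j): r} for topic pairs whose correlation exceeds cutoff."""
    n_topics = len(theta[0])
    columns = [[doc[k] for doc in theta] for k in range(n_topics)]
    edges = {}
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            r = pearson(columns[i], columns[j])
            if r > cutoff:
                edges[(i, j)] = r
    return edges


# Hypothetical document-topic matrix: four documents, three topics.
theta = [
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
    [0.20, 0.20, 0.60],
    [0.10, 0.15, 0.75],
]
edges = positive_correlation_edges(theta)
```

In the toy matrix, the first two topics rise and fall together across documents and are linked by an edge, while the third topic moves in the opposite direction and stays unconnected. The number of edges attached to a node gives one simple reading of its centrality.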

Correlations between the topics.