Multimodal predictive models: A practical approach in medicine and education

22 May 2025
Key takeaways

HAIM improves medical predictions by combining diverse patient data.

MuDoC enhances learning by merging text and visuals.

Multimodal AI still faces technical, privacy, and interpretability challenges.

In the first part of this article, we explored the fundamentals of Multimodal Artificial Intelligence: what it is, how it works, and the main methods of data fusion. We looked at how multimodal systems can integrate different modalities — such as text, image, speech, or biometric signals — to create richer, more accurate, and more contextualised interactions. We also examined the technical challenges and advantages of this approach, which aims to bring AI’s perception and response capabilities closer to the way humans interact with the world.

In this second part, we will explore the role of multimodal AI in healthcare and education. Both sectors are being progressively transformed by AI, which is improving patient care and personalising learning. We will examine two examples to understand the benefits and challenges of implementing multimodal AI in these fields.

 

Use Case 1: Multimodal AI in Healthcare

Healthcare produces vast and diverse data in many formats, such as medical images, clinical notes, laboratory tests, and patient records. Combining these data types offers a more holistic view of a patient's condition. Multimodal AI systems are engineered to ingest and integrate these multiple data sources, resulting in better diagnoses and individualised treatment plans.

One example of how multimodal AI is used in healthcare is the Holistic AI in Medicine (HAIM) framework. HAIM combines different types of data (e.g., structured EHR data, time-series measurements, clinical notes, and medical imaging) to improve predictive modelling in healthcare. By integrating these data sources, HAIM has shown better results across various tasks, including disease identification and patient outcome prediction: multimodal HAIM predictive systems achieved average improvements of 9–28% over single-modality baselines across all evaluated tasks (Integrated multimodal artificial intelligence framework for healthcare applications, npj Digital Medicine).

HAIM combines data from multiple sources to create comprehensive patient profiles. Each profile includes structured data like demographics, lab results, and medication records; time-series data such as vital signs and other chronological measurements; unstructured text like clinical notes and reports; and medical images, including chest X-rays and associated imaging data. Each data type is processed separately to create numerical representations, known as embeddings:​

  • Structured data is normalised and transformed into numerical values.
  • Time-series data is summarised with statistical metrics that capture trends over time.
  • Text data is processed with pre-trained transformer models to produce fixed-size embeddings.
  • Image data is analysed with pre-trained convolutional neural networks to extract feature embeddings.

The individual embeddings from each modality are concatenated to form a comprehensive fusion embedding. This unified representation serves as input for predictive models, such as XGBoost, to perform tasks like disease diagnosis and patient outcome prediction.
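The per-modality embedding and fusion steps described above can be sketched as follows. This is a minimal illustration, not the HAIM implementation: the helper names and the sample patient values are hypothetical, real text and image embeddings would come from pre-trained transformer and CNN models, and the fused vector would then feed a classifier such as XGBoost.

```python
import numpy as np

def embed_structured(values, means, stds):
    """Normalise structured fields (e.g. age, lab results) into z-scores."""
    return (np.asarray(values, dtype=float) - means) / stds

def embed_timeseries(series):
    """Summarise a vital-sign series with statistics that capture its trend."""
    s = np.asarray(series, dtype=float)
    slope = np.polyfit(np.arange(len(s)), s, 1)[0]  # linear trend over time
    return np.array([s.mean(), s.std(), s.min(), s.max(), slope])

def fuse(*embeddings):
    """Concatenate per-modality embeddings into one fusion embedding."""
    return np.concatenate(embeddings)

# Hypothetical patient: two structured fields plus a heart-rate series.
structured = embed_structured([65.0, 7.2], np.array([60.0, 7.0]), np.array([10.0, 0.5]))
vitals = embed_timeseries([80, 82, 85, 90])
fusion = fuse(structured, vitals)  # 2 + 5 = 7-dimensional fusion embedding
print(fusion.shape)
```

In the full framework, the text and image embeddings would simply be two more arguments to `fuse`, which is what makes the design modular: adding a modality only extends the concatenated vector.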

Fig. 1: Integrated multimodal artificial intelligence framework for healthcare applications | npj Digital Medicine

 

Benefits:

  • Integrates diverse data modalities, creating more comprehensive patient profiles.
  • Consistently outperforms single-modality models, with improvements of 9% to 28% in healthcare tasks.
  • Supports various applications, including disease diagnosis and patient outcome prediction.
  • Modular design enables the addition of new data types, enhancing adaptability and scalability in clinical settings.

Challenges:

  • Requires sophisticated preprocessing and normalization to ensure compatibility across diverse data types.
  • Computational complexity can be resource-intensive, raising scalability concerns.
  • Demands stringent privacy and data security measures due to sensitive patient information.
  • Model interpretability remains challenging, affecting clinical trust and adoption.

 

Use Case 2: Multimodal AI in Education

Education is a natural fit for multimodal AI because learning materials often include a mix of text, images, graphs, and diagrams. Traditional educational AI tools have mainly worked with text, but by incorporating other forms of content (visual and interactive elements), multimodal systems can better reflect how humans learn. This results in more engaging and effective educational experiences that are tailored to diverse learning styles.

One of the most promising examples of this approach is the MuDoC system (Multimodal Document-grounded Conversational AI). MuDoC is designed to support learners by combining natural language processing and computer vision to analyse educational materials, including written text and visual elements. When a student asks a question, the system doesn’t just respond with plain text. Instead, it scans the source material, retrieves the relevant section, and provides a response that integrates the necessary text and images from the original document. This helps learners build stronger mental models and verify the AI’s answers directly in the learning materials, building transparency and trust.

Technically, MuDoC uses a language model (like GPT-4o) to process and generate natural language answers. At the same time, it applies computer vision techniques to parse visual content (such as diagrams, figures, and illustrations) embedded in learning documents. The system maps these different content types into a unified representation that allows it to select and combine them contextually. This process results in rich, grounded answers that go beyond what purely text-based AI systems can deliver. It creates a dynamic learning assistant that not only explains but also shows, supporting better understanding of complex subjects.
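The retrieval step in this document-grounded loop can be sketched as a toy example, under stated assumptions: the keyword-overlap scoring stands in for the embedding-based similarity a system like MuDoC would use, and the `Segment` structure and function names are hypothetical rather than MuDoC's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One chunk of a learning document: its text plus any figures
    the vision pipeline extracted from that region."""
    text: str
    figures: list = field(default_factory=list)

def score(query, segment):
    """Keyword-overlap score; a real system would use embedding similarity."""
    return len(set(query.lower().split()) & set(segment.text.lower().split()))

def answer(query, segments):
    """Retrieve the best-matching segment and return its text together with
    its figures, so the reply interleaves prose and the source's visuals."""
    best = max(segments, key=lambda seg: score(query, seg))
    return {"text": best.text, "figures": best.figures}

doc = [
    Segment("Photosynthesis converts light energy into chemical energy.",
            figures=["fig-3: chloroplast diagram"]),
    Segment("Cellular respiration releases energy from glucose."),
]
print(answer("how does photosynthesis convert light energy", doc)["figures"])
```

Because the retrieved figures come straight from the source document, the learner can check the answer against the original material, which is where the transparency benefit comes from.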

Fig 2. [2504.13884] Towards a Multimodal Document-grounded Conversational AI System for Education

 

Benefits:

  • Makes learning more engaging, increasing student interest and involvement.
  • Combines words and pictures to communicate information effectively.
  • Simplifies complex concepts like physics, biology, and math through visuals.
  • Enhances trust with clear visibility of answer sources.
  • Encourages deeper learning by inspiring curiosity.

Challenges:

  • Aligning words and pictures perfectly can be challenging, and mismatched visuals can cause confusion.
  • Ensuring accessibility for all students, including those with visual or learning difficulties, is essential.
  • Managing the simultaneous use of words and pictures requires significant computing power.

 

Conclusion

In summary, multimodal AI is transforming how machines understand and interact with the world by combining data from multiple sources like text, images, speech, and time-series signals.

The HAIM framework leverages this approach to create comprehensive patient profiles in healthcare, achieving performance improvements in tasks such as disease diagnosis and outcome prediction. However, it faces challenges, including the need for sophisticated data preprocessing, high computational demands, stringent privacy measures, and limited model interpretability, all of which are critical for clinical trust and scalability.

Similarly, in education, the MuDoC system uses multimodal AI to enhance student engagement, making learning more accessible and understandable through a combination of words and images. Yet, it must overcome challenges in aligning text and visuals accurately, ensuring accessibility for all learners, and managing high computational requirements.

As seen in the HAIM framework and in the MuDoC system, this approach enables more accurate predictions, deeper insights, and better user experiences. While challenges remain, the potential of multimodal AI to enhance decision-making, personalize experiences, and align more closely with human communication makes it a vital direction for the future of artificial intelligence.

Author

Marta Carreira

Consultant
