TAP-Fusion: Designing Fair Multi-Modal AI for Equitable Alzheimer’s Risk Detection
- Alina Huang
- Oct 15
- 3 min read
TAP-Fusion — A Multi-Modal Deep Learning Framework for Alzheimer’s Disease Risk Detection Using Clinical, Imaging, Speech, and Textual Data
Independent research by Anton Goncear
Abstract
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder, with diagnosis often delayed until significant cognitive decline has occurred. Current AI systems typically focus on a single modality (e.g., imaging, speech, or clinical data), limiting their ability to capture the full complexity of AD symptoms.
In this research, we propose TAP-Fusion, a prototype framework that integrates Tabular EHR data, Brain Imaging, Speech Audio, and Clinical Notes into a unified predictive model. The system leverages pretrained encoders—ResNet18 for images, DistilBERT for text, MFCC-based feature pooling for audio, and an MLP for tabular features—fused through a learnable classifier. We also introduce integral-based mathematical features to enhance prediction by summarizing signals across modalities.
Evaluation on synthetic multimodal datasets demonstrates the feasibility of TAP-Fusion for early detection of Alzheimer’s risk in clinical settings, and lays the groundwork for future deployment on real patient data.
Introduction
Alzheimer’s disease diagnosis remains challenging due to:
Fragmented data sources – Imaging (MRI, PET), clinical notes, lab tests, and speech analysis are rarely combined.
Late diagnosis – Early-stage symptoms such as word-finding pauses and microstructural brain changes are subtle and often missed.
Limited generalization – AI models trained on a single modality often fail to generalize across patient populations.
Recent research highlights the potential of multi-modal fusion. TAP-Fusion builds on this idea, providing a modular AI framework that can:
Detect speech anomalies (pauses, slower speech rate)
Capture structural brain patterns via pretrained CNNs
Incorporate EHR features (biomarkers, lab tests, demographics)
Interpret clinical text through transformer embeddings
Methodology
Architecture Overview
Image Encoder: ResNet18 pretrained on ImageNet (final FC layer replaced with identity)
Audio Encoder: MFCC transformation with mean/std pooling
Text Encoder: DistilBERT mean-pooled sentence embeddings
Tabular Encoder: 2-layer MLP
Fusion Layer: Concatenation of all feature vectors + integral-based features → dense classifier
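Below is a minimal, self-contained sketch of these four encoders, assuming standard pretrained weights; the hidden sizes, helper names, and input shapes are illustrative assumptions rather than the exact TAP-Fusion code:

import torch
import torch.nn as nn
import torchaudio
from torchvision import models
from transformers import DistilBertModel, DistilBertTokenizerFast

class TabularEncoder(nn.Module):
    """2-layer MLP over EHR features (hidden size 128 is an assumption)."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))
    def forward(self, x):
        return self.net(x)

def build_image_encoder():
    """ResNet18 pretrained on ImageNet; final FC layer replaced with identity -> 512-dim features."""
    m = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    m.fc = nn.Identity()
    return m

def encode_audio(waveform, sample_rate=16000, n_mfcc=40):
    """MFCCs with mean/std pooling over time -> 2*n_mfcc dims (80 for n_mfcc=40).
    Assumes mono waveform of shape (B, T)."""
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_mfcc)(waveform)  # (B, n_mfcc, T')
    return torch.cat([mfcc.mean(dim=-1), mfcc.std(dim=-1)], dim=1)  # (B, 2*n_mfcc)

def encode_text(sentences, device="cpu"):
    """DistilBERT mean-pooled sentence embeddings -> 768-dim features."""
    tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    bert = DistilBertModel.from_pretrained("distilbert-base-uncased").to(device).eval()
    enc = tok(list(sentences), padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # (B, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1)        # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling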
Integral-based Features
To enhance model performance, simple mathematical transformations summarize signals across modalities:
import torch
from scipy.integrate import simpson  # `simps` was renamed to `simpson` in recent SciPy releases

# Tabular integral: area under the feature "curve" for each sample
def integral_tab(x):
    area = simpson(x.cpu().numpy(), axis=1)  # shape (B,)
    return torch.tensor(area, device=x.device).float().unsqueeze(1)  # shape (B, 1)

# Audio integral: overall signal energy per sample
def integral_audio(x):
    return x.sum(dim=1, keepdim=True)  # shape (B, 1)

These features are appended to the fusion vector, giving the classifier additional summary statistics.
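As a quick sanity check, the snippet below (using a hypothetical batch of 4 patients with 10 tabular features and 80-dim pooled audio vectors) shows the shapes that get appended:

import torch

# Hypothetical batch: 4 patients, 10 EHR features, 80-dim pooled MFCC vectors
tab = torch.randn(4, 10)
audio = torch.randn(4, 80)
print(integral_tab(tab).shape)      # torch.Size([4, 1])
print(integral_audio(audio).shape)  # torch.Size([4, 1])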
Fusion Model
class TAPFusionModelWithIntegral(nn.Module):
    def __init__(self, tab_in_dim, tab_out=64, img_dim=512, audio_dim=80,
                 text_dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.tab_encoder = TabularEncoder(tab_in_dim, tab_out)
        fusion_dim = tab_out + img_dim + audio_dim + text_dim + 2  # +2 for the integral features
        self.classifier = nn.Sequential(
            nn.Linear(fusion_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, num_classes)
        )

    def forward(self, tab_x, img_x, audio_x, text_x):
        tab_feat = self.tab_encoder(tab_x)     # (B, tab_out)
        tab_int = integral_tab(tab_x)          # (B, 1)
        audio_int = integral_audio(audio_x)    # (B, 1)
        x = torch.cat([tab_feat, img_x, audio_x, text_x, tab_int, audio_int], dim=1)
        return self.classifier(x)

Prototype Dataset
Synthetic data simulate multi-modal inputs:
Random tabular values
Random brain-like images
Sinusoidal audio signals
Template clinical notes
Binary labels: 0 = healthy, 1 = Alzheimer risk (~30% positive)
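A minimal sketch of how one such synthetic sample could be generated is shown below; the generator name, shapes, and note text are illustrative assumptions, not the exact TAP-Fusion data pipeline:

import numpy as np
import torch
from PIL import Image

def make_synthetic_sample(n_tab_features=10, p_positive=0.3, seed=None):
    """Illustrative generator for one synthetic multi-modal sample."""
    rng = np.random.default_rng(seed)
    tab = torch.tensor(rng.normal(size=n_tab_features), dtype=torch.float32)         # random EHR values
    img = Image.fromarray(rng.integers(0, 256, (224, 224), dtype=np.uint8), "L")     # random brain-like image
    t = np.linspace(0, 1, 16000)
    audio = torch.tensor(np.sin(2 * np.pi * 220 * t), dtype=torch.float32)           # sinusoidal audio signal
    note = "Patient reports occasional word-finding pauses and mild forgetfulness."  # template clinical note
    label = int(rng.random() < p_positive)                                           # ~30% positive labels
    return tab, img, audio, note, label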
Training & Evaluation
Loss: CrossEntropy
Optimizer: Adam (lr=1e-4)
Split: 80/20 train/validation
Metrics: Accuracy
Observation: Even crude integral features slightly adjust model probabilities, demonstrating the potential of mathematical augmentations.
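A minimal training loop under these settings might look like the sketch below; the dataset object, batch size, epoch count, and tab_in_dim are assumptions rather than the exact training script:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

# Assumes `dataset` yields (tab, img_feat, audio_feat, text_feat, label) with the
# dimensions used above (512-dim image, 80-dim audio, 768-dim text features).
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)

model = TAPFusionModelWithIntegral(tab_in_dim=10)  # tab_in_dim=10 is an assumption
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    model.train()
    for tab, img, audio, text, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(tab, img, audio, text), y)
        loss.backward()
        optimizer.step()

    # Validation accuracy
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for tab, img, audio, text, y in val_loader:
            preds = model(tab, img, audio, text).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch}: val accuracy = {correct / total:.3f}")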
Example Inference (Brain Image + Dummy Features)
img_feat = encode_image_integral(img_pil)  # integral-enhanced image feature
probs = F.softmax(model(demo_tab, img_feat, demo_audio, demo_text), dim=1).cpu().numpy()[0]
print("Prediction probabilities:", {"healthy": probs[0], "alzheimer_risk": probs[1]})

Results:
Using a sample test brain image:
=== Original Model ===
healthy: 0.446
alzheimer_risk: 0.554
=== Integral-enhanced Model ===
healthy: 0.450
alzheimer_risk: 0.550
Even these small shifts show that integral-based features modulate the model's predictions; whether they actually improve accuracy remains to be confirmed on real data.
Conclusion
TAP-Fusion demonstrates a multi-modal, mathematically augmented deep learning framework for early Alzheimer’s risk detection. By integrating imaging, audio, text, and tabular EHR data, and incorporating simple integral-based features, it provides a modular, extensible prototype that lays the groundwork for real-world clinical research.
Future work includes:
Applying TAP-Fusion to real patient datasets
Incorporating longitudinal data
Using explainability techniques (e.g. SHAP, Grad-CAM) to highlight predictive modalities and features


