

TAP-Fusion — A Multi-Modal Deep Learning Framework for Alzheimer’s Disease Risk Detection Using Clinical, Imaging, Speech, and Textual Data

Independent research by Anton Goncear

Abstract

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder, with diagnosis often delayed until significant cognitive decline has occurred. Current AI systems typically focus on a single modality (e.g., imaging, speech, or clinical data), limiting their ability to capture the full complexity of AD symptoms.


In this research, we propose TAP-Fusion, a prototype framework that integrates Tabular EHR data, Brain Imaging, Speech Audio, and Clinical Notes into a unified predictive model. The system leverages pretrained encoders—ResNet18 for images, DistilBERT for text, MFCC-based CNN pooling for audio, and an MLP for tabular features—fused through a learnable classifier. We also introduce integral-based mathematical features to enhance prediction by summarizing signals across modalities.


Evaluation on synthetic multimodal data demonstrates the technical feasibility of the TAP-Fusion pipeline for Alzheimer's risk prediction and lays the groundwork for future evaluation on real patient data.


Introduction

Alzheimer’s disease diagnosis remains challenging due to:

  • Fragmented data sources – Imaging (MRI, PET), clinical notes, lab tests, and speech analysis are rarely combined.

  • Late diagnosis – Early-stage symptoms such as word-finding pauses and microstructural brain changes are subtle and often missed.

  • Limited generalization – AI models trained on a single modality often fail to generalize across patient populations.


Recent research highlights the potential of multi-modal fusion. TAP-Fusion builds on this idea, providing a modular AI framework that can:

  • Detect speech anomalies (pauses, slower speech rate)

  • Capture structural brain patterns via pretrained CNNs

  • Incorporate EHR features (biomarkers, lab tests, demographics)

  • Interpret clinical text through transformer embeddings


Methodology
Architecture Overview
  • Image Encoder: ResNet18 pretrained on ImageNet (final FC layer replaced with identity)

  • Audio Encoder: MFCC transformation with mean/std pooling

  • Text Encoder: DistilBERT mean-pooled sentence embeddings

  • Tabular Encoder: 2-layer MLP

  • Fusion Layer: Concatenation of all feature vectors + integral-based features → dense classifier
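
For concreteness, the encoders could be instantiated as in the sketch below using torchvision, transformers, and torchaudio. The MFCC settings (40 coefficients at 16 kHz) are assumptions chosen so that mean/std pooling yields the 80-dimensional audio vector the fusion layer expects, and the masked mean pooling is one common reading of "mean-pooled sentence embeddings":

import torch
from torchvision.models import resnet18, ResNet18_Weights
from transformers import DistilBertModel, DistilBertTokenizerFast
import torchaudio

# Image encoder: ResNet18 pretrained on ImageNet, final FC replaced with identity -> 512-d
img_encoder = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
img_encoder.fc = torch.nn.Identity()

# Text encoder: DistilBERT with masked mean pooling over tokens -> 768-d
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

def encode_text(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state             # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # mean over real tokens only

# Audio encoder: MFCCs pooled by mean and std over time -> 2 * n_mfcc = 80-d
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

def encode_audio(wav):                  # wav: (B, num_samples) at 16 kHz
    feats = mfcc(wav)                   # (B, 40, T)
    return torch.cat([feats.mean(-1), feats.std(-1)], dim=1)  # (B, 80)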


Integral-based Features

To enhance model performance, simple mathematical transformations summarize signals across modalities:

import torch
from scipy.integrate import simpson  # `simps` was renamed to `simpson` in newer SciPy

# Tabular integral: area under the feature curve, one scalar per sample
def integral_tab(x):
    area = simpson(x.detach().cpu().numpy(), axis=1)                 # (B,)
    return torch.tensor(area, device=x.device).float().unsqueeze(1)  # (B, 1)

# Audio integral: overall signal energy, one scalar per sample
def integral_audio(x):
    return x.sum(dim=1, keepdim=True)                                # (B, 1)

These features are appended to the fusion vector, giving the classifier additional summary statistics.
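
For reference, integral_tab treats the tabular feature vector as samples of a curve at unit spacing and approximates its area with the composite Simpson rule (for an even number of intervals):

\[
\int_{x_0}^{x_n} f(x)\,dx \;\approx\; \frac{h}{3}\Big[f(x_0) + 4\sum_{k \text{ odd}} f(x_k) + 2\sum_{\substack{k \text{ even} \\ 0 < k < n}} f(x_k) + f(x_n)\Big], \qquad h = 1.
\]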


Fusion Model
import torch
import torch.nn as nn

# 2-layer MLP for tabular EHR features (referenced below)
class TabularEncoder(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)


class TAPFusionModelWithIntegral(nn.Module):
    def __init__(self, tab_in_dim, tab_out=64, img_dim=512, audio_dim=80,
                 text_dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.tab_encoder = TabularEncoder(tab_in_dim, tab_out)
        fusion_dim = tab_out + img_dim + audio_dim + text_dim + 2  # +2 for the integral features
        self.classifier = nn.Sequential(
            nn.Linear(fusion_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, num_classes),
        )

    def forward(self, tab_x, img_x, audio_x, text_x):
        tab_feat = self.tab_encoder(tab_x)
        tab_int = integral_tab(tab_x)        # (B, 1) summary of the raw tabular vector
        audio_int = integral_audio(audio_x)  # (B, 1) overall audio energy
        x = torch.cat([tab_feat, img_x, audio_x, text_x, tab_int, audio_int], dim=1)
        return self.classifier(x)
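
A quick smoke test with random tensors (the batch size of 4 and tab_in_dim=16 are arbitrary choices) confirms the fused width of 64 + 512 + 80 + 768 + 2 = 1426 and the two-class output:

model = TAPFusionModelWithIntegral(tab_in_dim=16)
logits = model(torch.randn(4, 16),    # tabular features
               torch.randn(4, 512),   # image embedding (ResNet18)
               torch.randn(4, 80),    # pooled MFCC audio features
               torch.randn(4, 768))   # DistilBERT sentence embedding
print(logits.shape)  # torch.Size([4, 2])
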
Prototype Dataset

Synthetic data simulate the multi-modal inputs (a generation sketch follows the list):


  • Random tabular values

  • Random brain-like images

  • Sinusoidal audio signals

  • Template clinical notes

  • Binary labels: 0 = healthy, 1 = Alzheimer's risk (~30% positive)
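
A generator in this spirit might look as follows; the batch size, feature dimensions, image size, tone frequencies, and template note are all illustrative assumptions:

import math
import torch

def make_synthetic_batch(n=32, tab_dim=16, pos_rate=0.3):
    tab = torch.randn(n, tab_dim)                 # random EHR-like values
    img = torch.randn(n, 3, 224, 224)             # random brain-like images
    t = torch.linspace(0, 1, 16000).repeat(n, 1)  # 1 s of audio at 16 kHz
    freqs = torch.randint(100, 400, (n, 1))       # per-sample tone frequency (Hz)
    audio = torch.sin(2 * math.pi * freqs * t)    # sinusoidal audio signals
    notes = ["Patient reports word-finding pauses and mild memory lapses."] * n
    labels = (torch.rand(n) < pos_rate).long()    # ~30% positive
    return tab, img, audio, notes, labels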


Training & Evaluation
  • Loss: CrossEntropy

  • Optimizer: Adam (lr=1e-4)

  • Split: 80/20 train/validation

  • Metrics: Accuracy
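
A minimal loop implementing these settings is sketched below; the batch sizes and epoch count are assumptions, and all modalities are assumed pre-encoded to fixed-size tensors so they fit in a TensorDataset:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

def train(model, tab, img_feat, audio_feat, text_feat, labels, epochs=10):
    ds = TensorDataset(tab, img_feat, audio_feat, text_feat, labels)
    n_train = int(0.8 * len(ds))                  # 80/20 train/validation split
    train_ds, val_ds = random_split(ds, [n_train, len(ds) - n_train])
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for tb, im, au, tx, y in DataLoader(train_ds, batch_size=16, shuffle=True):
            opt.zero_grad()
            loss_fn(model(tb, im, au, tx), y).backward()
            opt.step()
        model.eval()                              # validation accuracy
        correct = total = 0
        with torch.no_grad():
            for tb, im, au, tx, y in DataLoader(val_ds, batch_size=64):
                correct += (model(tb, im, au, tx).argmax(1) == y).sum().item()
                total += len(y)
        print(f"epoch {epoch}: val_acc = {correct / total:.3f}")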


Observation: even crude integral features measurably shift the model's output probabilities, suggesting that simple mathematical augmentations of the fusion vector can influence the learned decision boundary.


Example Inference (Brain Image + Dummy Features)
import torch.nn.functional as F

img_feat = encode_image_integral(img_pil)  # integral-enhanced 512-d image feature
with torch.no_grad():
    probs = F.softmax(model(demo_tab, img_feat, demo_audio, demo_text), dim=1).cpu().numpy()[0]
print("Prediction probabilities:", {"healthy": probs[0], "alzheimer_risk": probs[1]})
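
The article does not define encode_image_integral. A hypothetical reconstruction (purely an assumption, reusing img_encoder from the encoder sketch above) applies standard ImageNet preprocessing followed by ResNet18 feature extraction; the "integral enhancement" step itself is not specified in the source:

from torchvision import transforms

# HYPOTHETICAL reconstruction; the original article does not show this function.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def encode_image_integral(img_pil):
    img_encoder.eval()
    with torch.no_grad():
        return img_encoder(preprocess(img_pil).unsqueeze(0))  # (1, 512)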

Results (for one sample test brain image; the image itself is omitted here):


=== Original Model ===


  • healthy: 0.446

  • alzheimer_risk: 0.554



=== Integral-enhanced Model ===


  • healthy: 0.450

  • alzheimer_risk: 0.550


Even this small shift shows that integral-based features can modulate predictions; whether they actually improve accuracy remains to be verified on real data.


Conclusion

TAP-Fusion demonstrates a multi-modal, mathematically augmented deep learning framework for early Alzheimer's risk detection. By integrating imaging, audio, text, and tabular EHR data, and by incorporating simple integral-based features, it provides a modular, extensible prototype that can serve as a foundation for clinical research on real patient data.


Future work includes:


  • Applying TAP-Fusion to real patient datasets

  • Incorporating longitudinal data

  • Using explainability techniques (e.g., SHAP, Grad-CAM) to highlight predictive modalities and features


