Multi-modal Learning
Multimodal Learning for Improving Non-English Classification Tasks
We use multimodal (image + text) learning to overcome the language disparity between English and non-English languages. The figure illustrates an example of a social media post that is correctly classified in English but misclassified in Spanish; including the corresponding image leads to correct classification in Spanish as well as in other non-English languages.
This project was done as part of the course CS 7643 Deep Learning at Georgia Tech.
We investigated different multi-modal deep learning architectures, focusing on bridging the performance gap between classification on English-language tasks and non-English (Hindi in our study) language tasks. The term multi-modal here means that we used both textual and image-based inputs (modes) while training our classification models.
Our aim was to:
- Conduct a comparative analysis of different text and image networks to understand which architectures work best.
- Understand how much images can reduce the performance discrepancy between English and non-English text classification models.
We used the CrisisMMD dataset for training and testing our models.
You can read our full report in the GitHub repository, but in brief we:
- Trained a ResNet-18, a VGG network, and an Inception-V3 to compare different deep learning architectures and to compute image embeddings (see the embedding sketch after this list).
- Fine-tuned two pre-trained BERT models: DistilmBERT (distilbert-base-multilingual-cased on HuggingFace) to classify the English text, and HindiBERT (Doiron 2020) for Hindi (see the fine-tuning sketch below).
- Finally, implemented a multimodal (fusion) classifier that combines the text and image embeddings to perform classification based on the joint modeling of both input modalities (see the fusion sketch below).
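
As a rough illustration of the image-embedding step, the sketch below uses a pre-trained ResNet-18 from torchvision with its classification head replaced by an identity layer, so the forward pass returns the 512-dimensional penultimate features. This is a minimal sketch, not the exact code from our repository; the preprocessing values are the standard ImageNet ones and the helper name is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load a ResNet-18 pre-trained on ImageNet and drop its classification head
# so the forward pass returns the 512-dimensional penultimate features.
resnet = models.resnet18(pretrained=True)
resnet.fc = nn.Identity()
resnet.eval()

# Standard ImageNet preprocessing (resize, center-crop, normalize).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path: str) -> torch.Tensor:
    """Return a 512-d embedding for a single image file (illustrative helper)."""
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(batch).squeeze(0)   # shape: (512,)
```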
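The text side can be fine-tuned with the HuggingFace transformers API. The sketch below loads distilbert-base-multilingual-cased for sequence classification; the label count, the example tweets, and the training details are illustrative assumptions, and the same pattern would apply to the HindiBERT checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-multilingual-cased"  # DistilmBERT checkpoint for English text
NUM_LABELS = 2  # assumed binary task (e.g. informative vs. not informative)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# Tokenize a (hypothetical) batch of posts and run a forward pass; during
# fine-tuning the returned loss is back-propagated with a small learning rate.
texts = ["Flood waters are rising near the bridge", "Lovely sunny day at the beach"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)
```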
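For the fusion classifier, one common late-fusion pattern is to concatenate the two embeddings and pass the joint vector through a small MLP. The sketch below follows that pattern under assumed dimensions (768-d text, 512-d image, 2 classes) and is not the exact architecture from our report.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late-fusion classifier: concatenates a text embedding and an image
    embedding and feeds the joint vector through a small MLP.
    All dimensions below are illustrative assumptions."""

    def __init__(self, text_dim: int = 768, image_dim: int = 512,
                 hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Joint modeling of both modalities via simple concatenation.
        joint = torch.cat([text_emb, image_emb], dim=-1)  # (batch, text_dim + image_dim)
        return self.mlp(joint)

# Example: a batch of 4 posts with pre-computed (here random) embeddings.
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```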