IJCOPE Journal


International Journal of Creative and Open Research in Engineering and Management

A Peer-Reviewed, Open-Access International Journal Supporting Multidisciplinary Research, Digital Publishing Standards, DOI Registration, and Academic Indexing.
Journal Information
ISSN: 3108-1754 (Online)
Crossref DOI: Available
ISO Certification: 9001:2015
Publication Fee: INR 599
Compliance: UGC Journal Norms
License: CC BY-NC 4.0
Peer Review: Double-Blind
Volume 02, Issue 03

Published on: March 2026

LIGHTWEIGHT MULTIMODAL EMOTION RECOGNITION USING CROSS-DATASET FEATURE FUSION OF TEXT AND FACIAL EXPRESSIONS

Susrita Mishra, Phalguni Patnaik, Samikhya Patnaik, Ujjwal Singh, Santosh Kumar Kar, and Bandhan Panda

Dept. of Computer Science & Engineering, NIST University, Berhampur, India

Article Status

Plagiarism: Passed · Peer Reviewed · Open Access


Abstract

Emotion recognition has become a significant task in affective computing, enabling intelligent systems to recognise and respond to human emotions. It plays a central role in human-computer interaction, healthcare monitoring, and smart virtual assistants. Most conventional emotion recognition systems rely on a single expressive channel, such as textual affect or facial expression, and therefore capture only a limited part of the depth and multi-component nature of human feelings. To address this shortcoming, this paper proposes a multimodal emotion recognition framework that integrates text and visual cues at the deep feature level. The proposed system utilises DistilBERT to extract contextual textual representations from the ISEAR dataset and EfficientNet-B3 to extract facial expression features from the FER2013 and RAF-DB datasets. Because the textual and visual data come from different datasets, a cross-dataset pairing strategy is proposed that forms multimodal training samples by matching textual descriptions to facial images carrying the same emotion label. The extracted features are combined through a gated feature fusion mechanism and fed into a Long Short-Term Memory (LSTM) classifier. Experimental findings indicate that the proposed multimodal model achieves an accuracy of 82%, outperforming the text-only model (61%) and the image-only model (69%), which demonstrates the effectiveness of cross-dataset multimodal emotion recognition.
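For readers who want a concrete picture of the pipeline, the sketch below illustrates one plausible reading of the abstract in PyTorch: label-matched cross-dataset pairing, followed by gated fusion of pre-extracted DistilBERT and EfficientNet-B3 features and an LSTM classification head. The feature dimensions (768 and 1536), the seven-way shared label set, the exact gating formula, and all names (`pair_by_label`, `GatedFusionLSTM`) are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of the cross-dataset pairing and gated fusion + LSTM
# classifier described in the abstract. Dimensions, gating form, and the
# length-1-sequence framing for the LSTM are assumptions.
import random

import torch
import torch.nn as nn


def pair_by_label(texts, text_labels, images, image_labels, seed=0):
    """Assumed pairing strategy: match each text sample with a randomly
    drawn face image that carries the same emotion label."""
    rng = random.Random(seed)
    by_label = {}
    for img, lab in zip(images, image_labels):
        by_label.setdefault(lab, []).append(img)
    return [(t, rng.choice(by_label[lab]), lab)
            for t, lab in zip(texts, text_labels) if lab in by_label]


class GatedFusionLSTM(nn.Module):
    """Gated fusion of pre-extracted text and image features, followed by
    an LSTM head; 7 classes assumes a shared ISEAR/FER-style label set."""

    def __init__(self, text_dim=768, img_dim=1536, hidden=256, n_classes=7):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        # The gate weighs, per dimension, how much each modality contributes.
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text_feat, img_feat):
        t = torch.tanh(self.text_proj(text_feat))  # (batch, hidden)
        v = torch.tanh(self.img_proj(img_feat))    # (batch, hidden)
        g = self.gate(torch.cat([t, v], dim=-1))   # gate values in [0, 1]
        fused = g * t + (1 - g) * v                # gated convex combination
        # Feed the fused vector to the LSTM as a length-1 sequence.
        _, (h, _) = self.lstm(fused.unsqueeze(1))
        return self.classifier(h[-1])              # emotion class logits
```

In this reading, the sigmoid gate learns a per-dimension convex combination of the two modality embeddings, so the network can lean on facial features where the text is ambiguous and vice versa; the paper's actual fusion design may differ.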

How to Cite this Paper

Mishra, S., Patnaik, P., Patnaik, S., Singh, U., Kar, S. K., & Panda, B. (2026). Lightweight Multimodal Emotion Recognition Using Cross-Dataset Feature Fusion of Text and Facial Expressions. International Journal of Creative and Open Research in Engineering and Management, 02(03). https://doi.org/10.55041/ijcope.v2i3.142

Mishra, Susrita, et al. "Lightweight Multimodal Emotion Recognition Using Cross-Dataset Feature Fusion of Text and Facial Expressions." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 03, 2026, https://doi.org/10.55041/ijcope.v2i3.142.

Mishra, Susrita, Phalguni Patnaik, Samikhya Patnaik, Ujjwal Singh, Santosh Kumar Kar, and Bandhan Panda. "Lightweight Multimodal Emotion Recognition Using Cross-Dataset Feature Fusion of Text and Facial Expressions." International Journal of Creative and Open Research in Engineering and Management 02, no. 03 (2026). https://doi.org/10.55041/ijcope.v2i3.142.


References


  1. M. Wafa, M. M. Eldefrawi, and M. S. Farhan, “Advancing multimodal emotion recognition in big data through prompt engineering and deep adaptive learning,” Journal of Big Data, vol. 12, no. 210, 2025, doi: 10.1186/s40537-025-01264-w.

  2. El Maazouzi and A. Retbi, “Multimodal detection of emotional and cognitive states in e-learning through deep fusion of visual and textual data with NLP,” Computers, vol. 14, no. 314, 2025, doi: 10.3390/computers14080314.

  3. S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017, doi: 10.1016/j.inffus.2017.02.003.

  4. T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019, doi: 10.1109/TPAMI.2018.2798607.

  5. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, doi: 10.18653/v1/N19-1423.

  6. V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” in Proc. NeurIPS Workshop, 2019, doi: 10.48550/arXiv.1910.01108.

  7. M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019, doi: 10.48550/arXiv.1905.11946.

  8. S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Transactions on Affective Computing, vol. 13, no. 3, pp. 1195–1215, 2022, doi: 10.1109/TAFFC.2020.2981446.

  9. A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2019, doi: 10.1109/TAFFC.2017.2740923.

  10. I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” in Proc. Int. Conf. Neural Information Processing, 2013, doi: 10.1007/978-3-642-42051-1_16.

Ethical Compliance & Review Process

  • All submissions are screened for plagiarism.
  • Review follows editorial policy.
  • Authors retain copyright.
  • Peer Review Type: Double-Blind
  • Published on: March 26, 2026

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.
