An Automated Image Captioning Framework Based on Vision Transformers and LSTM Networks

Praneesh, Dr M; Napoleon, Dr.D.

doi:https://doi.org/10.55041/ijcope.v2i5.850

Published on: May 2026

Department of Computer Science with Data Analytics / Sri Ramakrishna College of Arts & Science / Bharathiar University, Coimbatore, India

Abstract

Image captioning is an important research area in artificial intelligence that integrates computer vision and natural language processing (NLP) to automatically generate descriptive textual interpretations of images. Conventional image captioning systems typically employ Convolutional Neural Networks (CNNs) for visual feature extraction and Long Short-Term Memory (LSTM) networks for generating sequential text descriptions. Although effective, CNN-based approaches may have limitations in capturing global contextual relationships within images.

This research introduces an improved image captioning framework that utilizes Vision Transformers (ViTs) as the feature extraction backbone instead of traditional CNN architectures. By leveraging self-attention mechanisms, Vision Transformers can effectively model long-range dependencies and capture comprehensive contextual information from visual data. The extracted image representations are subsequently provided to an LSTM network, which generates coherent and meaningful captions in a sequential manner.
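As an illustration of the self-attention idea described above, the following is a minimal sketch, not the authors' implementation. It assumes a toy 4x4 grayscale image, 2x2 patches, a single attention head, and identity query/key/value projections; a real Vision Transformer learns these projections and adds positional embeddings. The function names `split_into_patches` and `self_attention` are hypothetical, and the LSTM decoder is not shown.

```python
import math


def split_into_patches(image, patch):
    """Cut an H x W image (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch short.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this patch against every patch, including distant ones.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all patch vectors.
        outputs.append([sum(row[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(dim)])
    return outputs, weights


# Toy 4x4 image, an assumption for illustration only.
image = [[0.0, 0.1, 0.9, 1.0],
         [0.1, 0.0, 1.0, 0.9],
         [0.5, 0.5, 0.2, 0.2],
         [0.5, 0.5, 0.2, 0.2]]

patches = split_into_patches(image, 2)
outputs, weights = self_attention(patches)
```

Every row of `weights` is a probability distribution over all patches, so each output vector can draw on any region of the image in a single step. This is the long-range modelling property the framework relies on; in the full pipeline the resulting patch representations would be passed to the LSTM decoder.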

The proposed model is evaluated using widely accepted image captioning metrics, namely BLEU and METEOR scores. Experimental findings indicate that the Vision Transformer-based approach produces more accurate, descriptive, and context-aware captions than conventional CNN-LSTM models. This makes the framework suitable for real-world applications such as assistive technologies for visually impaired individuals, automated image annotation, content management systems, and intelligent multimedia retrieval.
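To make the BLEU metric concrete, here is a minimal sketch of sentence-level BLEU: clipped (modified) n-gram precision combined by a geometric mean and scaled by a brevity penalty. It is an unsmoothed, illustrative version and not the evaluation code used in the paper; the function names and the example captions are assumptions. METEOR, which also accounts for stems and synonyms, is not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical captions for illustration only.
candidate = "a dog runs on the grass".split()
reference = "a dog runs across the grass".split()

bleu2 = bleu(candidate, [reference], max_n=2)  # sqrt(5/6 * 3/5), about 0.7071
bleu4 = bleu(candidate, [reference], max_n=4)  # 0.0, no 4-gram matches
```

The example shows why short captions are often reported with BLEU-1 or BLEU-2 alongside BLEU-4: a single wrong word in a six-word caption removes every matching 4-gram, so the unsmoothed BLEU-4 score drops to zero even though most of the caption is correct.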

Keywords— Image Captioning, Vision Transformer (ViT), Long Short-Term Memory (LSTM), Natural Language Processing (NLP), Computer Vision, Text Generation, BLEU Score

How to Cite this Paper

Praneesh, M. & Napoleon, D. (2026). An Automated Image Captioning Framework Based on Vision Transformers and LSTM Networks. International Journal of Creative and Open Research in Engineering and Management, 02(05). https://doi.org/10.55041/ijcope.v2i5.850

Praneesh, M., and D. Napoleon. "An Automated Image Captioning Framework Based on Vision Transformers and LSTM Networks." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 05, 2026. doi:https://doi.org/10.55041/ijcope.v2i5.850.

Praneesh, M., and D. Napoleon. "An Automated Image Captioning Framework Based on Vision Transformers and LSTM Networks." International Journal of Creative and Open Research in Engineering and Management 02, no. 05 (2026). https://doi.org/10.55041/ijcope.v2i5.850.

Ethical Compliance & Review Process

• All submissions are screened under plagiarism detection.
• Review follows editorial policy.
• Authors retain copyright.
• Peer Review Type: Double-Blind Peer Review
• Published on: May 31, 2026

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.
