Enhancing Visual Search Capabilities Through Visual Language Model

Rohith, P; S, Nayana; P, Prithviraj; H, Baba Fakruddin Ali B; Naik, Maulya; Doddamani, Harshavardhana

doi:https://doi.org/10.55041/ijcope.v2i6.162

Volume 02, Issue 6

Published on: June 2026

P Rohith, Nayana S, Prithviraj P, Baba Fakruddin Ali B H, Maulya Naik, Harshavardhana Doddamani

Dept. of CSE, Nagarjuna College of Engineering and Technology, Bengaluru, India

Abstract

General-purpose Vision-Language Models (VLMs) such as CLIP suffer from a significant "domain gap" when applied to specialized fields, where they fail to differentiate nuanced visual categories. While fine-tuning is a known solution, this paper addresses the critical secondary "data noise problem" that arises from using LLMs for dataset creation. We found that nearly 19% of our initial LLM-generated culinary dataset consisted of generic, "noisy" captions (e.g., "A photo of a food dish"). This work presents a comprehensive end-to-end methodology anchored in a rigorous data refinement framework designed to eliminate this noise, combined with an iterative, sequential fine-tuning strategy that progressively decays the learning rate to prevent overfitting. The combined method proved highly effective, improving the model's performance on unseen validation data from a 77.56% baseline (on noisy data) to a peak accuracy of 93.00% (on the refined dataset). This work provides a reproducible blueprint for adapting general VLMs to niche domains and demonstrates that methodical data refinement is as critical as the model's architecture.
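The two ideas named in the abstract, filtering generic captions and lowering the learning rate across sequential fine-tuning stages, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how noisy captions were detected or which learning rates were used, so the regular expression, the function names (`is_generic`, `refine`, `stage_learning_rates`), the geometric decay rule, and the example values are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for generic captions. The abstract gives only one
# example ("A photo of a food dish"); this regex is an assumed generalisation.
GENERIC_PATTERNS = [
    re.compile(
        r"^(a|an)\s+(photo|picture|image)\s+of\s+(a\s+|an\s+)?(food\s+)?"
        r"(dish|food|meal|plate)\.?$",
        re.IGNORECASE,
    ),
]


def is_generic(caption: str) -> bool:
    """Return True if the caption matches one of the assumed generic patterns."""
    text = " ".join(caption.split())  # collapse repeated whitespace
    return any(p.match(text) for p in GENERIC_PATTERNS)


def refine(pairs):
    """Keep only (image_path, caption) pairs whose caption is not generic."""
    return [(img, cap) for img, cap in pairs if not is_generic(cap)]


def stage_learning_rates(initial_lr: float, decay: float, stages: int):
    """Learning rate per sequential fine-tuning stage.

    Assumes a geometric decay (each stage multiplies the rate by `decay`);
    the abstract states only that the rate is progressively decayed.
    """
    return [initial_lr * decay ** i for i in range(stages)]


if __name__ == "__main__":
    dataset = [
        ("img_001.jpg", "A photo of a food dish"),
        ("img_002.jpg", "Crispy masala dosa served with coconut chutney"),
    ]
    print(refine(dataset))
    print(stage_learning_rates(1e-5, 0.5, 3))
```

In a real pipeline each stage would resume from the previous stage's checkpoint and train on the refined pairs with the corresponding rate; the starting rate of 1e-5 and decay factor of 0.5 shown here are placeholders, not values reported in the paper.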

How to Cite this Paper

Rohith, P., S, N., P, P., H, B. F. A. B., Naik, M. & Doddamani, H. (2026). Enhancing Visual Search Capabilities Through Visual Language Model. International Journal of Creative and Open Research in Engineering and Management, 02(6). https://doi.org/10.55041/ijcope.v2i6.162

Rohith, P, et al. "Enhancing Visual Search Capabilities Through Visual Language Model." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 6, 2026. doi:https://doi.org/10.55041/ijcope.v2i6.162.

Rohith, P, Nayana S, Prithviraj P, Baba Fakruddin Ali B H, Maulya Naik, and Harshavardhana Doddamani. "Enhancing Visual Search Capabilities Through Visual Language Model." International Journal of Creative and Open Research in Engineering and Management 02, no. 6 (2026). https://doi.org/10.55041/ijcope.v2i6.162.

Ethical Compliance & Review Process

• All submissions are screened under plagiarism detection.
• Review follows editorial policy.
• Authors retain copyright.
• Peer Review Type: Double-Blind Peer Review
• Published on: Jun 13 2026

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.
