Published on: June 2026
ENHANCING VISUAL SEARCH CAPABILITIES THROUGH VISUAL LANGUAGE MODEL
P Rohith, Nayana S, Prithviraj P, Baba Fakruddin Ali B H, Maulya Naik, Harshavardhana Doddamani
Abstract
General-purpose Vision-Language Models (VLMs) such as CLIP suffer from a significant "domain gap" when applied to specialized fields, where they fail to differentiate nuanced visual categories. Fine-tuning is a known solution, but this paper addresses a critical secondary issue: the "data noise problem" that arises when LLMs are used for dataset creation. We found that nearly 19% of our initial LLM-generated culinary dataset consisted of generic, "noisy" captions (e.g., "A photo of a food dish"). This work presents a comprehensive end-to-end methodology anchored in a rigorous data refinement framework designed to eliminate this noise, combined with an iterative, sequential fine-tuning strategy that progressively decays the learning rate to prevent overfitting. The combined method proved highly effective: accuracy on unseen validation data rose from a 77.56% baseline (on noisy data) to a peak of 93.00% (on the refined dataset). This work provides a reproducible blueprint for adapting general VLMs to niche domains and demonstrates that methodical data refinement is as critical as the model's architecture.
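The two ideas named in the abstract, filtering generic captions and lowering the learning rate from one fine-tuning stage to the next, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the regular expression, the function names `is_generic`, `filter_captions`, and `stage_learning_rates`, and the numeric values are hypothetical and are not taken from the paper, whose actual refinement criteria and hyperparameters are not stated on this page.

```python
import re
from typing import List

# Hypothetical pattern for "generic" captions such as "A photo of a food dish".
# The paper's real noise criteria are not given here; this is only an example rule.
GENERIC_RE = re.compile(
    r"(a|an|the)?\s*(photo|picture|image)\s+of\s+(a|an|the|some)?\s*"
    r"(food dish|food|dish|meal)"
)


def is_generic(caption: str) -> bool:
    """Return True if the whole caption matches the generic template."""
    text = caption.strip().lower().rstrip(".")
    return GENERIC_RE.fullmatch(text) is not None


def filter_captions(captions: List[str]) -> List[str]:
    """Keep only captions that carry some specific description."""
    return [c for c in captions if not is_generic(c)]


def stage_learning_rates(initial_lr: float, decay: float, num_stages: int) -> List[float]:
    """Learning rate for each sequential fine-tuning stage (geometric decay, an assumption)."""
    return [initial_lr * decay ** stage for stage in range(num_stages)]


if __name__ == "__main__":
    sample = [
        "A photo of a food dish",
        "Paneer butter masala served with garlic naan",
        "an image of food.",
    ]
    print(filter_captions(sample))
    # Illustrative values only: three stages, halving the rate each time.
    print(stage_learning_rates(1e-5, 0.5, 3))
```

In practice a rule-based filter like this would likely be only a first pass, and each stage's learning rate would be passed to whatever optimizer the fine-tuning code uses.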
How to Cite this Paper
Rohith, P., S, N., P, P., H, B. F. A. B., Naik, M. & Doddamani, H. (2026). Enhancing Visual Search Capabilities Through Visual Language Model. International Journal of Creative and Open Research in Engineering and Management, 02(6). https://doi.org/10.55041/ijcope.v2i6.162
Rohith, P., et al. "Enhancing Visual Search Capabilities Through Visual Language Model." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 6, 2026. https://doi.org/10.55041/ijcope.v2i6.162.
Rohith, P, Nayana S, Prithviraj P, Baba H, Maulya Naik, and Harshavardhana Doddamani. "Enhancing Visual Search Capabilities Through Visual Language Model." International Journal of Creative and Open Research in Engineering and Management 02, no. 6 (2026). https://doi.org/10.55041/ijcope.v2i6.162.
References
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 8748-8763.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Advances in Neural Information Processing Systems 30 (NIPS), 2017, pp. 5998-6008.
- J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019, pp. 4171-4186.
- C. Jia, Y. Yang, Y. Xia, Y. T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. H. Sung, Z. Li, and T. Duerig, "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision," in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 4904-4916.
- A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, "FLAVA: A Foundational Language and Vision Alignment Model," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15638-15650.
- J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535-547, 2021.
- Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824-836, 2020.
Ethical Compliance & Review Process
- All submissions are screened under plagiarism detection.
- Review follows editorial policy.
- Authors retain copyright.
- Peer Review Type: Double-Blind Peer Review
- Published on: Jun 13 2026
This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.

