Shieldllm: A Hybrid Adversarial Prompt Injection Detection Framework for Securing Large Language Models

Kumar, Vijay

doi:https://doi.org/10.55041/ijcope.v2i4.463

Volume 02, Issue 04

Published on: April 2026

SHIELDLLM: A HYBRID ADVERSARIAL PROMPT INJECTION DETECTION FRAMEWORK FOR SECURING LARGE LANGUAGE MODELS

Vijay Kumar

Department of Computer Science and Engineering, Parul Institute of Technology, Parul University, Gujarat, India

DOI:https://doi.org/10.55041/ijcope.v2i4.463

Article Status

Plagiarism Passed Peer Reviewed Open Access

Available Documents

Download PDF Review Report

Abstract

Large Language Models (LLMs) have rapidly permeated enterprise, consumer, and governmental applications, fundamentally transforming the human–computer interaction paradigm. However, their widespread deployment has exposed critical security vulnerabilities, most notably adversarial prompt injection attacks, in which maliciously crafted inputs are designed to override system-level instructions, exfiltrate sensitive data, or hijack model behaviour. Existing safeguards, such as coarse-grained output filters and rule-based blocklists, are demonstrably insufficient against semantically sophisticated attack vectors. This paper proposes ShieldLLM—a real-time, hybrid AI firewall that combines Bidirectional Encoder Representations from Transformers (BERT)-derived semantic embeddings with an ensemble Random Forest classifier and a complementary rule-based detection layer to classify incoming prompts as either Safe or Injection Attack with a latency budget under 45 ms. Evaluated on a corpus of 10,000 labelled prompts spanning five injection sub-categories, ShieldLLM achieves 96.3% accuracy, 95.8% precision, 95.7% recall, and an AUC-ROC of 0.982, surpassing all evaluated baselines. The framework is architecturally agnostic and can be integrated as middleware within any LLM serving stack. This work advances the nascent field of LLM-specific intrusion detection and provides a reproducible benchmark dataset for the research community

How to Cite this Paper

Kumar, V. (2026). Shieldllm: A Hybrid Adversarial Prompt Injection Detection Framework for Securing Large Language Models. International Journal of Creative and Open Research in Engineering and Management, <i>02</i>(04). https://doi.org/10.55041/ijcope.v2i4.463

Kumar, Vijay. "Shieldllm: A Hybrid Adversarial Prompt Injection Detection Framework for Securing Large Language Models." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 04, 2026, pp. . doi:https://doi.org/10.55041/ijcope.v2i4.463.

Kumar, Vijay. "Shieldllm: A Hybrid Adversarial Prompt Injection Detection Framework for Securing Large Language Models." International Journal of Creative and Open Research in Engineering and Management 02, no. 04 (2026). https://doi.org/https://doi.org/10.55041/ijcope.v2i4.463.

Search & Index

References

[1] T. B. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.

[2] F. Perez and I. Ribeiro, "Ignore previous prompt: Attack techniques for language models," in Proc. NeurIPS Workshop on Machine Learning Safety, New Orleans, LA, USA, Nov. 2022.

[3] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in Proc. ACM Workshop on Artificial Intelligence and Security (AISec), Copenhagen, Denmark, 2023, pp. 79–90.

[4] A. Alon and M. Kamfonas, "Detecting language model attacks with perplexity," in Proc. ICLR Workshop on Secure and Trustworthy Large Language Models, Vienna, Austria, 2024.

[5] I. Markov, A. Dey, O. Harel, and Y. Goel, "Holistic approach to undesired content detection in the real world," in Proc. AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023, vol. 37, no. 12, pp. 15009–15018.

[6] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, "HotFlip: White-box adversarial examples for text classification," in Proc. 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 2018, pp. 31–36.

[7] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, "Universal adversarial triggers for attacking and analyzing NLP," in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, 2019, pp. 2153–2162.

[8] L. Ouyang et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 27730–27744, 2022.

[9] A. Wei, N. Haghtalab, and J. Steinhardt, "Jailbroken: How does LLM safety training fail?" in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 80079–80110, 2023.

[10] Y. Bai et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, Dec. 2022.

Ethical Compliance & Review Process

•All submissions are screened under plagiarism detection.
•Review follows editorial policy.
•Authors retain copyright.
•Peer Review Type: Double-Blind Peer Review
•Published on: Apr 18 2026

CCBYNC

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.

View License

Back to Volume 02, Issue 04 View All Issues Next Article

← Previous Article

She Found

Next Article →

Sign Gesture to Audio Conversion