IJCOPE Journal


International Journal of Creative and Open Research in Engineering and Management

A Peer-Reviewed, Open-Access International Journal Supporting Multidisciplinary Research, Digital Publishing Standards, DOI Registration, and Academic Indexing.
Journal Information
ISSN: 3108-1754 (Online)
Crossref DOI: Available
ISO Certification: 9001:2015
Publication Fee: 599/- INR
Compliance: UGC Journal Norms
License: CC BY 4.0
Peer Review: Double Blind
Volume 02, Issue 04

Published on: April 2026

A FRAMEWORK FOR RIGOROUS AND REPLICATION ML MODEL ASSESSMENT: INTEGRATING STATISTICAL SIGNIFICANCE AND PRACTICAL EVALUATION

Rakshit Ranjan Singh, Avani Singh, Kazim Mahadi

Adlin Jebakumari S

School of Computer Science and Information Technology, JAIN (Deemed-to-be University), Bengaluru, India

Article Status

Plagiarism: Passed | Peer Reviewed | Open Access


Abstract

The rapid adoption of machine learning (ML) across critical infrastructure, from healthcare diagnostics to financial risk assessment, has exposed serious problems in model evaluation. A common pitfall is the "accuracy trap": models are judged superior on the basis of raw accuracy scores without assessing whether the gains are statistically genuine or practically meaningful. The field also struggles with a "replication problem": without versioned data, code, and environment details, many reported performance gains cannot be verified. This paper proposes a solution in the form of a Dual-Pillar Validation Framework. We introduce a methodology that combines rigorous statistical hypothesis testing (specifically bootstrap confidence intervals and the 5x2cv paired t-test) with practical evaluation audits covering fairness, data drift, and latency. By embedding this framework in a reproducible MLOps pipeline built on Docker, DVC, and MLflow, we establish an automated "guardian" mechanism. This ensures that deployed models are not performing well by coincidence, but are statistically robust, ethically sound, and efficient in practice.
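The two statistical tests named in the abstract can be sketched as follows. This is an illustrative implementation, not the authors' released code: the dataset, models, and helper names (`bootstrap_ci`, `paired_ttest_5x2cv`) are assumptions made for the sketch. The bootstrap helper computes a percentile confidence interval for accuracy by resampling per-example correctness; the second helper follows Dietterich's (1998) 5x2cv paired t-test, which runs five repetitions of 2-fold cross-validation and builds the t statistic from the first-fold score difference and the per-repetition variances.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy: resample per-example correctness."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    n = len(correct)
    accs = [correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    return np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])


def paired_ttest_5x2cv(clf_a, clf_b, X, y, seed=0):
    """5x2cv paired t-test (Dietterich, 1998): 5 repetitions of 2-fold CV.

    Returns (t, p) for H0: the two classifiers have equal accuracy.
    """
    variances = []
    first_diff = None
    for i in range(5):
        kf = KFold(n_splits=2, shuffle=True, random_state=seed + i)
        fold_diffs = []
        for train, test in kf.split(X):
            acc_a = clf_a.fit(X[train], y[train]).score(X[test], y[test])
            acc_b = clf_b.fit(X[train], y[train]).score(X[test], y[test])
            fold_diffs.append(acc_a - acc_b)
        if first_diff is None:
            first_diff = fold_diffs[0]
        mean = np.mean(fold_diffs)
        # Per-repetition variance: sum of squared deviations of the two folds.
        variances.append((fold_diffs[0] - mean) ** 2 + (fold_diffs[1] - mean) ** 2)
    # Small epsilon guards against zero variance in degenerate cases.
    t = first_diff / np.sqrt(np.mean(variances) + 1e-12)
    p = 2 * stats.t.sf(abs(t), df=5)  # two-sided p-value, 5 degrees of freedom
    return t, p


if __name__ == "__main__":
    X, y = make_classification(n_samples=400, random_state=0)
    t, p = paired_ttest_5x2cv(
        LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0), X, y
    )
    print(f"t = {t:.3f}, p = {p:.3f}")
```

In a "guardian" pipeline of the kind the abstract describes, a candidate model would be promoted only if the p-value falls below a chosen significance level *and* the bootstrap interval clears a practical-effect threshold, rather than on a raw accuracy comparison alone.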

How to Cite this Paper

Singh, R. R., Singh, A., & Mahadi, K. (2026). A Framework for Rigorous and Replication ML Model Assessment: Integrating Statistical Significance and Practical Evaluation. International Journal of Creative and Open Research in Engineering and Management, 02(04). https://doi.org/10.55041/ijcope.v2i4.234

Singh, Rakshit, et al. "A Framework for Rigorous and Replication ML Model Assessment: Integrating Statistical Significance and Practical Evaluation." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 04, 2026. doi:10.55041/ijcope.v2i4.234.

Singh, Rakshit, Avani Singh, and Kazim Mahadi. "A Framework for Rigorous and Replication ML Model Assessment: Integrating Statistical Significance and Practical Evaluation." International Journal of Creative and Open Research in Engineering and Management 02, no. 04 (2026). https://doi.org/10.55041/ijcope.v2i4.234.


References


  1. Raschka, S. (2018). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv preprint arXiv:1811.12808.

  2. Pineau, J., et al. (2021). Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164), 1-20.

  3. Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9).

  4. Google Cloud Architecture Center. (n.d.). MLOps: Continuous delivery and automation pipelines in machine learning. Retrieved from Google Cloud Documentation.

  5. Nagarajan, V., et al. (2019). Deterministic Implementations for Reproducibility in Deep Learning. NeurIPS Workshop.

  6. Boettiger, C. (2015). An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1), 71-79.

  7. Kuprieiev, R., et al. (2021). DVC: Data Version Control - Git for Data & Models. Iterative.

  8. Barrak, A., et al. (2021). Analysis of Data Versioning Tools for Machine Learning Operations. Medium.

  9. Scribbr. (2022). Statistical significance vs. practical significance. Retrieved from Scribbr.com.

  10. Lin, M., Lucas, H. C., & Shmueli, G. (2013). Too Big to Fail: Large Samples and the p-Value Problem. Information Systems Research, 24(4), 906-917.

Ethical Compliance & Review Process

  • All submissions are screened under plagiarism detection.
  • Review follows editorial policy.
  • Authors retain copyright.
  • Peer Review Type: Double-Blind Peer Review
  • Published on: Apr 23 2026

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.
