Published on: April 2026
A FRAMEWORK FOR RIGOROUS AND REPLICABLE ML MODEL ASSESSMENT: INTEGRATING STATISTICAL SIGNIFICANCE AND PRACTICAL EVALUATION
Rakshit Ranjan Singh Avani Singh Kazim Mahadi
Adlin Jebakumari S
Abstract
The rapid adoption of machine learning (ML) across critical infrastructure, from healthcare diagnostics to financial risk assessment, has exposed serious shortcomings in model evaluation. A common problem in the field is the "accuracy trap": models are judged superior on the basis of raw accuracy scores without assessing whether the gains are statistically genuine or practically meaningful. The field also faces a "replication problem," in which the absence of versioned data, code, and environment details makes many reported performance figures impossible to verify. This paper proposes a solution in the form of a Dual-Pillar Validation Framework. We introduce a methodology that combines rigorous statistical hypothesis testing (specifically bootstrap confidence intervals and the 5x2cv paired t-test) with practical evaluation audits (covering fairness, data drift, and latency). By implementing this framework within a reproducible MLOps pipeline built on Docker, DVC, and MLflow, we establish an automated "guardian" mechanism. This ensures that deployed models are not performing well by coincidence, but are statistically sound, ethically fair, and efficient in practice.
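The two statistical procedures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic dataset, the two candidate models, and the bootstrap replicate count are stand-in assumptions. The first part estimates a 95% bootstrap confidence interval for test accuracy; the second applies the 5x2cv paired t-test (Dietterich, 1998) to compare two models.

```python
# Hedged sketch of the framework's statistical pillar.
# Dataset and models are illustrative stand-ins, not the paper's setup.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)

# --- Bootstrap 95% CI for held-out accuracy -------------------------
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
correct = (pred == y_te).astype(float)          # per-sample 0/1 hits
boot_means = [rng.choice(correct, size=correct.size, replace=True).mean()
              for _ in range(2000)]             # resample the test set
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])

# --- 5x2cv paired t-test between two candidate models --------------
# Five repetitions of 2-fold CV; record the per-fold accuracy difference.
diffs = np.zeros((5, 2))
for i in range(5):
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    for j, (tr, te) in enumerate(cv.split(X, y)):
        acc_a = LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).score(X[te], y[te])
        acc_b = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr]).score(X[te], y[te])
        diffs[i, j] = acc_a - acc_b

# Per-repetition variance estimate s_i^2, then Dietterich's statistic:
# t = p_1^(1) / sqrt(mean(s_i^2)), with 5 degrees of freedom.
s2 = ((diffs - diffs.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
t_stat = diffs[0, 0] / np.sqrt(s2.mean())
p_value = 2 * stats.t.sf(abs(t_stat), df=5)
```

In a pipeline such as the one the paper describes, these two outputs would act as automated gates: a model is promoted only if its CI lower bound clears a practical threshold and the improvement over the incumbent is significant under the 5x2cv test.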
How to Cite this Paper
Singh, R. R., Singh, A., & Mahadi, K. (2026). A Framework for Rigorous and Replicable ML Model Assessment: Integrating Statistical Significance and Practical Evaluation. International Journal of Creative and Open Research in Engineering and Management, <i>02</i>(04). https://doi.org/10.55041/ijcope.v2i4.234
Singh, Rakshit, et al. "A Framework for Rigorous and Replicable ML Model Assessment: Integrating Statistical Significance and Practical Evaluation." International Journal of Creative and Open Research in Engineering and Management, vol. 02, no. 04, 2026. doi:10.55041/ijcope.v2i4.234.
Singh, Rakshit, Avani Singh, and Kazim Mahadi. "A Framework for Rigorous and Replicable ML Model Assessment: Integrating Statistical Significance and Practical Evaluation." International Journal of Creative and Open Research in Engineering and Management 02, no. 04 (2026). https://doi.org/10.55041/ijcope.v2i4.234.
Ethical Compliance & Review Process
- All submissions are screened for plagiarism.
- Review follows the journal's editorial policy.
- Authors retain copyright.
- Peer Review Type: Double-Blind Peer Review
- Published on: Apr 23, 2026
This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt this work for non-commercial purposes with proper attribution.

