The role of AI/ML in improving system reliability of large-scale distributed systems

Sekar, Aravind (2025) The role of AI/ML in improving system reliability of large-scale distributed systems. World Journal of Advanced Research and Reviews, 26 (1). pp. 1007-1020. ISSN 2581-9615

WJARR-2025-1064.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial Share Alike.

Abstract

This article explores how artificial intelligence and machine learning (AI/ML) enhance the reliability of large-scale distributed systems. It examines how these technologies are changing reliability engineering through predictive capacity management, autonomous monitoring, advanced anomaly detection, and integrated security approaches. It shows that properly implemented AI/ML solutions significantly reduce incident frequency and resolution times while optimizing resource utilization and decreasing operational costs. The article presents a comprehensive theoretical framework for AI-enhanced reliability and analyzes real-world applications across multiple domains. It evaluates both technical implementations and their quantifiable business impacts, reporting reductions in operational cost and engineer toil in mature deployments. It also addresses critical challenges, including data quality constraints, model explainability issues, and the complexities of human-AI collaboration, and explores promising future directions in reinforcement learning, real-time inference, and self-improving frameworks. The article offers reliability engineers, system architects, and organizational leaders actionable insights for implementing AI/ML approaches that make distributed systems more resilient in increasingly complex technological environments.

Item Type: Article
Official URL: https://doi.org/10.30574/wjarr.2025.26.1.1064
Uncontrolled Keywords: AIOps; System Reliability; Distributed Systems; Predictive Remediation; Autonomous Recovery
Depositing User: Editor WJARR
Date Deposited: 22 Jul 2025 23:44
URI: https://eprint.scholarsrepository.com/id/eprint/1721