Enhancing Fault Tolerance in Distributed Systems Using Shared Checkpoint Replica Mechanisms

Ameer A. Mousa, Department of Network, College of Information Technology, Babylon University, Babil, Iraq
Mahdi S. Almhanna, Department of Network, College of Information Technology, Babylon University, Babil, Iraq

ORCID

Ameer A. Mousa: https://orcid.org/0009-0007-1537-6528

Mahdi S. Almhanna: https://orcid.org/0000-0002-3144-5358

Article Type

Original Study

Abstract

Fault tolerance is a critical requirement in distributed systems, because node and network failures can cause significant data loss, service disruption, and performance degradation. Traditional fault-tolerance methods often provide only partial solutions: they rely on separate recovery models for errors and for delayed responses, and they incur high resource costs. This paper designs and implements a more robust and efficient fault-tolerant architecture based on a shared checkpoint replica. In the proposed system, a master server and multiple slave servers collaboratively process client requests, share checkpoint replicas, and recover seamlessly when a failure occurs. To evaluate the approach, a distributed system that extracts text from images was developed as a testbed. The primary objectives are to minimize fault recovery time, reduce network resource consumption, and maintain request integrity in the presence of failures. Experimental results show that the shared checkpoint replica mechanism achieves a recovery time of 1.5 seconds, compared with 9 seconds for conventional checkpointing and 5 seconds for replica-based fault tolerance. It also delivers a throughput of 28.5 requests per second with an average latency of 3.5 milliseconds, meeting the research goal of faster and more reliable distributed processing.
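
The general idea of sharing checkpoint replicas among a master and its slaves can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the names Node, Cluster, save_checkpoint, pick_worker, and process are hypothetical, a "request" is reduced to a count of steps, and failure is simulated by a flag rather than detected over a network. The sketch only shows why a surviving server can resume a request without repeating completed work when every server holds a copy of the latest checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server that keeps a local replica of every checkpoint."""
    name: str
    alive: bool = True
    # Maps request_id -> number of steps already completed for that request.
    checkpoints: dict = field(default_factory=dict)


class Cluster:
    """Hypothetical master/slave group that shares checkpoint replicas."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]

    def pick_worker(self):
        # Assumption: the first live node in the list handles the request.
        return next(n for n in self.nodes if n.alive)

    def save_checkpoint(self, request_id, step):
        # Copy the checkpoint to every live node so any of them can resume.
        for node in self.nodes:
            if node.alive:
                node.checkpoints[request_id] = step

    def process(self, request_id, total_steps, fail_at=None):
        """Run a request; simulate one crash after `fail_at` completed steps."""
        worker = self.pick_worker()
        trace = []
        step = worker.checkpoints.get(request_id, 0)
        while step < total_steps:
            if step == fail_at:
                worker.alive = False          # simulated crash of the worker
                worker = self.pick_worker()   # fail over to a surviving node
                # Resume from the replica held by the new worker.
                step = worker.checkpoints.get(request_id, 0)
                fail_at = None                # crash only once
                continue
            step += 1
            trace.append((worker.name, step))
            self.save_checkpoint(request_id, step)
        return trace


if __name__ == "__main__":
    cluster = Cluster(["master", "slave1", "slave2"])
    for name, step in cluster.process("req-1", total_steps=4, fail_at=2):
        print(f"{name} completed step {step}")
```

In this toy run the first node fails after two steps and the next live node continues from step three, so no step is executed twice. A real system would additionally need failure detection, durable storage, and a policy for how often checkpoints are replicated, none of which are modeled here.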

Keywords

Fault tolerance, Checkpoint, Replica, Distributed system

How to Cite This Article

Mousa, Ameer A. and Almhanna, Mahdi S. (2025) "Enhancing Fault Tolerance in Distributed Systems Using Shared Checkpoint Replica Mechanisms," Journal of Intelligent Informatics, Networking, and Cybersecurity: Vol. 1: Iss. 2, Article 1.
Available at: https://doi.org/10.65445/3106-1192.1004