Article Type

Original Study

Abstract

Fault tolerance is a critical requirement in distributed systems, because node and network failures can cause data loss, service disruption, and performance degradation. Traditional fault-tolerance methods often provide only partial solutions: they rely on separate recovery models for errors and for delayed responses, and they incur high resource costs. This paper designs and implements a more robust and efficient fault-tolerant architecture based on the concept of a shared checkpoint replica. In the proposed system, a master server and multiple slave servers collaboratively process client requests, share checkpoint replicas, and recover seamlessly when a failure occurs. To evaluate the approach, a distributed system for extracting text from images was developed as a testbed. The primary objectives are to minimize fault recovery time, reduce network resource consumption, and maintain request integrity in the presence of failures. Experimental results show that the shared checkpoint replica mechanism achieves a recovery time of only 1.5 seconds, compared with 9 seconds for conventional checkpoint-based fault tolerance and 5 seconds for replica-based fault tolerance. It also delivers a throughput of 28.5 requests per second with an average latency of 3.5 milliseconds, meeting the research goal of faster and more reliable distributed processing.
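
The idea of a shared checkpoint replica can be illustrated with a minimal, single-process sketch. It is not the paper's implementation: the class names `Worker` and `Master`, the method names, and the simulated failure are all assumptions made for illustration, and uppercasing a string stands in for text extraction. The sketch only shows the general pattern in which progress on a request is copied to every live slave, so that a surviving slave can resume from the last shared checkpoint instead of restarting the request.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical slave server that keeps a replica of each shared checkpoint."""
    name: str
    alive: bool = True
    replicas: dict = field(default_factory=dict)  # request_id -> items completed


class Master:
    """Hypothetical master that assigns work and fails over using shared replicas."""

    def __init__(self, workers):
        self.workers = workers

    def share_checkpoint(self, request_id, done):
        # Copy the progress marker to every live worker so any of them can resume.
        for w in self.workers:
            if w.alive:
                w.replicas[request_id] = done

    def process(self, request_id, items, fail_after=None):
        """Process items in order; optionally crash the first worker
        once it has completed `fail_after` items (simulated failure)."""
        results = []
        worker = next(w for w in self.workers if w.alive)
        i = 0
        while i < len(items):
            if fail_after is not None and worker is self.workers[0] and i == fail_after:
                worker.alive = False  # simulated node failure
                worker = next(w for w in self.workers if w.alive)
                # Resume from the checkpoint replica held by the surviving worker.
                i = worker.replicas.get(request_id, 0)
                continue
            results.append((worker.name, items[i].upper()))  # stand-in for extraction
            i += 1
            self.share_checkpoint(request_id, i)
        return results
```

For example, `Master([Worker("s1"), Worker("s2")]).process("r1", ["a", "b", "c", "d"], fail_after=2)` lets `s1` finish two items, simulates its failure, and has `s2` continue from item three using its replica of the checkpoint. A real deployment would also need failure detection, network transport, and durable storage, none of which are modelled here.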

Keywords

Fault tolerance, Checkpoint, Replica, Distributed system

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
