Data lake-aware checkpointing: Enabling resilient large-scale model training through precise data consumption tracking

Nandamuri, Sravankumar (2025) Data lake-aware checkpointing: Enabling resilient large-scale model training through precise data consumption tracking. World Journal of Advanced Engineering Technology and Sciences, 15 (2). pp. 2091-2098. ISSN 2582-8266

WJAETS-2025-0686.pdf - Published Version (Article PDF)
Available under License Creative Commons Attribution Non-commercial Share Alike.


Abstract

Data Lake-Aware Checkpointing addresses a critical gap in large-scale model training resilience by treating data reader state as a first-class citizen in training checkpoints. Traditional frameworks save only model and optimizer states and neglect data reader progress, which leads to duplicated or missed data reads when training resumes against massive data lakes. This article proposes a system that tracks consumed Parquet files and row-group offsets across distributed worker fleets and records that information in the checkpoint itself, enabling precise recovery without data loss or duplication. The system integrates seamlessly with existing training pipelines and distributed storage systems, establishing a foundation for truly epoch-less, streaming-style training on vast data repositories. By treating data consumption state with the same importance as model parameters, it significantly enhances fault tolerance and training reliability for large language models at scale.
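The idea described in the abstract — recording which Parquet files and row-group offsets each worker has consumed, alongside model state — can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the names `ReaderState`, `save_checkpoint`, and `resume_plan` are hypothetical, and a real system would use a training framework's checkpoint API and a Parquet reader rather than plain JSON.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ReaderState:
    """Hypothetical per-worker reader state: files fully consumed,
    plus the row-group offset inside the file currently being read."""
    consumed_files: list = field(default_factory=list)
    current_file: str = ""
    current_row_group: int = 0

def save_checkpoint(path, model_state, reader_states):
    # Persist reader state next to model/optimizer state so that
    # resumption can skip already-consumed data exactly.
    payload = {
        "model": model_state,
        "readers": {rank: asdict(s) for rank, s in reader_states.items()},
    }
    with open(path, "w") as f:
        json.dump(payload, f)

def resume_plan(checkpoint, all_files):
    """For each worker, compute the files still to be read on resume
    and the number of row groups to skip in the in-progress file."""
    plans = {}
    for rank, state in checkpoint["readers"].items():
        done = set(state["consumed_files"])
        remaining = [f for f in all_files if f not in done]
        skip = (state["current_row_group"]
                if state["current_file"] in remaining else 0)
        plans[rank] = {"files": remaining, "skip_row_groups": skip}
    return plans
```

On resume, each worker reopens only its `remaining` files and seeks past `skip_row_groups` row groups in the partially read file, so no record is re-emitted and none is lost — the property the abstract refers to as recovery without data loss or duplication.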

Item Type: Article
Official URL: https://doi.org/10.30574/wjaets.2025.15.2.0686
Uncontrolled Keywords: Distributed Training; Fault Tolerance; Checkpointing; Data Lakes; Large Language Models
Depositing User: Editor Engineering Section
Date Deposited: 04 Aug 2025 16:39
Related URLs:
URI: https://eprint.scholarsrepository.com/id/eprint/4002