Generating high-quality and diverse synthetic datasets with large language models: A survey

Rajendran, Abinandaraj (2025) Generating high-quality and diverse synthetic datasets with large language models: A survey. World Journal of Advanced Engineering Technology and Sciences, 15 (2). pp. 1145-1149. ISSN 2582-8266

Abstract

Large Language Models (LLMs) are increasingly used to generate synthetic datasets that overcome challenges in real-world data collection, including privacy risks, imbalance, and scarcity. This paper surveys recent developments in LLM-based synthetic data generation, emphasizing techniques that improve diversity, task alignment, and reliability, which are crucial factors in high-stakes domains such as predictive maintenance. We categorize state-of-the-art approaches into four methodological pillars: prompt engineering, multi-step generation pipelines, quality control through data curation, and rigorous evaluation methods. Structured generation workflows and controlled prompting strategies significantly enhance output coherence and domain relevance, while self-correction mechanisms and diversity-aware metrics contribute to higher dataset fidelity. Despite this progress, open challenges persist, including bias propagation, limited generalization across tasks and modalities, and the need for robust ethical safeguards. We outline promising future directions for advancing synthetic data generation using LLMs, such as integrating external knowledge, expanding to multilingual and multimodal settings, and fostering human-AI collaboration.

Item Type:	Article
Official URL:	https://doi.org/10.30574/wjaets.2025.15.2.0652
Uncontrolled Keywords:	Synthetic Data Generation; Large Language Models; Predictive Maintenance; Anomaly Detection; Disk Failure Prediction; Cloud Storage Systems
Date Deposited:	04 Aug 2025 16:33
Related URLs:	https://journalwjaets.com/node/742 https://journalwjaets.com/sites/default/... https://doi.org/10.30574/wjaets.2025.15....
URI:	https://eprint.scholarsrepository.com/id/eprint/3694
