Zaeifi, Mehdi and Yerra, Sindhuja Rao (2025) Data sources, quality and preprocessing challenges in cancer research: A comprehensive review. World Journal of Advanced Engineering Technology and Sciences, 15 (3). pp. 1866-1871. ISSN 2582-8266
WJAETS-2025-1120.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial Share Alike.
Abstract
According to the World Health Organization, cancer is a major global health issue responsible for close to one-sixth of all deaths worldwide. With multi-modal datasets such as genomic profiles, clinical reports, and imaging data now widely available, cancer biology and patient care have advanced greatly. Contemporary data science techniques, from simple descriptive and statistical analyses to advanced machine learning models, all depend heavily on the quality and reliability of such data. Many studies emphasize developing new models while paying minimal attention to the fundamental step of data preprocessing, yet problems such as missing values, measurement error, heterogeneity, and privacy concerns can dramatically degrade the validity of research findings if ignored. Therefore, improving data curation, standardizing preprocessing, and raising awareness are imperative for reproducible and influential cancer research. This paper seeks to (i) give an overview of some of the major publicly accessible cancer datasets, (ii) describe common data quality problems encountered in cancer research, and (iii) synthesize best practices and make recommendations for data preprocessing and management. The target audience comprises data scientists and researchers working in oncology, bioinformatics, and biomedical informatics. Maintaining high data quality is more than a technical task; it is an ethical responsibility. Inaccurate or poorly handled data can lead to misleading clinical decisions and ultimately affect patient care and outcomes.
Item Type: | Article |
---|---|
Official URL: | https://doi.org/10.30574/wjaets.2025.15.3.1120 |
Uncontrolled Keywords: | Cancer Informatics; Data Quality; Preprocessing; Big Data; Batch Effects; Missing Data |
Depositing User: | Editor Engineering Section |
Date Deposited: | 16 Aug 2025 13:17 |
URI: | https://eprint.scholarsrepository.com/id/eprint/4849 |