Abstract
Introduction. Timely cancer diagnosis substantially improves patient survival while reducing healthcare costs through fewer hospitalizations and a higher likelihood of remission. There remains an urgent need for practical, clinically interpretable screening tools that can reliably identify at-risk patients and enable early intervention.
Aim. To develop and externally validate machine learning models for predicting 18-month cancer risk using real-world clinical data.
Materials and Methods. The study analyzed anonymized electronic health records (EHR) of 1.3 million patients from 36 Russian regions. Candidate predictors included sex, age, monthly weight change rate, erythrocyte sedimentation rate, hemoglobin level, body mass index, and clinically significant comorbidities. The primary outcome was any cancer diagnosis coded C00-C96 under ICD-10, recorded for 177,384 patients. We conducted a comparative analysis of five machine learning approaches: Logistic Regression, LightGBM Classifier, Random Forest, Linear Discriminant Analysis, and Naïve Bayes. External validation was performed on two independent cohorts (n=29,681 and n=25,145), one geographically and one temporally distinct from the development data, to evaluate model generalizability across diverse populations.
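To make the model-comparison step concrete, the following is a minimal Python sketch of how five such classifiers can be benchmarked by cross-validated AUROC with scikit-learn and LightGBM. The synthetic data, class balance, and hyperparameters are illustrative assumptions, not the study's actual pipeline or settings.

# Hedged sketch: ranks five candidate classifiers by cross-validated AUROC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier

# Stand-in for the EHR-derived feature matrix (sex, age, weight-change rate,
# ESR, hemoglobin, BMI, comorbidities) and the binary C00-C96 outcome label;
# weights=[0.87] roughly mimics the ~13.6% outcome prevalence in the paper.
X, y = make_classification(n_samples=5000, n_features=7,
                           weights=[0.87], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LightGBM": LGBMClassifier(n_estimators=200),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
}

# AUROC is the study's headline metric, so score each model with 5-fold CV.
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUROC {auc.mean():.3f} +/- {auc.std():.3f}")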
Results. The LightGBM classifier achieved the best performance, with an AUROC of 0.807 (95% CI 0.798-0.815) on internal validation. The model maintained strong discrimination on both independent external validation sets: AUROC 0.794 (95% CI 0.786-0.800) on the geographically distinct cohort and 0.790 (95% CI 0.782-0.798) on the temporally distinct cohort.
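For readers reproducing such confidence intervals, the sketch below shows one common way to obtain a 95% CI for AUROC via a percentile bootstrap over the validation set. This resampling scheme is an assumption for illustration; the abstract does not specify the exact interval-estimation procedure used.

# Hedged sketch: percentile-bootstrap 95% CI for AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement.
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip degenerate resamples containing a single class
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), lo, hi

# Usage with synthetic labels and scores (hypothetical data):
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)
s = y * 0.8 + rng.normal(0, 0.6, 1000)  # scores loosely tracking labels
print(auroc_ci(y, s))  # point estimate, lower bound, upper bound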
Conclusion. Our machine learning approach, built on routinely available clinical, laboratory, and anamnestic features, proved both effective and practical. The model outperformed benchmarks reported in prior studies while demonstrating consistent performance across external validation cohorts.