Abstract
Introduction. Early cancer detection improves survival and reduces healthcare costs through less intensive treatment, fewer hospitalizations, and higher chances of remission. There is a growing need for practical, interpretable screening tools that can effectively assess cancer risk and identify high-risk patients early enough for timely intervention.
Aim. To develop and externally validate machine learning models for predicting the probability of cancer development in patients based on real-world clinical data.
Materials and Methods. Depersonalized EHR data from 1.3 million patients were used for the study. Gender, age, rate of weight change, erythrocyte sedimentation rate, hemoglobin, body mass index, and history of clinically significant comorbidities were considered as potential predictors. The comorbidities were defined by a medical expert and encompassed 54 conditions with documented associations with cancer risk. The performance of Logistic Regression, LGBMClassifier, Random Forest, Linear Discriminant Analysis, and Naive Bayes classifier models was compared. External validation was performed on data of different regional origin.
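As an illustration of the model comparison described above, the following is a minimal sketch assuming a preprocessed feature matrix X and binary cancer labels y (hypothetical names); it does not reproduce the authors' actual preprocessing, hyperparameter tuning, or evaluation pipeline.

```python
# Sketch: compare the five classifier families named in the Methods,
# assuming X (features) and y (binary cancer labels) are already prepared.
from lightgbm import LGBMClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LGBMClassifier": LGBMClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=300),
    "LDA": LinearDiscriminantAnalysis(),
    "NaiveBayes": GaussianNB(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # predicted cancer probability
    print(f"{name}: AUROC = {roc_auc_score(y_test, proba):.3f}")
```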
Results. The LGBMClassifier demonstrated the best discrimination, with an AUROC of 0.807 (95% CI 0.798–0.815) on the test set, 0.794 (95% CI 0.786–0.800) on external data set №1, and 0.790 (95% CI 0.782–0.798) on external data set №2. The model also showed good calibration and clinical utility metrics. Meta-evaluation results confirmed its robustness.
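Confidence intervals of the kind reported above are commonly obtained with a percentile bootstrap over the test set. The sketch below illustrates this approach, assuming y_test and proba are NumPy arrays carried over from the comparison sketch above; it is not the authors' code.

```python
# Sketch: percentile bootstrap 95% CI for AUROC on a held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
scores = []
n = len(y_test)
for _ in range(1000):
    idx = rng.integers(0, n, n)              # resample indices with replacement
    if len(np.unique(y_test[idx])) < 2:      # skip resamples with one class only
        continue
    scores.append(roc_auc_score(y_test[idx], proba[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"AUROC = {roc_auc_score(y_test, proba):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```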
Conclusion. The performance metrics of the model were superior to those reported in previously published studies. External validation demonstrated the model's relative stability on novel, independent data, confirming its potential for use in clinical practice.
