Abstract
Introduction. Early cancer detection improves survival and reduces healthcare costs through less intensive treatment, fewer hospitalizations, and higher chances of remission. There is a growing need for practical, interpretable screening tools that can effectively assess cancer risk and identify high-risk patients early enough for timely intervention.
Aim. To develop and externally validate machine learning models for predicting the probability of cancer development in patients based on real-world clinical data.
Materials and Methods. Depersonalized EHR data from 1.3 million patients were used for the study. Gender, age, rate of weight change, erythrocyte sedimentation rate, hemoglobin, body mass index, and history of clinically significant comorbidities were considered as potential predictors. The comorbidities were defined by a medical expert and encompassed 54 conditions with documented associations with cancer risk. The performance of Logistic Regression, LGBMClassifier, Random Forest, Linear Discriminant Analysis, and Naive Bayes classifier models was compared. External validation was performed on data of different regional origin.
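As an illustration of the model comparison described above, the following is a minimal sketch assuming a preprocessed feature matrix X and binary cancer labels y (hypothetical names); it does not reproduce the authors' actual preprocessing, hyperparameter tuning, or evaluation pipeline.

```python
# Sketch: compare the five classifier families named in the Methods,
# assuming X (features) and y (binary cancer labels) are already prepared.
from lightgbm import LGBMClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LGBMClassifier": LGBMClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=300),
    "LDA": LinearDiscriminantAnalysis(),
    "NaiveBayes": GaussianNB(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # predicted cancer probability
    print(f"{name}: AUROC = {roc_auc_score(y_test, proba):.3f}")
```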
Results. The LGBMClassifier demonstrated the best discrimination, with an AUROC of 0.807 (95% CI 0.798–0.815) on the test set, 0.794 (95% CI 0.786–0.800) on external data set №1, and 0.790 (95% CI 0.782–0.798) on external data set №2. The model also showed good calibration and clinical utility metrics. Meta-evaluation results confirmed its robustness.
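Confidence intervals of the kind reported above are commonly obtained with a percentile bootstrap over the test set. The sketch below illustrates this approach, assuming y_test and proba are NumPy arrays carried over from the comparison sketch above; it is not the authors' code.

```python
# Sketch: percentile bootstrap 95% CI for AUROC on a held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
scores = []
n = len(y_test)
for _ in range(1000):
    idx = rng.integers(0, n, n)              # resample indices with replacement
    if len(np.unique(y_test[idx])) < 2:      # skip resamples with one class only
        continue
    scores.append(roc_auc_score(y_test[idx], proba[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"AUROC = {roc_auc_score(y_test, proba):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```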
Conclusion. The performance metrics of the model were superior to those reported in previously published studies. External validation demonstrated the model's relative stability on novel, independent data, confirming its potential for use in clinical practice.
