国际妇产科学杂志 (Journal of International Obstetrics and Gynecology), 2025, Vol. 52, Issue 4: 431-438. doi: 10.12280/gjfckx.20250162

• Gynecological Oncology Research: Original Article •

Machine Learning Diagnostic Model for Ovarian Malignancies Based on Laboratory Data

CHENG Xiao-ran, ZHAO Dan, NIU Cheng-zhi

  1. Department of Obstetrics and Gynecology (CHENG Xiao-ran), Department of Information (NIU Cheng-zhi), The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China; School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China (ZHAO Dan)
  • Received: 2025-02-24 Published: 2025-08-15 Online: 2025-09-08
  • Contact: NIU Cheng-zhi, E-mail: nczfkb@126.com

Abstract:

Objective: To evaluate the clinical utility of an artificial intelligence (AI) model built from multiple laboratory indicators for diagnosing ovarian malignancies. Methods: Laboratory data from 3 465 patients with malignant ovarian tumors and 6 987 patients with benign ovarian tumors at the First Affiliated Hospital of Zhengzhou University were analyzed retrospectively. Patients were randomly divided into a training set (n=7 317) and a testing set (n=3 135) at a ratio of 7∶3. Features were selected using the Boruta algorithm and Lasso regression. Four models (random forest, logistic regression, support vector machine, and gradient boosting decision tree) were constructed, and their performance was evaluated by area under the curve (AUC), accuracy, precision, recall, and F1-score. Results: The random forest model performed best in the testing set (AUC=0.924, accuracy=0.872, recall=0.749), outperforming the other three models and the commonly used single biomarkers [human epididymis protein 4 (HE4), carbohydrate antigen 125 (CA125), CA15-3, and D-dimer]. Feature importance analysis identified 15 indicators that contributed substantially to the model, suggesting their value in clinical diagnosis: HE4, CA15-3, CA19-9, CA125, CA724, alpha-fetoprotein, D-dimer, fibrinogen, albumin, lactate dehydrogenase, neutrophil percentage, absolute neutrophil count, lymphocyte percentage, absolute lymphocyte count, and platelet count. Conclusions: The random forest model based on laboratory data showed high diagnostic efficacy for ovarian malignancies and has promising clinical applicability. Future studies should incorporate multicenter data and multimodal information to improve the model's generalizability and clinical interpretability and to support its integration into clinical practice.

Key words: Ovarian neoplasms, Diagnosis, Artificial intelligence, Machine learning, Laboratory test indicators, Random forest, Feature selection
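
The abstract names five evaluation metrics (AUC, accuracy, precision, recall, and F1-score) without defining them. As an illustrative sketch only, and not the authors' code, the following Python shows how these quantities are conventionally computed for a binary benign/malignant classifier. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for this example; none of them come from the study.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating label 1 (malignant) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 computed from hard 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case scores higher than
    a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5; the abstract does not state one.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone. This is one reason a model can report a high AUC alongside a more modest recall, as in the results above.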