Comparison with the Original Studies
Binary Classification
- Reference: DOI:10.1038/s41591-023-02396-3
- The top half of the figure below shows the model evaluation results. Random Forest, XGBoost, LightGBM and AdaBoost perform best, with AUC values above 0.75 and prediction accuracy above 90%.
- Ridge, PLS, Lasso and some other methods reach a reasonable overall accuracy, but they misclassify case samples at a very high rate, so the models built with these methods are not reliable. Their AUC values are only about 0.5, which indicates no predictive ability.
- The bottom half of the figure below shows where the most important features from the original study rank in the MLome analysis results.
- In the study of Al-Zaiti et al., Random Forest also stood out among 10 machine learning methods and achieved the best performance. Moreover, features such as st80_III, st80_aVL, STT_PCAratio, st80_V2, and TpTe not only rank among the top ten in the results of Al-Zaiti et al., but also rank high in importance across the various methods in MLome.
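The gap between a high accuracy and an AUC near 0.5 usually arises when cases are rare. The sketch below is only an illustration with made-up numbers (5 cases among 100 samples, a hypothetical model that gives every sample the same low score); it is not the MLome implementation or the study data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced data: 5 cases (label 1) and 95 controls (label 0).
y_true = [1] * 5 + [0] * 95
# Hypothetical uninformative model: the same low score for every sample.
scores = [0.1] * 100
y_pred = [int(s >= 0.5) for s in scores]  # everything is called "control"

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
missed_cases = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
case_error_rate = missed_cases / sum(y_true)

print(f"accuracy={acc:.2f}  AUC={auc:.2f}  case error rate={case_error_rate:.2f}")
# prints: accuracy=0.95  AUC=0.50  case error rate=1.00
```

In this toy setting the model is right on 95% of the samples yet misses every case, which is why AUC and the per-class error rate are more informative than overall accuracy when the classes are unbalanced.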
Multiple Classification
- Reference: DOI:10.1186/gb-2014-15-2-r24
- In the evaluation results, all models performed well.
- Twelve methods have AUC values above 0.9, and even the lowest-scoring methods, Decision Tree and CatBoost, reach 0.89 and 0.85.
- The correlation R² of all models is also high. Even the lowest values, for Neural Network and Decision Tree, are 0.89 and 0.87, indicating that the predicted ages of the validation set samples agree closely with their real ages.
- CpG loci such as cg09809672, cg25809905, cg15379633, cg02228185, cg16386080, cg23124451 and cg15804973 not only rank among the top ten in the model reported by Carola et al., but also rank near the top by weight in multiple MLome models, which shows that the two analyses are consistent.
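As a rough illustration of the agreement measure mentioned above, the sketch below computes a squared Pearson correlation between real and predicted ages. Two assumptions apply: "correlation R square" is taken to mean the squared Pearson correlation (the source does not say whether it is this or the coefficient of determination), and the ages are made-up values, not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def r_squared(actual, predicted):
    """Squared Pearson correlation (assumed meaning of 'correlation R square')."""
    return pearson_r(actual, predicted) ** 2


# Hypothetical validation samples: real age and model-predicted age in years.
actual_age = [23, 35, 41, 52, 60, 74]
predicted_age = [27, 33, 45, 49, 63, 70]

r2 = r_squared(actual_age, predicted_age)
print(f"R^2 between real and predicted age: {r2:.3f}")
```

A value close to 1 means the predicted ages rise and fall with the real ages; it does not by itself rule out a constant offset, so a mean absolute error is often reported alongside it.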
Survival Analysis
- Reference: DOI:10.1158/1078-0432.CCR-16-0511 (Training set), DOI:10.1158/1078-0432.CCR-13-0209 (Validation set)
- The figure on the right shows the features derived from the analysis of Wang et al. The figure on the left shows the distribution of these genes in the MLome analysis.
- The features identified by the plsRcox, Ridge, SuperPC survival, and Random survival forest models were highly consistent with the results obtained by Wang et al. These genes also generally rank high by weight in the SuperPC survival results, showing that the two studies agree.
- We selected the 30 most important genes in each model as top features and obtained a combined set of 210 top-feature genes. Among them, KRT7, SFN, ITGA3, TPX2, LOXL2 and other genes were also identified as features of pancreatic cancer by Wang et al. In addition, although CDA, HK1, ADAMTS14 and other genes were not included in the results of Wang et al., they have also been reported as key genes related to pancreatic cancer.
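The top-feature selection described above can be sketched roughly as follows. This is a minimal illustration, not the MLome code: the function names, the use of absolute weight as the importance measure, the weights themselves, and the choice of k = 2 (instead of 30) are all assumptions made to keep the example small. Only the gene symbols come from the text.

```python
def top_features(weights, k):
    """Return the k genes with the largest absolute weight in one model."""
    ranked = sorted(weights, key=lambda gene: abs(weights[gene]), reverse=True)
    return ranked[:k]


def pooled_top_features(model_weights, k):
    """Pool the top-k genes of every model into one set of top features."""
    pooled = set()
    for weights in model_weights.values():
        pooled.update(top_features(weights, k))
    return pooled


# Hypothetical per-model gene weights (made-up numbers).
model_weights = {
    "Ridge": {"KRT7": 0.9, "CDA": -0.8, "SFN": 0.7, "HK1": 0.1},
    "plsRcox": {"ITGA3": 0.8, "KRT7": 0.6, "TPX2": 0.3, "ADAMTS14": -0.1},
}
# Genes named in the text as also reported by Wang et al.
reference_genes = {"KRT7", "SFN", "ITGA3", "TPX2", "LOXL2"}

pooled = pooled_top_features(model_weights, k=2)
shared = pooled & reference_genes       # top features also in the reference set
not_in_reference = pooled - reference_genes

print("pooled top features:", sorted(pooled))
print("shared with reference:", sorted(shared))
print("not in reference:", sorted(not_in_reference))
# prints:
# pooled top features: ['CDA', 'ITGA3', 'KRT7']
# shared with reference: ['ITGA3', 'KRT7']
# not in reference: ['CDA']
```

Because the pooled result is a set, a gene selected by several models is counted once; genes outside the reference set are the candidates to check against the literature.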