A data analysis and visualization platform based on machine learning


  • Overview

  •     Step 1: Data Input

  •     Step 2: Method Selection

  •     Step 3: Model Generation

  •     Step 4: Feature Browse

  •     Step 5: Model Evaluation

  •     Step 6: Feature Filtration

  •     Step 7: Model Application



  • The machine learning analysis function of the database can perform binary and multi-class classification analysis based on 15 algorithms, and survival analysis based on 11 algorithms.

  • The functional modules include "Model Generation", "Feature Browse", "Model Evaluation" and "Model Application", as illustrated in the figure on the left, and the specific steps are shown in the navigation bar. (The navigation bar only displays the current step; it cannot be used to turn pages.)

  • Click the "Next Step" and "Last Step" buttons at the bottom to turn pages; click the other buttons to run the corresponding analyses.

  • Each step and module offers multiple visualization approaches and free download services.

  • The "Model Application" module is an optional function. When drawing a nomogram, there is no need to upload the prediction set.

  • Users can upload the Training Set and Validation Set separately, or upload a comprehensive dataset in "Step 1: Data Input" and set a proportion for random division into a Training Set and a Validation Set.


Basic Parameters



*Analysis Type:

*Generation method for the validation set:

*Missing Value Treatment:



Data Set



*Upload Matrix File:

*Upload Group Information File:   


*Proportion of Division:   training : validation =  7 : 3    


  • After clicking the "Submit" button, the randomly divided dataset can be downloaded so that the results can be reproduced when submitting again.


Training Set



*Upload Matrix File:

*Upload Group Information File:   




Validation Set



*Upload Matrix File:

*Upload Group Information File:   




  • Please make sure the input file format is correct; an incorrect format can cause errors that produce no results.

  • The contents of the files should be tab-separated. If the IDs contain special characters, those characters will be automatically replaced with underscores.

  • In the uploaded matrix file, each row should be a feature and each column should be a sample (see the sketch after this list). Feature IDs and sample IDs should not be duplicated; otherwise, there may be no results. All values in the matrix should be numerical.

  • For binary and multi-class classification analysis, the group information file should have two columns with the headers "sample" and "condition". For binary classification analysis, the value of "condition" should be "case" or "control".

  • For survival analysis, the group information file should have three columns with the headers "sample", "time" and "status". The "time" column should be in days, and the value of the "status" column should be "0" or "1".

  • Too many features and samples will slow the run, and some ML algorithms are time-consuming; please be patient.

  • If some ML algorithms do not produce results in "Model Generation" or "Model Evaluation", users are advised to adjust the parameters and try again.

  • Please note that these algorithms may not be able to model all data structures successfully. If the uploaded data are formatted correctly, focus on the algorithms that model the data successfully.
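
For reference, the expected layouts are sketched below with hypothetical IDs and values (columns are aligned with spaces here for readability; real files must be tab-separated):

    # Matrix file: rows are features, columns are samples
    ID        S1      S2      S3
    geneA     1.23    0.87    2.41
    geneB     0.05    0.11    0.02

    # Group information file (binary / multi-class classification)
    sample    condition
    S1        case
    S2        control
    S3        case

    # Group information file (survival analysis; time in days)
    sample    time    status
    S1        365     1
    S2        1200    0
    S3        87      1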





   

Algorithm and Parameter Selection   Select All    


  XGBoost

Model:

Eta: 0.

Max depth: 

Objective:
   binary:logistic
   multi:softmax
   count:poisson

The number of decision trees to display:

Show the top features

Extreme Gradient Boosting, an efficient implementation of the gradient boosting framework from Chen & Guestrin (2016). XGBoost includes an efficient linear model solver and tree learning algorithms. The package can automatically run parallel computation on a single machine, which can be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification and ranking. The package is designed to be extensible, so users can also easily define their own objectives.
Detail.
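
As a rough illustration of how the parameters above (Eta, Max depth, Objective) map onto the open-source xgboost library, here is a minimal Python sketch; the toy data are hypothetical and this is not the platform's own code:

    import numpy as np
    import xgboost as xgb

    # Hypothetical toy data: 100 samples x 20 features, binary labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = rng.integers(0, 2, size=100)

    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "eta": 0.3,                      # "Eta" (learning rate)
        "max_depth": 6,                  # "Max depth"
        "objective": "binary:logistic",  # or "multi:softmax", "count:poisson"
    }
    booster = xgb.train(params, dtrain, num_boost_round=50)
    pred = booster.predict(dtrain)       # predicted probabilities of the positive class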


  LightGBM

Model:

Learning Rate: 0.

Objective: 

Show the top features

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel, distributed, and GPU learning.
  • Capable of handling large-scale data.
  • For further details, please refer to Features.
Benefiting from these advantages, LightGBM is widely used in many winning solutions of machine learning competitions.
Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks in both efficiency and accuracy, with significantly lower memory consumption. Moreover, distributed learning experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.
Detail.
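
The same parameters can be exercised directly through the open-source lightgbm library; a minimal Python sketch with hypothetical toy data (not the platform's own code):

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = rng.integers(0, 2, size=100)

    train_set = lgb.Dataset(X, label=y)
    params = {
        "learning_rate": 0.1,   # "Learning Rate"
        "objective": "binary",  # "Objective"; e.g. "multiclass" for multi-class
        "verbosity": -1,
    }
    booster = lgb.train(params, train_set, num_boost_round=50)
    pred = booster.predict(X)   # predicted probability of the positive class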


  GBM

Model:

 bernoulli
 gaussian

Number of trees:

Learning rate (0.001~0.1):

Show the top features


  • A dataset that is too small may produce no results.
  • A smaller learning rate typically requires more trees.
GBM (Gradient Boosting Machine) is a boosting algorithm. The main idea is that multiple weak learners are generated sequentially, and each weak learner is trained to fit the negative gradient of the loss function of the previously accumulated model, so that after the weak learner is added, the cumulative model's loss decreases along the negative gradient. The base learners are combined linearly with different weights, so that the excellent learners can be reused. The most common base learner is the tree model.
Detail.
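
To make the two knobs above concrete, here is a minimal sketch using scikit-learn's GradientBoostingClassifier as a stand-in for the platform's GBM backend (an assumption; the toy data are hypothetical):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # "Number of trees" -> n_estimators; "Learning rate" -> learning_rate.
    # As noted above, a smaller learning rate typically needs more trees.
    gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01,
                                     random_state=0)
    gbm.fit(X, y)
    importances = gbm.feature_importances_   # basis for a top-feature ranking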


  Random Forest

Number of Trees:  

Random forest (RF) is an ensemble classifier consisting of many decision trees (DTs), much as a forest is a collection of many trees. DTs that are grown very deep often overfit the training data, resulting in high variation in the classification outcome for a small change in the input data. They are very sensitive to their training data, which makes them error-prone on the test dataset. The different DTs of an RF are trained on different parts of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT then considers a different part of that input vector and gives a classification outcome. The forest then chooses the class having the most 'votes' (for a discrete classification outcome) or the average over all trees in the forest (for a numeric outcome). Because the RF algorithm considers the outcomes from many different DTs, it can reduce the variance that results from considering a single DT on the same dataset.
Detail.
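
A minimal sketch of the voting behaviour described above, using scikit-learn's RandomForestClassifier as a stand-in (hypothetical toy data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # "Number of Trees" -> n_estimators; each tree sees a bootstrap sample
    # and a random subset of features, and the forest takes the majority vote.
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)
    votes = rf.predict(X[:3])               # majority-vote class labels
    importances = rf.feature_importances_   # for the top-feature ranking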


  CatBoost

Model:

Number of iterations:

Loss function:

Show the top features

CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression and other ML tasks.
Detail.
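
A minimal Python sketch showing how the two parameters above appear in the open-source catboost package (hypothetical toy data; not the platform's own code):

    from catboost import CatBoostClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # "Number of iterations" -> iterations; "Loss function" -> loss_function.
    model = CatBoostClassifier(iterations=200, loss_function="Logloss",
                               verbose=False)
    model.fit(X, y)
    importances = model.get_feature_importance()   # for the top-feature plot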


  AdaBoost

Number of iterations:

Show the top features

Implements Freund and Schapire's AdaBoost.M1 algorithm and Breiman's Bagging algorithm, using classification trees as individual classifiers. Once these classifiers have been trained, they can be used to predict on new data. Cross-validation estimation of the error can also be performed.
Detail.
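
As a rough stand-in for the AdaBoost.M1 procedure described above, here is a minimal sketch with scikit-learn's AdaBoostClassifier (an assumption: scikit-learn's boosting variant, not necessarily the platform's backend; toy data hypothetical):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # "Number of iterations" -> n_estimators: weak tree classifiers are
    # trained sequentially, each one up-weighting the samples the previous
    # ones misclassified.
    ada = AdaBoostClassifier(n_estimators=100, random_state=0)
    ada.fit(X, y)
    pred = ada.predict(X)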


  Decision Tree

Show the top features

A decision tree models decision logic, i.e., tests and corresponding outcomes, for classifying data items into a tree-like structure. The nodes of a DT normally have multiple levels, where the first or top-most node is called the root node. All internal nodes (i.e., nodes having at least one child) represent tests on input variables or attributes. Depending on the test outcome, the classification algorithm branches towards the appropriate child node, where the process of testing and branching repeats until it reaches a leaf node. The leaf or terminal nodes correspond to the decision outcomes. When traversing the tree to classify a sample, the outcomes of all tests at each node along the path provide sufficient information to infer its class.
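
The root-to-leaf test-and-branch process can be inspected directly; a minimal sketch with scikit-learn's DecisionTreeClassifier (hypothetical toy data):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    # Each internal node tests one input variable; following the branches
    # from the root down to a leaf yields the predicted class.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)
    print(export_text(tree))   # textual view of the tests at each node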


  Lasso

Model:

 binomial
 gaussian
 poisson

Cross Validation:

 fold

Show the top features

Compared with the quadratic (L2) penalty of ridge regression, the lasso's L1 penalty can not only shrink the coefficients βj of uninformative predictors exactly to 0, but also retain the valuable predictors (those with large |βj|). This is because, compared with the quadratic penalty of ridge regression, the L1 penalty shrinks large coefficients βj to a lesser degree, so the lasso can select a more accurate model.
Detail.
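
A minimal sketch of L1-penalised selection with cross-validation, using scikit-learn's LogisticRegressionCV for the binomial case as a stand-in for the platform's backend (an assumption; toy data hypothetical):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV

    X, y = make_classification(n_samples=200, n_features=50, random_state=0)

    # L1 (lasso) penalty with 10-fold CV over the penalty strength;
    # coefficients of uninformative features are shrunk exactly to zero.
    lasso = LogisticRegressionCV(penalty="l1", solver="saga", cv=10,
                                 max_iter=5000, random_state=0)
    lasso.fit(X, y)
    selected = np.flatnonzero(lasso.coef_[0])   # indices of retained features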


  Elastic Network

Model:

 binomial
 gaussian
 poisson

Cross Validation:

 fold

Alpha:

0.

Show the top features

The penalty function of the elastic net is a convex linear combination of the ridge regression penalty and the lasso penalty. When α=0, elastic net regression reduces to ridge regression; when α=1, it reduces to lasso regression. Elastic net regression therefore combines the advantages of lasso and ridge regression: it can perform variable selection while also exhibiting a good grouping effect.
Detail.
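
The role of α is easiest to see in code; a minimal sketch with scikit-learn's LogisticRegressionCV as a stand-in (an assumption; toy data hypothetical). Setting l1_ratio to 0 or 1 recovers the Ridge and Lasso cards, respectively:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV

    X, y = make_classification(n_samples=200, n_features=50, random_state=0)

    # l1_ratio plays the role of alpha above: 0 -> ridge, 1 -> lasso,
    # values in between -> elastic net.
    enet = LogisticRegressionCV(penalty="elasticnet", solver="saga",
                                l1_ratios=[0.5], cv=10, max_iter=5000,
                                random_state=0)
    enet.fit(X, y)
    selected = np.flatnonzero(enet.coef_[0])   # features with non-zero weight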


  Ridge

Model:

 binomial
 gaussian
 poisson

Cross Validation:

 fold

Show the top features

Ridge regression is a biased estimation regression method suited to the analysis of collinear data; in essence, it is an improved least-squares estimator. By giving up the unbiasedness of ordinary least squares, ridge regression obtains more realistic and reliable regression coefficients at the cost of losing some information and accuracy, and it fits ill-conditioned data better than the least-squares method.
Detail.


  PLS

Show the top features

PLS (partial least squares) regression uses the principles of principal component analysis to condense multiple X variables and multiple Y variables into components (X corresponds to the component U, Y to the component V). With the help of canonical correlation, the relationships between X and U and between Y and V can be analysed; combined with multiple linear regression, the relationship between X and V can then be analysed, so as to study the relationship between X and Y.
Detail.
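
A minimal sketch of the component-condensation idea, using scikit-learn's PLSRegression as a stand-in (toy data hypothetical):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = rng.integers(0, 2, size=100).astype(float)

    # Condense X (and y) into a few latent components chosen to maximise
    # their covariance, then regress y on those components.
    pls = PLSRegression(n_components=2)
    pls.fit(X, y)
    scores = pls.transform(X)        # sample coordinates on the components
    y_hat = pls.predict(X).ravel()   # predictions from the latent space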


  GLM

Model:

 binomial    (Logistic Regression)
 gaussian   (Linear Regression)
 poisson     (Poisson Regression)

Show the top features

In the GLM (generalized linear model) module, logistic regression is used for binary classification analysis, and linear regression is used for multi-class analysis.
Detail.
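
The three "Model" options map onto three standard GLM families; a minimal scikit-learn sketch of the correspondence (a stand-in for the platform's backend; toy data hypothetical):

    import numpy as np
    from sklearn.linear_model import (LinearRegression, LogisticRegression,
                                      PoissonRegressor)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y_binary = rng.integers(0, 2, size=100)

    logit = LogisticRegression()    # binomial -> logistic regression
    linear = LinearRegression()     # gaussian -> linear regression
    poisson = PoissonRegressor()    # poisson  -> Poisson regression

    logit.fit(X, y_binary)          # e.g. the binary-classification case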


  Neural Network

Model:

Hidden: 

Activation Function: 

Algorithm: 

Show the top features

Neural networks (NNs) attempt to use multiple layers of calculations to imitate how the human brain interprets and draws conclusions from information. NNs are essentially mathematical models designed to deal with complex and disparate information, and the nomenclature of this algorithm comes from its use of 'nodes' akin to synapses in the brain. The learning process of an NN can be either supervised or unsupervised. A neural net is said to learn in a supervised manner if the desired output is already targeted and introduced to the network via training data, whereas unsupervised NNs have no such pre-identified target outputs, and the goal is to group similar units close together in certain areas of the value range. The learning process used in MLBiomarker is supervised.
Detail.
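
A minimal supervised sketch mapping the three fields above onto scikit-learn's MLPClassifier (the parameter names are scikit-learn's, an assumption rather than the platform's exact options; toy data hypothetical):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # "Hidden" -> hidden_layer_sizes; "Activation Function" -> activation;
    # "Algorithm" -> solver. Supervised: the labels y are the target outputs.
    nn = MLPClassifier(hidden_layer_sizes=(10, 5), activation="relu",
                       solver="adam", max_iter=2000, random_state=0)
    nn.fit(X, y)
    pred = nn.predict(X)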


  SVM-RFE

Cross Validation:

K fold:       N fold: 

Show the top features

SVM (support vector machine) first maps each data item into an n-dimensional feature space, where n is the number of features. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors. The marginal distance for a class is the distance between the decision hyperplane and its nearest instance belonging to that class. More formally, each data point is first plotted as a point in an n-dimensional space (where n is the number of features), with the value of each feature being the value of a specific coordinate. To perform the classification, we then need to find the hyperplane that separates the two classes by the maximum margin.
Detail.
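
SVM-RFE combines the SVM described above with recursive feature elimination; a minimal sketch with scikit-learn's RFECV and a linear SVM as a stand-in (toy data hypothetical):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=30, random_state=0)

    # Repeatedly fit a linear SVM, drop the least important features, and
    # use k-fold cross-validation to decide how many features to keep.
    selector = RFECV(SVC(kernel="linear"), step=1, cv=5)
    selector.fit(X, y)
    kept = selector.support_   # boolean mask of the selected features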


  SuperPC

Show the top features

SuperPC performs prediction in the case of a censored survival outcome, or a regression outcome, using the "supervised principal component" approach (Bair et al., 2006).
Superpc is especially useful for high-dimensional data in which the number of features p dominates the number of samples n (the p >> n paradigm), as generated, for instance, by high-throughput technologies.
Detail.
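
To illustrate the supervised principal component idea for a regression outcome (the platform's survival version differs), here is a hand-rolled sketch; the screening threshold is fixed here, whereas superpc chooses it by cross-validation (toy data hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500))                  # p >> n, as in the text
    y = X[:, :5].sum(axis=1) + rng.normal(size=100)

    # 1) Score each feature by its univariate association with the outcome.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = scores > np.quantile(scores, 0.95)        # fixed threshold (sketch)

    # 2) Take principal components of the screened features and use the
    #    leading component as the predictor.
    pc1 = PCA(n_components=1).fit_transform(X[:, keep])
    model = LinearRegression().fit(pc1, y)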


  XGBoost Cox

Model:

Eta: 0.

Max depth: 

Objective:  survival:cox

The number of decision trees to display:

Show the top features

Extreme Gradient Boosting, an efficient implementation of the gradient boosting framework from Chen & Guestrin (2016). XGBoost includes an efficient linear model solver and tree learning algorithms. The package can automatically run parallel computation on a single machine, which can be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification and ranking. The package is designed to be extensible, so users can also easily define their own objectives.
Detail.
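
The survival:cox objective is available in the open-source xgboost library, which encodes right censoring through the sign of the label; a minimal sketch (toy data hypothetical):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    time = rng.uniform(30, 2000, size=100)    # survival time in days
    event = rng.integers(0, 2, size=100)      # 1 = event, 0 = censored

    # For survival:cox, positive labels are event times and negative labels
    # are censoring times.
    label = np.where(event == 1, time, -time)
    dtrain = xgb.DMatrix(X, label=label)
    params = {"eta": 0.1, "max_depth": 4, "objective": "survival:cox"}
    booster = xgb.train(params, dtrain, num_boost_round=50)
    risk = booster.predict(dtrain)            # hazard ratios; higher = riskier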


  GBM

Model:

Number of trees:

Learning rate (0.001~0.1):

Show the top features


  • A dataset that is too small may produce no results.
  • A smaller learning rate typically requires more trees.
GBM (Gradient Boosting Machine) is a boosting algorithm. The main idea is that multiple weak learners are generated sequentially, and each weak learner is trained to fit the negative gradient of the loss function of the previously accumulated model, so that after the weak learner is added, the cumulative model's loss decreases along the negative gradient. The base learners are combined linearly with different weights, so that the excellent learners can be reused. The most common base learner is the tree model.
Detail.


  Random Survival Forest

Number of Trees:  

Random survival forest (RSF) is a random forest method for analyzing right-censored survival data. It introduces new survival splitting rules for growing survival trees, and new missing-data algorithms for estimating missing data.
RSF also introduces a conservation-of-events principle for survival forests and uses it to define ensemble mortality, a simple and interpretable measure of mortality that can be used as a predicted outcome.
Detail.
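
A minimal sketch using scikit-survival's RandomSurvivalForest as a stand-in for the platform's backend (an assumption; toy data hypothetical):

    import numpy as np
    from sksurv.ensemble import RandomSurvivalForest
    from sksurv.util import Surv

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = Surv.from_arrays(event=rng.integers(0, 2, size=100).astype(bool),
                         time=rng.uniform(30, 2000, size=100))

    # "Number of Trees" -> n_estimators; each tree is grown on a bootstrap
    # sample using survival splitting rules.
    rsf = RandomSurvivalForest(n_estimators=500, random_state=0)
    rsf.fit(X, y)
    risk = rsf.predict(X)   # ensemble risk scores (mortality-like)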


  CoxBoost

Show the top features

CoxBoost provides routines for fitting Cox models by likelihood-based boosting, for a single endpoint or in the presence of competing risks.
Detail.


  StepCox

Model:

 bidirection
 forward
 backward

Show the top features

Stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. Three types of stepwise regression can be chosen, i.e. stepwise linear regression, stepwise logistic regression, and stepwise Cox regression.
Detail.
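
A hand-rolled sketch of the forward direction, selecting covariates greedily by AIC with the lifelines package (an illustration of the procedure, not the platform's implementation; toy data hypothetical):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
    df["time"] = rng.uniform(30, 2000, size=200)
    df["status"] = rng.integers(0, 2, size=200)

    # Forward stepwise: at each step, add the covariate that most improves
    # the AIC of the Cox model; stop when no addition helps.
    chosen, candidates, best_aic = [], list("abcde"), np.inf
    while candidates:
        trials = {v: CoxPHFitter()
                      .fit(df[chosen + [v, "time", "status"]], "time", "status")
                      .AIC_partial_
                  for v in candidates}
        v, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best_aic:
            break
        best_aic = aic
        chosen.append(v)
        candidates.remove(v)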


  Lasso Cox

Model:

 cox

Cross Validation:

 fold

Show the top features

Compared with the quadratic (L2) penalty of ridge regression, the lasso's L1 penalty can not only shrink the coefficients βj of uninformative predictors exactly to 0, but also retain the valuable predictors (those with large |βj|). This is because, compared with the quadratic penalty of ridge regression, the L1 penalty shrinks large coefficients βj to a lesser degree, so the lasso can select a more accurate model.
Detail.
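
A minimal sketch of penalised Cox regression with scikit-survival's CoxnetSurvivalAnalysis as a stand-in (an assumption; toy data hypothetical). Its l1_ratio parameter also covers the Elastic Network Cox and Ridge Cox cards below (1 gives lasso, intermediate values elastic net, values near 0 a ridge-like fit):

    import numpy as np
    from sksurv.linear_model import CoxnetSurvivalAnalysis
    from sksurv.util import Surv

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    y = Surv.from_arrays(event=rng.integers(0, 2, size=100).astype(bool),
                         time=rng.uniform(30, 2000, size=100))

    # l1_ratio = 1.0 -> lasso Cox; fits a whole path of penalty strengths.
    coxnet = CoxnetSurvivalAnalysis(l1_ratio=1.0)
    coxnet.fit(X, y)
    nonzero = np.flatnonzero(coxnet.coef_[:, -1])   # features kept at the
                                                    # weakest penalty on the path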


  Elastic Network Cox

Model:

 cox

Cross Validation:

 fold

Alpha:

0.

Show the top features

The penalty function of the elastic net is a convex linear combination of the ridge regression penalty and the lasso penalty. When α=0, elastic net regression reduces to ridge regression; when α=1, it reduces to lasso regression. Elastic net regression therefore combines the advantages of lasso and ridge regression: it can perform variable selection while also exhibiting a good grouping effect.
Detail.


  Ridge Cox

Model:

 cox

Cross Validation:

 fold

Show the top features

Ridge regression is a biased estimation regression method suited to the analysis of collinear data; in essence, it is an improved least-squares estimator. By giving up the unbiasedness of ordinary least squares, ridge regression obtains more realistic and reliable regression coefficients at the cost of losing some information and accuracy, and it fits ill-conditioned data better than the least-squares method.
Detail.


  plsRcox

Model:

Number of Components:  

Cross Validation:

 fold

Show the top features


  • Too high a number of components may produce no results.
plsRcox implements partial least squares regression and various regular, sparse, or kernel techniques for fitting Cox models in high-dimensional settings. The cross-validation criteria were studied in Bertrand's research.
Detail.


  SuperPC

Show the top features

SuperPC performs prediction in the case of a censored survival outcome, or a regression outcome, using the "supervised principal component" approach (Bair et al., 2006).
Superpc is especially useful for high-dimensional data in which the number of features p dominates the number of samples n (the p >> n paradigm), as generated, for instance, by high-throughput technologies.
Detail.


  Cox (Univariate / Multivariate)

Model:

(Univariate Cox) P value < 0.

(Multivariate Cox) P value < 0.

Show the top features

The main purpose of survival analysis is to study the relationship between covariates (independent variables) X and the observed survival function S(t,X). When S(t,X) is affected by covariates, the traditional approach would be regression analysis, i.e., modelling the influence of the covariates on S(t,X). However, because survival data contain censored observations, such problems are difficult to handle with ordinary regression analysis. An important part of survival analysis is therefore to explore the risk factors that affect survival time or survival rate; these factors act on the survival rate by affecting the risk of death at each time point, i.e., the hazard rate. The hazard function differs between populations and over time, and is usually expressed as the product of a baseline hazard function and a function of the corresponding covariates. In 1972, the British biostatistician D. Cox proposed a method for estimating the model parameters when the baseline hazard function is unknown. This model later became known as the Cox proportional hazards regression model, or Cox regression for short.
Detail.
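
A minimal sketch of the univariate-then-multivariate workflow implied by the two p-value fields above, using the lifelines package as a stand-in (an assumption; toy data hypothetical):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
    df["time"] = rng.uniform(30, 2000, size=200)
    df["status"] = rng.integers(0, 2, size=200)

    # Univariate Cox: screen each feature by its p value.
    keep = []
    for v in "abcde":
        fit = CoxPHFitter().fit(df[[v, "time", "status"]], "time", "status")
        if fit.summary.loc[v, "p"] < 0.05:
            keep.append(v)

    # Multivariate Cox on the screened features.
    if keep:
        multi = CoxPHFitter().fit(df[keep + ["time", "status"]], "time", "status")
        print(multi.summary[["coef", "p"]])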



   


  • Click a button in the right navigation bar to open the result window of the corresponding ML algorithm. The result window contains sub-pages including "Importance Scores" and other visualizations.
  • The numbers in the right navigation bar represent the number of top/all features in each model.
  • The top features are displayed on the "Importance Score" sub-page in the result window of each ML algorithm, and the number shown can be adjusted by the user.
  • For some algorithms, parameters must be selected in the result window based on the intermediate results; the final results are obtained after submission.
  • In the subsequent steps, "Feature Browse" will display the intersection and union of the top features.
  • In the "Model Evaluation" step, all features are used for an overall assessment of the model, and the top features are evaluated at the feature level.



  • XGBoost
  • LightGBM
  • GBM
  • Random Forest
  • CatBoost
  • AdaBoost
  • Decision Tree
  • Lasso
  • Elastic Network
  • Ridge
  • PLS
  • GLM
  • Neural Network
  • SVM-RFE
  • SuperPC
  • XGBoost Cox
  • GBM
  • Random Survival Forest
  • CoxBoost
  • StepCox
  • Lasso Cox
  • Elastic Network Cox
  • Ridge Cox
  • plsRcox
  • SuperPC
  • Cox