A comprehensive analysis of classification algorithms for cancer prediction from gene expression

Raehoon Jeong and Ameet Soni

With the advent of inexpensive microarray technology, biologists have become increasingly reliant on gene expression analysis for detecting disease states, including the diagnosis of cancerous tissue (Tan and Gilbert, 2003). Microarray data sets are highly susceptible to the curse of dimensionality, as most have orders of magnitude more gene measurements (i.e., features) than samples (i.e., instances). A classification algorithm must therefore be robust to noisy and redundant data. While random forests (Breiman, 2001) and SVMs (Vapnik, 1998) have proven to be popular methods for expression analysis (Diaz-Uriarte and De Andres, 2006; Statnikov et al., 2008), little work has been done to compare these methods with AdaBoost (Freund and Schapire, 1997), a popular ensemble learning algorithm, across a wide array of cancer prediction tasks.

Our initial analysis compared several classifiers, including $k$-nearest neighbors, linear discriminant analysis, and linear SVMs, on 24 microarray data sets (12 binary and 12 multi-class). Of these, AdaBoost, random forests, and SVMs produced the best results. Our analysis further shows that AdaBoost performs remarkably well on binary tasks, generally outperforming both SVMs and random forests. On multi-class problems, however, random forests and SVMs are indistinguishable from one another, and both generally outperform AdaBoost.

Our work differs from existing research in two important ways. First, it provides a comprehensive analysis of the AdaBoost algorithm across a wide variety of cancer data sets. Second, it shows that the choice of optimal algorithm depends heavily on whether the task is a binary or a multi-class prediction problem.

Literature cited:
[1] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[2] R. Diaz-Uriarte and S. A. De Andres. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
[3] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[4] A. Statnikov, L. Wang, and C. F. Aliferis. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9(1):319, 2008.
[5] A. C. Tan and D. Gilbert. Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics, 2:S75-83, 2003.
[6] V. N. Vapnik. Statistical Learning Theory, volume 1. Wiley New York, 1998.