- Original Article
- Open Access
Data mining in proteomic mass spectrometry
Clinical Proteomics, volume 2, pages 13–32 (2006)
The application of data mining to proteomic data from mass spectrometry has gained much interest in recent years. Advances in proteomics and mass spectrometry have produced a considerable amount of data that cannot be easily visualized or interpreted. Mass spectral proteomic datasets are typically high dimensional but have small sample sizes. Consequently, advanced artificial intelligence and machine learning algorithms are increasingly being used for knowledge discovery from such datasets. Their overall goal is to extract useful information that leads to the identification of protein biomarker candidates. Such biomarkers could potentially have diagnostic value as tools for early detection, diagnosis, and prognosis of many diseases. This review focuses on current trends in mining mass spectral proteomic data. Special emphasis is placed on the critical steps involved in the analysis of surface-enhanced laser desorption/ionization mass spectrometry proteomic data. Examples are drawn from previously published studies, and relevant data mining terminology and techniques are explained.