
Data mining in proteomic mass spectrometry


The application of data mining to proteomic data from mass spectrometry has gained much interest in recent years. Advances in proteomics and mass spectrometry have produced a considerable amount of data that cannot be easily visualized or interpreted. Mass spectral proteomic datasets are typically high dimensional but have small sample sizes. Consequently, advanced artificial intelligence and machine learning algorithms are increasingly being used for knowledge discovery from such datasets. The overall goal is to extract useful information that leads to the identification of protein biomarker candidates. Such biomarkers could potentially serve as tools for early detection, diagnosis, and prognosis of many diseases. This review focuses on current trends in mining mass spectral proteomic data. Special emphasis is placed on the critical steps involved in the analysis of surface-enhanced laser desorption/ionization mass spectrometry proteomic data. Examples are drawn from previously published studies, and relevant data mining terminology and techniques are explained.



Author information



Corresponding author

Correspondence to Saeed A. Jortani.


About this article

Cite this article

Thomas, A., Tourassi, G.D., Elmaghraby, A.S. et al. Data mining in proteomic mass spectrometry. Clin Proteom 2, 13–32 (2006).



  • Data Mining
  • Feature Selection
  • Linear Discriminant Analysis
  • Receiver Operating Characteristic
  • Receiver Operating Characteristic Curve