Quality Assessment of Tandem Mass Spectra by Using a Weighted K-Means

Ding, Jiarui; Shi, Jinhong; Wu, Fang-Xiang

doi:10.1007/s12014-009-9025-4

Article
Open access
Published: 12 March 2009

Clinical Proteomics volume 5, pages 15–22 (2009)

Abstract

Introduction

The tandem mass spectrometer is a powerful tool with which to generate peptide (tandem) mass spectrum data for the analysis of complex biological protein mixtures in genomic-related disease cell lines. However, the majority of experimental tandem mass spectra cannot be interpreted by any database search engine. One of the main reasons is that the majority of experimental spectra are of too poor a quality to be interpretable. Attempting to interpret these “un-interpretable” spectra is a waste of time. Therefore, it is worthwhile to determine the quality of mass spectra before any interpretation.

Objectives

This paper proposes an approach to classifying tandem spectra into two groups: one with high quality and one with poor quality.

Methods

The proposed approach has two steps. First, each spectrum is mapped to a feature vector which describes the quality of the spectrum. Then, a weighted K-means clustering method is applied in order to classify the tandem mass spectra.

Results and Conclusion

Computational experiments illustrate that one cluster contains the majority of the high-quality spectra, while the other contains the majority of the poor-quality spectra. This result indicates that if we just search the spectra in the high-quality cluster, we can save the time for searching the majority of poor-quality spectra while losing a minimal amount of high-quality spectra. The software created for this work is available upon request.

Introduction

One of the most important goals in early detection of genomic-related disease such as cancer or obesity is to identify and characterize the proteins and protein complexes present in related cell lines. High-performance liquid chromatography (HPLC) coupled with a tandem mass spectrometer provides an automated, high-throughput approach widely used to generate peptide (tandem) mass spectral data for the analysis of complex biological protein mixtures [1]. Most frequently, peptide identifications are made by comparing tandem mass spectra with a sequence database to find the significantly matching peptides in the database. Through the assignment of peptides to spectra, the original proteins present in the sample are inferred. Over the past decade, many automated database search engines have been developed for assigning peptides to tandem mass spectra, for example, SEQUEST [2], Mascot [3], and Sonar [4]. These search engines, as well as de novo sequencing methods [5, 6], have been successfully applied to peptide mass spectrum assignments in many proteomics projects. However, the majority of tandem mass spectra cannot be interpreted by these and other automatic methods, even after filtering poor-quality spectra using some simple filters such as “most intensive peak selection” criterion [2–4]. There are several reasons that the automatic methods fail to interpret the mass spectra. However, one of the main reasons is that these spectra are of quality too poor to be interpretable. In general, a tandem mass spectrum is considered to be of high quality if it is produced from peptides; otherwise, it is considered to be of poor quality. Hence, it is worthwhile to develop an automatic quality assessment algorithm to discriminate high-quality from poor-quality spectra before interpretation by any method.

In the past, several supervised machine learning algorithms have been proposed to assess the quality of tandem mass spectra. These require a labeled training dataset to train a classifier, and the trained classifier is then used to classify spectra as high-quality or poor-quality [7–11]. Ideally, the training set should be identified by some peptide identification algorithms and manually validated, i.e., the set should be correctly labeled with no, or very few, falsely labeled spectra. However, such sets are hard to obtain in most cases. Worse still, tandem mass spectrometers may produce different spectra even for the same peptide under different experimental conditions. Thus, the performance of classifiers can be improved by training a classifier for each experiment. Clustering algorithms, which do not need a labeled training set, may be alternative choices for the quality assessment of tandem mass spectra.

In this paper, we propose a clustering algorithm, the weighted K-means (WKM) method, to classify the experimental spectra into two clusters, one with high-quality and the other with poor-quality spectra, without using any prior information about the spectra dataset from search engines. The remainder of the paper is organized as follows. The “Feature Extraction” section studies the properties of theoretical spectra and introduces a means of mapping a spectrum to a feature vector. The “WKM” section introduces the WKM method. In the “Experiments and Results” section, one dataset is used to investigate the performance of the proposed method. The “Conclusions and Future Work” section concludes this study.

Feature Extraction

This section describes a means of mapping a tandem mass spectrum to a feature vector which describes the quality of the spectrum. To do this, the properties of theoretical spectra are discussed first.

Properties of Peptide Theoretical Spectra

Many algorithms such as SEQUEST [2], Mascot [3], and Sonar [4] have been used to assign experimental MS/MS spectra to peptides in a protein/peptide database. A key component of these algorithms is the score function, which evaluates the similarity between each experimental MS/MS spectrum and the predicted (theoretical) spectrum of a given peptide in the database. A peptide whose theoretical spectrum has the maximum similarity to the experimental spectrum is a likely candidate for the solution of the peptide identification problem. An experimental peptide mass spectrum is often expressed by a peak list, i.e., $S = \{ (x_i, h_i) \mid 1 \leqslant i \leqslant m \}$, where (x_i, h_i) denotes the fragment ion i with m/z value x_i and intensity h_i. Since ion intensities result from many unknown factors and are still difficult to utilize for spectral quality assessment, this study does not take the intensity values of ions into account after the original spectra are preprocessed by filtering out the noise peaks. Therefore, the peptide mass spectra in this study are reduced to a set of m/z values and are denoted by S_E.

On the other hand, the perfect MS/MS spectrum of a peptide is the theoretical spectrum. In practice, no mass spectrometers can produce perfect MS/MS spectra. However, investigating the peptide theoretical spectrum is extremely helpful for understanding the high-quality spectra which could potentially be assigned to a peptide. Let P be a peptide consisting of n amino acids a₁, a₂, …, a_n with respective mass m(a_i). The mass of peptide P is calculated by

$$m\left( P \right) = m\left( H \right) + m\left( {OH} \right) + \sum\limits_{i = 1}^n {m\left( {a_i } \right)} $$

(1)

where m(H) and m(OH) are the additional masses of the peptide’s N- and C-terminals. Hereafter, we will use m(X) to express the mass of a molecule or a group of atoms X.

In a tandem mass spectrometry experiment, a protein is fragmented into a series of peptide ions (sometimes also called precursor ions or parent ions) at the first stage. For ion trap spectrometers, the produced precursor ions are mostly doubly or triply charged [12]. In the second stage, a series of selected precursor ions are fragmented further into fragment ions. For a doubly charged precursor ion, most of its fragment ions are singly charged, whereas a triply charged precursor ion is likely to fragment at backbone bonds to form a series of singly charged and doubly charged fragment ions. Therefore, in this study, we consider both doubly charged and triply charged precursor ions, but only singly and doubly charged fragment ions.

When peptide P fragments at the backbone bond between the i-th and (i + 1)-th amino acids counting from the N-terminal, several types of ions could be produced, as shown in Fig. 1. The singly charged ion with N-terminal is denoted by $b_i^ + $, and its m/z value is computed by

$$m\left( {b_i^ + } \right) = m\left( H \right) + \sum\limits_{j = 1}^i {m\left( {a_j } \right)} $$

(2)

The doubly charged ion with N-terminal is denoted by $b_i^{ + + } $, and its m/z value is computed by

$$m\left( b_i^{++} \right) = \frac{m\left( b_i^+ \right) + m\left( H \right)}{2}$$

(3)

The singly charged ion with C-terminal is denoted by $y_{n - i}^ + $, and its m/z value is computed by

$$m\left( {y_{n - i}^ + } \right) = 2 \times m\left( H \right) + m\left( {OH} \right) + \sum\limits_{j = i + 1}^n {m\left( {a_j } \right)} $$

(4)

The doubly charged ion with C-terminal is denoted by $y_{n - i}^{ + + } $, and its m/z value is computed by

$$m\left( y_{n-i}^{++} \right) = \frac{m\left( y_{n-i}^+ \right) + m\left( H \right)}{2}$$

(5)

From Eqs. 1 through 5, the following complementary equations

$$m\left( P \right) + 2 \times m\left( H \right) = m\left( b_i^+ \right) + m\left( y_{n-i}^+ \right)$$

(6)

$$\frac{m\left( P \right)}{2} + 2 \times m\left( H \right) = m\left( b_i^{++} \right) + \frac{m\left( y_{n-i}^+ \right) + m\left( H \right)}{2}$$

(7a)

$$\frac{m\left( P \right)}{2} + 2 \times m\left( H \right) = \frac{m\left( b_i^+ \right) + m\left( H \right)}{2} + m\left( y_{n-i}^{++} \right)$$

(7b)

$$\frac{m\left( P \right)}{2} + 2 \times m\left( H \right) = m\left( b_i^{++} \right) + m\left( y_{n-i}^{++} \right)$$

(8)

hold for a theoretical peptide spectrum. Therefore, Eqs. 6 through 8 indicate that high-quality spectra should have more complementary pairs of m/z values than poor-quality spectra.
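The relations above can be checked numerically. The following is a minimal sketch, not the authors' software: it computes the fragment m/z values of Eqs. 2 through 5 for a short peptide and verifies that each N-terminal/C-terminal pair sums to m(P) + 2 × m(H) (both singly charged) or to m(P)/2 + 2 × m(H) (at least one ion expressed as doubly charged). The residue masses are standard monoisotopic values that are not given in the paper, and the function names and the example peptide are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for a few amino acids.
# These values are an assumption of this sketch, not taken from the paper.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
M_H = 1.00783    # m(H)
M_OH = 17.00274  # m(OH)


def peptide_mass(seq):
    """Eq. 1: m(P) = m(H) + m(OH) + sum of residue masses."""
    return M_H + M_OH + sum(RESIDUE_MASS[a] for a in seq)


def fragment_ions(seq, i):
    """Return m/z of b_i^+, b_i^++, y_{n-i}^+, y_{n-i}^++ for 1 <= i < n."""
    b1 = M_H + sum(RESIDUE_MASS[a] for a in seq[:i])             # Eq. 2
    y1 = 2 * M_H + M_OH + sum(RESIDUE_MASS[a] for a in seq[i:])  # Eq. 4
    return b1, (b1 + M_H) / 2, y1, (y1 + M_H) / 2                # Eqs. 3, 5


def check_complements(seq, tol=1e-9):
    """Return True if the complementary sums hold at every backbone bond."""
    mp = peptide_mass(seq)
    single = mp + 2 * M_H      # target for two singly charged ions
    double = mp / 2 + 2 * M_H  # target when ions are taken as doubly charged
    for i in range(1, len(seq)):
        b1, b2, y1, y2 = fragment_ions(seq, i)
        sums_ok = (
            abs(b1 + y1 - single) < tol
            and abs(b2 + (y1 + M_H) / 2 - double) < tol
            and abs((b1 + M_H) / 2 + y2 - double) < tol
            and abs(b2 + y2 - double) < tol
        )
        if not sums_ok:
            return False
    return True


print(check_complements("GASPVTLKER"))
```

The check passes for any sequence because the sums follow algebraically from Eqs. 1 through 5; on experimental spectra the same sums are only matched within a mass tolerance.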

According to the principle of peptide fragmentation in tandem mass spectrometry [13], these ions could lose a molecule of water or ammonia. Therefore, compared with poor-quality spectra, high-quality spectra should also have more pairs of m/z values that differ by the mass of a water or an ammonia molecule for singly charged ions, or by half of that mass for doubly charged ions. In addition, the N-terminal ions could lose a CO group, while the C-terminal ions could lose an NH group. Therefore, high-quality spectra could have more pairs of m/z values that differ by a CO mass or an NH mass for singly charged ions, or by half of that mass for doubly charged ions, than poor-quality spectra.

In addition, for a theoretical spectrum, the difference between two consecutive singly charged N-terminal (or C-terminal) ions is the mass of one of the 20 amino acids. The difference between two consecutive doubly charged N-terminal (or C-terminal) ions is half the mass of one of the 20 amino acids. Therefore, high-quality spectra should also have more pairs of m/z values that differ by an amino acid mass for singly charged ions, or by half an amino acid mass for doubly charged ions, than poor-quality spectra.

Features of Peptide Mass Spectra

According to the properties of the theoretical spectra, we introduce 12 discriminatory features to describe the quality of peptide mass spectra. These features may be classified into four categories: amino acid distances, complements, water or ammonia losses, and supportive ions. To do this, we first define four variables for a given peptide mass spectrum S_E. For a peak in S_E with m/z value x, this peak is also denoted by x for simplicity. In the following, x and y are the m/z values of peaks x and y, respectively.

$${\text{dif}}1\left( {x,y} \right) = x - y{\text{,}}\quad {\text{ }}x,y \in S_E {\text{ }}$$

(9)

$${\text{dif}}2\left( x,y \right) = x - \frac{y + 1}{2},\quad x,y \in S_E$$

(10)

$${\text{sum}}1\left( {x,y} \right) = x + y{\text{,}}\quad x,y \in S_E $$

(11)

$${\text{sum}}2\left( x,y \right) = x + \frac{y + 1}{2},\quad x,y \in S_E$$

(12)

1. Amino acid distances: These features measure how likely two components in a peptide mass spectrum $S_E$ are to differ by the mass of one of the 20 amino acids. Let
$${\text{DIF}}_1 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_i,\ i = 1, \cdots ,17 \right\}$$
$${\text{DIF}}_2 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_i / 2,\ i = 1, \cdots ,17 \right\}$$
$${\text{DIF}}_3 = \left\{ \left( x,y \right) \mid {\text{dif}}2\left( x,y \right) \approx M_i / 2,\ i = 1, \cdots ,17 \right\}$$

where M₁,⋯, M₁₇ stand for the 17 distinct mass values used for the 20 amino acids. In this study, we consider all methionine residues to be sulfoxidized and do not distinguish the masses within three pairs of amino acids: isoleucine vs. leucine, glutamine vs. lysine, and sulfoxidized methionine vs. phenylalanine. This is because the masses within each of these three pairs are very close. The comparison implied by ≈ uses a tolerance which is set to 0.5 Thomson in this study, but can be changed by the user. The set DIF₁ collects all pairs of singly charged ions in the spectrum S_E that differ by the mass of one amino acid. The set DIF₂ collects all pairs of doubly charged ions in the spectrum S_E that differ by one amino acid. The set DIF₃ collects all pairs consisting of one doubly charged and one singly charged ion that differ by one amino acid. Let

$$F_i = \left| {{\text{DIF}}_i } \right|,\quad i = 1,2,3$$

where $\left| \bullet \right|$ represents the cardinality of a set. If a tandem mass spectrum is produced from a well-fragmented peptide, one expects the values $F_i \;\left( {i = 1,2,3} \right)$ calculated from this spectrum to be much higher than those from a spectrum produced randomly.

2. Complements: These features measure how likely an N-terminal ion and a C-terminal ion in the peptide mass spectrum S_E are to be produced when the peptide fragments at the same peptide bond. Let
$${\text{SUM}}_1 = \left\{ \left( x,y \right) \mid {\text{sum}}1\left( x,y \right) \approx M_{\text{parent}} + 2 \times m\left( H \right) \right\}$$
$${\text{SUM}}_2 = \left\{ \left( x,y \right) \mid {\text{sum}}1\left( x,y \right) \approx M_{\text{parent}} / 2 + 2 \times m\left( H \right) \right\}$$
$${\text{SUM}}_3 = \left\{ \left( x,y \right) \mid {\text{sum}}2\left( x,y \right) \approx M_{\text{parent}} / 2 + 2 \times m\left( H \right) \right\}$$

where M_parent is the mass of the precursor ion of spectrum S_E. The set SUM₁ collects the complementary pairs of singly charged ions. The set SUM₂ collects the complementary pairs of doubly charged ions. The set SUM₃ collects the complementary pairs of one doubly charged ion and the other singly charged ion. For the same reason given for the first three features, we define another three features as the cardinalities of these three sets, i.e.,

$$F_{3 + i} = \left| {{\text{SUM}}_i } \right|{\text{,}}\quad i = 1,2,3$$

3. Water or ammonia losses: These features measure how likely one ion in a peptide mass spectrum S_E is to be produced by the loss of a water or an ammonia molecule from another ion. Let
$${\text{WAD}}_1 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_{\text{water}} \text{ or } M_{\text{ammonia}} \right\}$$
$${\text{WAD}}_2 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_{\text{water}} / 2 \text{ or } M_{\text{ammonia}} / 2 \right\}$$
$${\text{WAD}}_3 = \left\{ \left( x,y \right) \mid {\text{dif}}2\left( x,y \right) \approx M_{\text{water}} / 2 \text{ or } M_{\text{ammonia}} / 2 \right\}$$

where M_water and M_ammonia are the mass of a water molecule and an ammonia molecule, respectively. The set WAD₁ collects the pairs of singly charged ions with a difference of a water or an ammonia molecule. The set WAD₂ collects the pairs of doubly charged ions with a difference of a water or an ammonia molecule. The set WAD₃ collects the pairs of one doubly charged ion and the other singly charged ion with a difference of a water or an ammonia molecule. Similarly, we define the next three features as the cardinalities of these three sets, i.e.,

$$F_{6 + i} = \left| {{\text{WAD}}_i } \right|{\text{,}}\quad i = 1,2,3$$

One can treat the water losses and the ammonia losses as separate features, but the resulting feature vector will have more components. In a classification problem, more features do not necessarily mean a better classifier. The reverse is often true, as insignificant features could degrade the discriminatory power of the significant ones [14].

4. Supportive ions: These features measure how likely one ion in a peptide mass spectrum S_E is to be a supportive ion. In this paper, we consider two kinds of supportive ions: a-ions and z-ions. Although a-ions and x-ions are complementary if a peptide fragments at the specific bond shown in Fig. 1, the a-ions are often generated by losing a CO group from b-ions [13], but not by fragmenting at the specific bond. For the same reason, we take z-ions into account but not c-ions. Let
$${\text{AZD}}_1 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_{\text{CO}} \text{ or } M_{\text{NH}} \right\}$$
$${\text{AZD}}_2 = \left\{ \left( x,y \right) \mid {\text{dif}}1\left( x,y \right) \approx M_{\text{CO}} / 2 \text{ or } M_{\text{NH}} / 2 \right\}$$
$${\text{AZD}}_3 = \left\{ \left( x,y \right) \mid {\text{dif}}2\left( x,y \right) \approx M_{\text{CO}} / 2 \text{ or } M_{\text{NH}} / 2 \right\}$$

where M_CO and M_NH are the mass of a CO group and an NH group, respectively. The set AZD₁ collects the pairs of singly charged ions with a difference of a CO or an NH group. The set AZD₂ collects the pairs of doubly charged ions with a difference of a CO or an NH group. The set AZD₃ collects the pairs of one doubly charged ion and the other singly charged ion with a difference of a CO or an NH group. Finally, we define the next three features as the cardinalities of these three sets, i.e.,

$$F_{9 + i} = \left| {{\text{AZD}}_i } \right|{\text{,}}\quad i = 1,2,3$$

At this point, we have introduced 12 features with physical meaning to describe the quality of peptide spectra. The four features F_j (j = 1, 4, 7, 10) are evidence of the existence of singly charged ions and are called singly charged features. The other eight features are evidence of the existence of doubly charged ions. In principle, high-quality spectra are expected to have larger feature values than poor-quality spectra. However, the longer the peptide, the larger the feature values are. The classifier used for quality assessment may therefore have low sensitivity, as high-quality spectra produced from shorter peptides would have smaller feature values. To alleviate these effects, we normalize the feature values by the formula $F_i / \log \left( L_{\text{E}} \right)$, where L_E is the estimated peptide length of a peptide ion. L_E is computed by dividing the peptide ion mass by an average amino acid mass of 110 Da.
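The feature definitions above can be sketched in a few lines. The code below is an illustrative reading of the 12 features, not the authors' implementation. Several choices are assumptions of this sketch: the numerical mass constants (standard monoisotopic values, not given in the paper), the use of one representative mass for each merged amino acid pair, counting ordered pairs (x, y) of distinct peaks, halving both loss masses for the doubly charged sets, and the natural logarithm in the normalization. The names `count_pairs` and `quality_features` are hypothetical.

```python
import math

M_H = 1.00783                            # m(H)
M_WATER, M_AMMONIA = 18.01056, 17.02655  # neutral-loss masses
M_CO, M_NH = 27.99491, 15.01090          # CO and NH group masses
# 17 residue masses; one representative per merged pair (L for I/L,
# K for Q/K, F for sulfoxidized M/F) is an assumption of this sketch.
AA_MASSES = [
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.09496, 129.04259,
    137.05891, 147.06841, 156.10111, 163.06333, 186.07931,
]


def count_pairs(peaks, fn, targets, tol):
    """Count ordered pairs of distinct peaks whose fn value is within tol
    of any target mass."""
    n = 0
    for i, x in enumerate(peaks):
        for j, y in enumerate(peaks):
            if i != j and any(abs(fn(x, y) - t) <= tol for t in targets):
                n += 1
    return n


def quality_features(peaks, parent_mass, tol=0.5):
    """Return the 12 normalized features F_1..F_12 for a list of m/z values."""
    def dif1(x, y): return x - y
    def dif2(x, y): return x - (y + 1) / 2
    def sum1(x, y): return x + y
    def sum2(x, y): return x + (y + 1) / 2
    def half(masses): return [m / 2 for m in masses]

    losses = [M_WATER, M_AMMONIA]
    support = [M_CO, M_NH]
    comp_single = [parent_mass + 2 * M_H]
    comp_double = [parent_mass / 2 + 2 * M_H]
    specs = [
        (dif1, AA_MASSES), (dif1, half(AA_MASSES)), (dif2, half(AA_MASSES)),
        (sum1, comp_single), (sum1, comp_double), (sum2, comp_double),
        (dif1, losses), (dif1, half(losses)), (dif2, half(losses)),
        (dif1, support), (dif1, half(support)), (dif2, half(support)),
    ]
    raw = [count_pairs(peaks, fn, targets, tol) for fn, targets in specs]

    est_length = parent_mass / 110.0  # L_E
    if est_length <= 1.0:
        raise ValueError("parent mass too small for log normalization")
    return [f / math.log(est_length) for f in raw]
```

The double loop is quadratic in the number of peaks, which is acceptable for a sketch; sorting the peak list and using binary search would be a natural refinement for large spectra.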

WKM

Let D = {x_i, i = 1,⋯,n} be a dataset of n objects (spectra), each of dimensionality d. Let x_ij denote the jth feature of object x_i. X = (x_ij) is called the feature matrix of object set D. For a given partition Δ = (D_1,⋯,D_K) with K clusters, the cost function for a weighted K-means clustering technique [15] is defined by

$$J_{G} {\left( \Delta \right)} = {\sum\limits_{k = 1}^K {{\sum\limits_{x_{i} \in D_{k} } {{\left( {x_{i} - \overline{m} _{k} } \right)}G{\left( {x_{i} - \overline{m} _{k} } \right)}\prime } }} }$$

(13)

where $\overline m _k = \frac{1}{n_k}\sum\limits_{x_i \in D_k} x_i$ and n_k are the mean and the number of objects in D_k, respectively, and G is an arbitrary symmetric positive definite matrix whose determinant is 1, i.e., $\det \left( G \right) = 1$.

The objective of a weighted K-means algorithm is to find an optimal partition Δ* and a symmetric positive definite matrix G* with a determinant of 1 which together minimize J_G(Δ), i.e.,

$$J_{G*} \left( {\Delta *} \right) = \mathop {\min }\limits_\Delta \left\{ {J_{G*} \left( \Delta \right)} \right\}$$

(14)

This is a constrained optimization problem. Using a Lagrange multiplier, it can be proved that, for a given partition Δ with K clusters, the optimal G is

$$G = W^{ - 1} \left( {\det \left( W \right)} \right)^{{1 \mathord{\left/ {\vphantom {1 d}} \right. \kern-\nulldelimiterspace} d}} $$

(15)

where $W = \sum\limits_{k = 1}^K W_k$ and $W_k = \sum\limits_{x_i \in D_k} \left( x_i - \overline{m}_k \right)\prime \left( x_i - \overline{m}_k \right)$ is the within-group scatter matrix of cluster k (k = 1,⋯,K). Obviously, W is dependent on partition Δ. To avoid ambiguity, denote the W induced by Δ as W(Δ). Substituting Eq. 15 into Eq. 13 leads to $J\left( \Delta \right) = d\left( \det \left( W\left( \Delta \right) \right) \right)^{1/d}$. Since d is constant, the cost function of a weighted K-means algorithm can be reduced to

$$J\left( \Delta \right) = \left( {\det \left( {W\left( \Delta \right)} \right)} \right)^{{1 \mathord{\left/ {\vphantom {1 d}} \right. \kern-\nulldelimiterspace} d}} $$

(16)
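The step from Eq. 13 to Eq. 16 can be spelled out using the trace identity $yGy\prime = \operatorname{tr}\left( G\,y\prime y \right)$ for a row vector y. The following worked step is a sketch that assumes W(Δ) is invertible:

$$J_G\left( \Delta \right) = \sum\limits_{k = 1}^K \sum\limits_{x_i \in D_k} \operatorname{tr}\left( G\left( x_i - \overline{m}_k \right)\prime \left( x_i - \overline{m}_k \right) \right) = \operatorname{tr}\left( G\,W\left( \Delta \right) \right)$$

$$\operatorname{tr}\left( W^{-1}\left( \det W \right)^{1/d} W \right) = \left( \det W \right)^{1/d}\operatorname{tr}\left( I_d \right) = d\left( \det W\left( \Delta \right) \right)^{1/d}$$

The choice of G in Eq. 15 also satisfies the constraint, since $\det \left( W^{-1}\left( \det W \right)^{1/d} \right) = \left( \det W \right)^{-1}\det W = 1$. Because the factor d does not depend on the partition, it can be dropped from the cost function.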

The objective of a weighted K-means algorithm thus becomes finding an optimal partition Δ_o which minimizes

$$J\left( {\Delta _{\text{o}} } \right) = \mathop {\min }\limits_\Delta \left( {\det \left( {W\left( \Delta \right)} \right)} \right)^{{1 \mathord{\left/ {\vphantom {1 d}} \right. \kern-\nulldelimiterspace} d}} $$

(17)

Now consider how the cost function J changes when an object x currently in cluster D_i tentatively moves to a different cluster D_j. Let $\Delta = {\left( {D_{1} , \cdots ,D_{K} } \right)}{\text{,}}\,\Delta \prime = {\left( {D_{1} , \cdots ,D_{i} \backslash {\left\{ x \right\}}, \cdots D_{K} } \right)}$, and $\Delta \prime \prime = \left( {D_1 , \cdots ,D_i \backslash \left\{ x \right\}, \cdots ,D_j \cup \left\{ x \right\}, \cdots ,D_K } \right){\text{ }}\left( {i \ne j} \right)$. Obviously the condition for successfully moving x from D_i into D_j is

$$\det \left( {W\left( {\Delta \prime \prime } \right)} \right) <\det \left( {W\left( \Delta \right)} \right)$$

(18)

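The following two equations can be derived from the definitions, where m_i and $\overline{x}_i$ denote the number of objects in cluster D_i and the mean of D_i under partition Δ, and m_j and $\overline{x}_j$ are defined likewise for D_j: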

$$W{\left( \Delta \right)} = W{\left( {\Delta \prime } \right)} + \frac{{m_{i} }}{{m_{i} - 1}}{\left( {x - \overline{x} _{i} } \right)}\prime {\left( {x - \overline{x} _{i} } \right)}$$

(19)

$$W{\left( {\Delta \prime \prime } \right)} = W{\left( {\Delta \prime } \right)} + \frac{{m_{j} }}{{m_{j} + 1}}{\left( {x - \overline{x} _{j} } \right)}\prime {\left( {x - \overline{x} _{j} } \right)}$$

(20)

Condition 18 is reduced to

$$\frac{{m_{j} }}{{m_{j} + 1}}{\left( {x - \overline{x} _{j} } \right)}{\left[ {W{\left( {\Delta \prime } \right)}} \right]}^{{ - 1}} {\left( {x - \overline{x} _{j} } \right)}\prime < \frac{{m_{i} }}{{m_{i} - 1}}{\left( {x - \overline{x} _{i} } \right)}{\left[ {W{\left( {\Delta \prime } \right)}} \right]}^{{ - 1}} {\left( {x - \overline{x} _{i} } \right)}\prime$$

(21)

since $\det {\left( {A + \beta y\prime \,y} \right)} = \det {\left( A \right)}{\left( {1 + \beta yA^{{ - 1}} y\prime } \right)}$ for any d × d invertible matrix A, any d-dimensional row vector y, and any number β.
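The reduction of Condition 18 to Condition 21 can be made explicit. Applying the determinant identity with A = W(Δ′) to Eqs. 19 and 20 gives, as a sketch that assumes W(Δ′) is positive definite:

$$\det W\left( \Delta \right) = \det W\left( \Delta \prime \right)\left[ 1 + \frac{m_i}{m_i - 1}\left( x - \overline{x}_i \right)\left[ W\left( \Delta \prime \right) \right]^{-1}\left( x - \overline{x}_i \right)\prime \right]$$

$$\det W\left( \Delta \prime \prime \right) = \det W\left( \Delta \prime \right)\left[ 1 + \frac{m_j}{m_j + 1}\left( x - \overline{x}_j \right)\left[ W\left( \Delta \prime \right) \right]^{-1}\left( x - \overline{x}_j \right)\prime \right]$$

Since $\det W\left( \Delta \prime \right) > 0$, dividing both sides of Condition 18 by it and subtracting 1 leaves exactly the comparison of the two quadratic forms in Condition 21.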

If reassignment is profitable, the greatest decrease in the cost function is obtained by selecting the cluster for which $\frac{{m_{j} }}{{m_{j} + 1}}{\left( {x - \overline{x} _{j} } \right)}{\left[ {W{\left( {\Delta \prime } \right)}} \right]}^{{ - 1}} {\left( {x - \overline{x} _{j} } \right)}\prime $ is minimal. According to the above discussion, an iterative optimal weighted K-means algorithm is designed and shown in Fig. 2.
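Since Fig. 2 is not reproduced here, the following is a minimal sketch of one way the iterative procedure described above could be organized; it is not the authors' implementation. It visits objects one at a time, forms W(Δ′) by removing the object from its cluster, and moves the object to the cluster with the smallest weighted quadratic form whenever Condition 21 holds. The function names, the sweep limit, and the rule of never emptying a cluster are assumptions of this sketch, and W(Δ′) is assumed to be invertible.

```python
def mean(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sub(x, m):
    return [a - b for a, b in zip(x, m)]


def scatter(clusters, d):
    """Pooled within-cluster scatter matrix W for a list of clusters."""
    W = [[0.0] * d for _ in range(d)]
    for rows in clusters:
        if rows:
            m = mean(rows)
            for x in rows:
                y = sub(x, m)
                for p in range(d):
                    for q in range(d):
                        W[p][q] += y[p] * y[q]
    return W


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def quad(W, y):
    """Quadratic form y W^{-1} y' for a row vector y."""
    return sum(a * b for a, b in zip(y, solve(W, y)))


def weighted_kmeans(data, labels, K, max_sweeps=100):
    """Move single objects between clusters while Condition 21 holds."""
    d = len(data[0])
    labels = list(labels)
    for _ in range(max_sweeps):
        moved = False
        for idx, x in enumerate(data):
            i = labels[idx]
            members = [[data[t] for t in range(len(data)) if labels[t] == k]
                       for k in range(K)]
            m_i = len(members[i])
            if m_i <= 1:
                continue  # assumption: never empty a cluster
            removed = [list(c) for c in members]
            removed[i].remove(x)          # partition Delta'
            W1 = scatter(removed, d)      # W(Delta')
            best_j = i
            best = m_i / (m_i - 1) * quad(W1, sub(x, mean(members[i])))
            for j in range(K):
                if j == i or not members[j]:
                    continue
                m_j = len(members[j])
                cost = m_j / (m_j + 1) * quad(W1, sub(x, mean(members[j])))
                if cost < best:
                    best_j, best = j, cost
            if best_j != i:
                labels[idx] = best_j
                moved = True
        if not moved:
            break
    return labels
```

Recomputing W(Δ′) from scratch for every object keeps the sketch short but costs O(n) per object; a practical implementation would update the scatter matrix incrementally using Eqs. 19 and 20.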

Experiments and Results

Dataset

This study employs the standard protein mixture (SPM) dataset acquired on an ion trap mass spectrometer [16, 17] to investigate the performance of the proposed method. This dataset consists of 37,044 peptide tandem spectra collected in 22 HPLC/MS/MS runs. The samples analyzed were generated by the tryptic digestion of a control mixture of 18 standard proteins (not of human origin) [16, 17]. The MS/MS spectra were searched using SEQUEST against a human protein database appended with the sequences of the 18 standard proteins and other common contaminants. The SEQUEST results are used to verify the clustering results.

Spectra with different charges have significantly different properties. This study applies the proposed method to two subsets of the SPM dataset: one subset consisting of all 18,496 doubly charged spectra (denoted by SPM2) and the other consisting of all 18,044 triply charged spectra (denoted by SPM3). All singly charged spectra are ignored in this study.

Results

Using the proposed method, SPM2 is divided into two clusters: cluster 1, consisting of 11,365 spectra, and cluster 2, consisting of 7,131 spectra. Table 1 lists the mean centers of the two clusters. Because the mean center of cluster 2 is much larger than that of cluster 1, the spectra in cluster 2 are taken to be of high quality and those in cluster 1 of poor quality.

Table 1 The mean centers for SPM2


Table 2 shows the number of spectra with SEQUEST scores greater than a variety of threshold values. It indicates that the majority of the spectra with higher SEQUEST scores are in cluster 2. Generally, if the SEQUEST score of a doubly charged spectrum is greater than 2.5, this spectrum is considered to be identified (well interpreted). If we use a SEQUEST score of 2.6 as the cutoff value, 85.53% (= 969/(164 + 969)) of the interpretable spectra are in cluster 2. In other words, if we search only the spectra in cluster 2 against a database, we can save 61.45% (= 11,365/(11,365 + 7,131)) of the time while losing only 14.47% (= 1 − 85.53%) of the interpretable spectra.

Table 2 Number of spectra in two clusters with respect to SEQUEST score


Using the proposed method, SPM3 is also divided into two clusters: cluster 1, consisting of 7,739 spectra, and cluster 2, consisting of 10,305 spectra. Table 3 lists the mean centers of the two clusters. Because the mean center of cluster 1 is much larger than that of cluster 2, the spectra in cluster 1 are taken to be of high quality and those in cluster 2 of poor quality.

Table 3 The mean centers for SPM3


Table 4 shows the number of spectra with SEQUEST scores greater than a variety of threshold values. It indicates that the majority of the spectra with higher SEQUEST scores are in cluster 1. Generally, if the SEQUEST score of a triply charged spectrum is greater than 3.5, this spectrum is considered to be identified (well interpreted). If we use a SEQUEST score of 3.6 as the cutoff value, 97.65% (= 499/(12 + 499)) of the interpretable spectra are in cluster 1. In other words, if we search only the spectra in cluster 1 against a database, we can save 57.11% (= 10,305/(10,305 + 7,739)) of the time while losing only 2.35% (= 1 − 97.65%) of the interpretable spectra.

Table 4 Number of spectra in two clusters with respect to SEQUEST score


In summary, considering SPM2 and SPM3 as a whole dataset, if we search only the spectra in the high-quality clusters with SEQUEST, we can save 59.3% (= (10,305 + 11,365)/(11,365 + 7,131 + 10,305 + 7,739)) of the time while losing only 10.71% (= 1 − (969 + 499)/(164 + 969 + 12 + 499)) of the interpretable spectra, namely those that fall in the poor-quality clusters.
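The percentages quoted in this section follow directly from the cluster sizes and the counts of interpretable spectra. The short sketch below recomputes them; the dictionary layout and the function names are hypothetical, and the time saving is taken, as in the text, to be proportional to the number of spectra skipped.

```python
def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Cluster sizes and numbers of interpretable spectra reported in the text.
SPM2 = {"poor": 11365, "high": 7131, "id_poor": 164, "id_high": 969}
SPM3 = {"poor": 10305, "high": 7739, "id_poor": 12, "id_high": 499}


def summarize(*subsets):
    """Return (percent of search time saved, percent of interpretable
    spectra lost) when only the high-quality clusters are searched."""
    skipped = sum(s["poor"] for s in subsets)
    total = skipped + sum(s["high"] for s in subsets)
    lost = sum(s["id_poor"] for s in subsets)
    identified = lost + sum(s["id_high"] for s in subsets)
    return pct(skipped, total), pct(lost, identified)


print(summarize(SPM2))        # doubly charged spectra
print(summarize(SPM3))        # triply charged spectra
print(summarize(SPM2, SPM3))  # both subsets combined
```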

Conclusions and Future Work

The evaluation of tandem mass spectra is important for the reduction of the database search time. This study has proposed a method of classifying tandem mass spectra into one group of mass spectra with high quality and one with poor quality. Computational experiments illustrate that if we just search the spectra in the high-quality group, we can save about 60% of searching time while losing only about 10% of high-quality spectra. This result indicates that the proposed method is useful in saving database search time because it ignores the spectra in the cluster with poor quality.

In this study, the proposed method was applied to raw tandem mass spectra, which are contaminated by noise. We have recently developed a method to denoise raw tandem mass spectra [18], which can improve the reliability of peptide identification. Classifying denoised rather than raw mass spectra may likewise improve the reliability of quality assessment. One direction of our future work is therefore to combine the denoising method with quality assessment methods.

References

1. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature 2003;422:198–207.
2. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequence in a protein database. J Am Soc Mass Spectrom 1994;5:976–89.
3. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999;20:3551–67.
4. Field HI, Fenyö D, Beavis RC. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimizes protein identification, and archives data in a relational database. Proteomics 2002;2:36–47.
5. Frank A, Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem 2005;77:964–73.
6. Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G. Peaks: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 2003;17:2337–42.
7. Bern M, Goldberg D, McDonald W, Yates J. Automatic quality assessment of peptide tandem mass spectra. Bioinformatics 2004;20:i49–i54.
8. Salmi J, Moulder R, Filen J, Nevalainen O, Nyman T, Lahesmaa R, Aittokallio T. Quality classification of tandem mass spectrometry data. Bioinformatics 2006;22:400–6.
9. Flikka K, Martens L, Vandekerckhove J, Gevaert K, Eidhammer I. Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 2006;6:2086–94.
10. Na S, Paek E. Quality assessment of tandem mass spectra based on cumulative intensity normalization. J Proteome Res 2006;5:3241–8.
11. Nesvizhskii A, Roos F, Grossmann J, Vogelzang M, Eddes J, Gruissem W, Baginsky S, Aebersold R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data. Mol Cell Proteomics 2006;5:652–70.
12. Wu FX, Gagne P, Droit A, Poirier GG. Quality assessment of peptide tandem mass spectra. BMC Bioinformatics 2008;9:S13.
13. Kinter M, Sherman NE. Protein sequencing and identification using tandem mass spectrometry. New York: Wiley; 2000.
14. Ding J, Shi JH, Zou AM, Wu FX. Feature selection for tandem mass spectrum quality assessment. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine, 2008, pp. 310–13.
15. Spath H. Cluster analysis algorithms for data reduction and classification of objects. West Sussex, UK: Ellis Horwood Limited; 1975.
16. Keller A, Purvine S, Nesvizhskii AI, Stolyar S, Goodlett DR, Kolker E. Experimental protein mixture for validating tandem mass spectral analysis. OMICS 2002;6:207–12.
17. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002;74:5383–92.
18. Ding J, Shi JH, Poirier GG, Wu FX. A novel approach to denoising tandem mass spectra. BMC Proteome Science, accepted, 2009.

Acknowledgments

This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to thank Dr. Andrew Keller from the Institute for Systems Biology for generously providing the spectral data and protein databases for the SMP dataset used in this paper.

Author information

Authors and Affiliations

Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada
Jiarui Ding & Fang-Xiang Wu
Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada
Jinhong Shi & Fang-Xiang Wu

Authors

Jiarui Ding
Jinhong Shi
Fang-Xiang Wu

Corresponding author

Correspondence to Fang-Xiang Wu.

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Ding, J., Shi, J. & Wu, FX. Quality Assessment of Tandem Mass Spectra by Using a Weighted K-Means. Clin Proteom 5, 15–22 (2009). https://doi.org/10.1007/s12014-009-9025-4

Received: 13 November 2008
Accepted: 23 February 2009
Published: 12 March 2009
Issue Date: March 2009
DOI: https://doi.org/10.1007/s12014-009-9025-4
