Engineered QC features quantify the quality of chromatographic peaks
Poor chromatography and interference compromise the quality of a peak. The factors most affected by this compromise are peak shape, symmetry, jaggedness, FWHM, retention time shift, and the consistency of the ratios of peptide transitions. TargetedMSQC calculates quality features that have been engineered to capture changes in peak quality. A total of thirty-two features have been developed to describe the peak quality. These features can be categorized into nine main groups of jaggedness, peak similarity, symmetry, FWHM, modality, shift, intensity, area ratio and retention time. Additional file 1: Table S1 lists these QC features and their definitions in detail.
Figure 2 depicts several representative examples of chromatographic peaks and their corresponding QC features. For instance, jaggedness is defined as the fraction of time points across a peak where the signal changes direction, excluding the peak apex and therefore, returns a value of 0.0 for a smooth peak, whereas jaggedness scores closer to 1.0 indicate a noisy peak (Fig. 2a). Similarity score between two peaks is determined by the Pearson correlation coefficient between the peak intensities, which yields a value of 1.0 for highly similar peaks that mirror the shapes of one another and lower values for pairs with differences in peak shapes (Fig. 2b). Symmetry score for a peak quantifies how symmetric a peak is along its time midpoint and yields values close to 1.0 for perfectly symmetric and lower scores for asymmetric peaks (Fig. 2c). FWHM and FWHM to peak base width ratio are well-known quality metrics for chromatographic peaks that can deviate from a normal range due to interference resulting in peak shoulders or poor chromatography (Fig. 2d, e). Modality score is defined as the largest dip in the peak, normalized by peak height. For example, the three transitions represented in Fig. 2f show varying levels of bimodal behavior with modality scores ranging from 0.0 for the smooth y4 transition to 0.3 for the highly bimodal y7 transition. Peaks with high intensities at the peak boundary may be subject to interference and therefore the peak intensity at the boundary was introduced as a QC feature in this schema (Fig. 2g). Additionally, all the fragment ions that belong to the same peptide must co-elute. Therefore, observing a shift in the elution of one transition flags the peak for manual reanalysis. This shift in elution is quantified by shift score as shown in Fig. 2h. Finally, an important attribute of peak groups in a mass spectrometry experiment is consistency of relative ratios of transitions not only between endogenous and standard isotopes, but also across samples. Pair ratio consistency is one of the metrics that quantifies this feature and is able to identify transitions where this ratio deviates from expected values as shown for y3 transition in Fig. 2i.
An ensemble of these QC features enables a quantitative way to evaluate peak quality. For instance, a peak that may suffer from nothing more than tailing will result in low symmetry scores while showing close to perfect scores for all the other features, whereas interference in a peak may manifest in poor similarity, modality, shift and isotope ratio consistency scores. TargetedMSQC uses supervised learning based on a set of manually annotated peaks with similar features to train a model that is capable of predicting peak qualities in new samples, which is particularly useful for performing QC on large datasets.
Peak quality assessment of CSF biomarkers in artificial matrix
To demonstrate the feasibility of developing a predictive peak quality model, our approach was first applied to a dataset of AQUA™ peptides of CSF candidate biomarkers spiked into bovine serum albumin (BSA) as an artificial CSF matrix. This experiment was used for optimization of the CSF biomarker panel. Samples were analyzed by an MRM panel of 144 unique peptide transitions to quantify 36 peptides using stable-isotope labeled internal standards.
Training dataset
Four runs were used to generate the training set containing a total of 288 peak groups (144 endogenous and 144 spiked-in internal standards) and 576 transition pairs. A minimum accuracy of 90% was agreed upon as acceptable performance in this study. The learning curve for this dataset (Additional file 1: Figure S1) shows that at least 400 transition pairs are required to achieve this performance. Of the 576 peaks in the training set, 199 were flagged by an analyst for quality issues pertaining to interference, poor chromatography or low signal. Characteristics such as bimodality, jagged edges, severe tailing, variability in transition ratios and low spiked-in standard signal were used by the analyst to identify and flag low quality peaks. The remaining 377 peaks were considered of acceptable quality.
The training dataset consists of a list of peaks identified by run, peptide sequence, precursor charge, fragment ion, product charge and quality status, which has been determined manually by an analyst. Peak quality metrics were calculated for the training set; violin plots for the distribution of each metric are shown in Additional file 1: Figure S2. The training set was used as a guide to build a predictive model that can identify low quality peaks based on an ensemble of QC features such as peak symmetry, similarity, modality, jaggedness, co-elution and FWHM. Distribution of features for this dataset (Additional file 1: Figure S2) provides a quick glance at the overall quality of peaks in this set as judged by the analyst. As shown in Additional file 1: Figure S2, majority of the features exhibit a pattern in relationship to the quality status determined through manual inspection. For example, flagged peaks tend to have higher scores for features such as modality, shift, isotope pair ratio consistency, coefficient of variation and intensity at the peak boundary and lower scores for features such as symmetry, similarity, and correlation between peak area ratios of endogenous and standard isotopes. However, there is a not a single feature that can effectively discriminate between high and low quality peaks, further emphasizing the importance of utilizing multiple features and multivariate statistical methods to identify peaks with poor quality. The training dataset in this study contains peaks on a diverse quality spectrum. Therefore, by including a diverse set of peaks in the training dataset the capability of the predictive model to identify a wider range of quality issues is enhanced.
Predictive peak quality model development
After calculating the QC features for the peaks in the training dataset, several supervised learning methods were examined to construct a peak quality predictive model: Regularized logistic regression, regularized random forest (RRF), K-nearest neighbor (KNN) and support vector machine (SVM) with linear and polynomial kernels. When applicable, regularization was used to simplify the model and reduce the risk of overfitting. The models were built based on 80% of the data randomly sampled from the training set and their performances were further verified by testing the models on the remaining 20% of the data, named validation subset. The performances of the models were estimated by comparing the predicted peak qualities for each model with the observed qualities as determined by the analyst.
The performances of these models are summarized in Additional file 1: Table S2 and Figure S3. Among the five investigated methods, regularized random forest achieved the highest performance with an accuracy of 94.5% to distinguish flagged from high quality peaks in the training subset, and sensitivity and specificity rates of 97.4% and 98.7% on the validation subset, respectively. The negligible difference between model classification accuracy on the training and validation subsets suggests minimal overfitting. The performance of the other four models was reasonable as well, ranging from accuracies of 89.8% for linear SVM to 93.4% for the KNN models; however most of them showed low sensitivity for detection of poor quality peaks in the validation subset. In general, considering that training of an RRF model is computationally expensive, it takes longer to build an RRF model compared to the other methods employed in this study. For example, training the RRF model, including cross validation and parameter tuning, took ~ 40 min on a single processor. In comparison, training of the KNN model with the same cross validation method and similarly sized parameter tuning set was six times faster. Therefore, alternative methods may still be worthwhile to explore for different applications or datasets. Additional file 1: Figure S4 shows the relative importance of the QC features in determining the quality outcome for a peak in the RRF predictive model. As shown in this figure, compared to jaggedness, shift and modality metrics, features that quantify peak intensity, isotope ratio consistency, peak similarity and FWHM seem to play a more crucial part in determining the outcome of the predictive peak quality model. One explanation could be that this particular dataset has very few jagged or bimodal peaks and therefore, metrics such as jaggedness and modality do not play a defining role in distinguishing high and low quality peaks.
The high sensitivity and specificity of the model for distinguishing low quality peaks suggest that the developed QC features are capable of capturing the peak quality in a similar way to manual inspection of the peaks. These features can subsequently be leveraged by machine learning to make predictions about peak quality in unseen data, as long as the training set provides a fair representation of the range of peaks in the dataset under examination by the model. This example demonstrates great agreement between the manual inspection calls and the model outcome. In this example, the rare instances of disagreement between the model and the analyst are marginal cases (Additional file 1: Figure S5) where the peak quality lies in the gray zone and does not have a tangible effect on the quantitative outcome of the experiment.
Peak quality assessment of longitudinal CSF biomarkers of AD dataset
To further evaluate the practicality and performance of the proposed quality assessment framework, TargetedMSQC was applied to a dataset from a longitudinal study of candidate progression biomarkers of Alzheimer’s disease in procured CSF samples of AD patients. Samples were analyzed by the same MRM panel in the previous example, quantifying 36 peptides using stable-isotope labeled internal standards.
Training dataset
The original dataset included 70 runs of a panel of 36 peptides in procured CSF samples from patients with Alzheimer’s disease. Eight runs were selected at random to be annotated for the training dataset. Poor quality peaks including, jagged peaks, peaks with high background or interference were flagged in the training dataset. This resulted in a set of 1128 transition pairs, with 615 ‘ok’ and 513 ‘flagged’ peaks. The learning curve for this dataset (Additional file 1: Figure S6) shows that at ~ 900 transition pairs the curve plateaus. The engineered features were calculated for each transition pair in the training set as shown in Additional file 1: Figure S7. The patterns for the distribution of features were as expected. Flagged peaks had higher jaggedness, modality, shift, peak area CV% and lower symmetry and similarity scores. Additionally, they showed higher normalized intensities at the peak boundaries, as well as poor consistency in FWHM across samples and isotopes and poor consistency in pair isotope ratios across samples.
Predictive peak quality model development
The training set was used for development of five models using different statistical approaches, including RRF, regularized logistic regression, SVM with linear and polynomial kernels and KNN methods. Similar to the previous example, the training set was split into training (80%) and validation (20%) subsets. The model was built using the training subset, while the validation subset was used for unbiased evaluation of the performance of the model. Using the validation subset for performance evaluation helps diagnose issues with overfitting and therefore is of great importance in choosing the final model.
Performance of the five developed models was compared in the training as well as the validation subset (Additional file 1: Table S3 and Figure S8). The RRF model yielded the highest classification accuracy, achieving an accuracy of 89.1% and 91.1% in the training and validation subsets, respectively. The accuracy of the other tested models ranged from 80.1% in the linear SVM model to 87.0% in the KNN model. However, the sensitivity of the RRF model for flagging peaks with low quality significantly outperformed that of the other four models and therefore RRF was selected as the final predictive model for peak quality assessment in this dataset. Furthermore, receiver operator characteristics (ROC) analysis of the outcome of the RRF model on the validation subset returned an area under the curve (AUC) of 0.975, which also demonstrates the high performance of the predictive model (Additional file 1: Figure S9). Depending on the difference between the distribution of individual QC features in low and high quality peak groups, each of these metrics plays a role in determining the output of the model. In this dataset, QC features that measure attributes such as peak intensity, modality, correlation and consistency of isotope peak area ratios, peak area CV%, FWHM, and symmetry appear to have a more prominent impact on the output of the model (Additional file 1: Figure S10). Alternatively, jaggedness and shift scores do not seem to play a crucial part in determining the peak quality in the RRF model.
It should be noted that applying the model developed in the artificial matrix to assess the quality of the longitudinal biomarker data and vice versa shows poor performance. This can be attributed to the complexity of the biological matrix in the experiment. The highly complex human CSF matrix containing endogenous biomarkers poses a greater challenge for the LC-MS/MS method when compared to the artificial CSF matrix, which consists of bovine serum albumin with spiked-in standard isotope-labeled peptides. The matrix effect, biological variability of samples and unpredictable interference from the endogenous analytes in CSF compromises the chromatography in the longitudinal biomarker study. This ambiguity complicates the definition of acceptable peaks in the dataset and creates a gray zone, where the assessment of the peak quality may be more subjective. In contrast, using BSA as an artificial CSF matrix creates a controlled sample, where the difference between high and low quality peaks is clearer. This clarity further simplifies annotation of the training set and subsequently optimization of the performance of the model. Moreover, the concentrations of spiked-in AQUA™ peptides in the artificial CSF samples were greater than the limit of quantitation for the vast majority of the peptides, which generates peaks with higher intensities and fewer jagged or irregular peaks. Considering that the peaks that are close to the limit of quantitation are more anomalous in shape and quality, it is expected that a dataset of high intensity peaks, as seen in the experiment with artificial matrix, leads to a more simplified quality assessment process both through manual and automated workflows.
Additional file 1: Figure S11 highlights the transitions that were misclassified by the model in the longitudinal biomarker study in red. Of the 1128 transition pairs in the training set, 7 were misclassified as ‘ok’ (S9 A–F) and 13 were misclassified as ‘flagged’ (S9 G–R) by the model. A majority of misclassified peaks fall into the gray area between a high and low quality peak. Ambiguity surrounds the annotations of flagged data by both the model and through manual inspection. To confirm this hypothesis, we can evaluate the class probabilities for the misclassified peaks. Class probability for each transition pair is the likelihood of that transition falling into a certain class as estimated by the RRF model. For example, a ‘flag’ class probability of 0.5 or higher for a transition, results in flagging of that transition by the model as a low quality peak. Transitions with class probabilities closer to 0.5 are considered marginal cases. The ‘flag’ class probability of the transitions correctly flagged by the model is 0.91 ± 0.10, whereas the probability for low quality misclassified transitions is 0.65 ± 0.13, indicating a higher level of uncertainty about the peaks that were misclassified. Similarly, the ‘ok’ class probability of the transitions that correctly passed QC by the model is 0.91 ± 0.09, while this probability for the low quality transitions that were misclassified as high quality is 0.70 ± 0.15. Many of the misclassified transitions are low intensity peaks that are either below or very close to the limit of quantitation. At such low concentrations, the quality of the peak could be compromised not due to poor chromatography or interference, but only because of the low intensity of the peak and limitations of the detector in resolving the peak. Therefore, defining more rigorous quality assessment criteria at these low levels may decrease the uncertainty of the model and therefore improve its performance.
Peak quality assessment in large targeted MS datasets using the predictive model
The optimized RRF model was applied to the CSF biomarker longitudinal study dataset. The QC features were calculated for each transition pair for 36 peptides quantified in 70 runs and used as the input to the model. TargetedMSQC generated a report based on the output of the model summarizing the QC results for the complete dataset and at sample and peptide levels. Additionally, for each peptide with an isotope pair, model output was visualized for individual transitions and samples. Here, two of these peptides are discussed in more details as examples of how this tool can assist the analyst in method development and data analysis. The full TargetedMSQC report for the CSF biomarker longitudinal study is provided in the Additional file 2.
Figure 3 shows the QC results summary for peptide LGPLVEQGR. Four transitions, y4, y5, y6 and y7, were monitored for quantification of this peptide. The heatmap in Fig. 3 shows the output of the model for individual transitions across 70 samples included in the study. The peak groups for 6 of these runs are depicted in the figure as representatives of the dataset. Sample S30 shows an example of a high quality peak, which has correctly passed QC by the model. As shown in Fig. 3, three of the transitions, y4 (red), y6 (blue) and y7 (purple) passed QC by the model in almost all of the samples. On the other hand, y5 (green) is flagged in the majority of the samples, including samples S22, S55, S60 and S64. LGPLVEQGR shows an example of a peptide that could be accurately quantified across multiple samples by removing the transition with interference (y5) from the quantitative analysis. Upon manual inspection of y5 across these samples, we observed that this transition had interference in many of the flagged samples similar to what is seen in S22 and S64. The interference resulted in high background at the peak boundary (S22: normalized intensity at boundary of 0.17 compared with 0.02 as the median of the dataset) and low similarity between the endogenous and standard pairs (S64: pair similarity score of 0.06 compared with 0.98 as the median of the dataset) and high modality score (S64: transition modality score of 0.34 compared with 0.00 as the median of the dataset). Sample S55 shows an example of a peak with poor chromatography and shoulders, where at least three of the transitions have been flagged due to a shoulder, particularly in the standard peptide. Sample S60 shows an example of a peak with bimodal transitions that has also been correctly flagged by the model. All the transition pairs in this group suffer from jaggedness, poor modality and low similarity scores, all of which are likely the main driving force behind the output for this peak. Although the model performs well in a majority of the samples, there are examples where the output is not as expected. Sample S2 shows a peak group where the y5 transition has interference and needs to be flagged; however the model failed to flag this transition as a low quality transition. It should be noted that the model has predicted a value of 0.492 as the ‘flag’ class probability for this transition, which indicates that the QC features calculated for this transition place it on the verge of being flagged. A solution to improving the sensitivity of the model for detection of marginal low quality peaks such as y5 in sample S2 is to lower the default threshold of 0.5 as the cut-off class probability for flagged peaks in the RRF model. At the expense of increasing the rate of incorrectly flagging high quality peaks, lowering the threshold could help flag more peaks and transitions in the marginal gray area between the two classes.
Figure 4 shows the QC results summary for peptide VLSLAQEQVGGSPEK. Unlike the LGPLVEQGR example, none of the transitions have been systematically flagged across majority of the samples. However, all four monitored transitions seem to be flagged in a number of samples. One could conclude from this data that in these samples the transitions cannot be used to reliably quantify the target peptides. Many of the flagged peaks seem to have failed quality assessment due to low intensity, jagged and tailing peaks. This is particularly evident for the heavy standard isotopes. This is represented in samples S10 and S32. Both have slightly jagged, bimodal and low intensity standard signals. Samples S47 and S61 are highly jagged, slightly tailing and have very low standard signals. Sample S14 is jagged and tailing and therefore flagged for poor quality. Finally, sample S69 shows an example of a peak that has correctly passed QC. It should be noted that although this peak is tailing, the peak quality is otherwise acceptable and therefore similar peaks in the training set were marked as acceptable.