How to use SanXoT
The SanXoT Software Package (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as Proteome Discoverer), using the WSPP model[1]. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.
In the fundamental workflow, the quantitative results at the scan level are integrated to quantify each peptide. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated to determine the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the unit tests for SanXoT.
Among other possibilities, SanXoT allows workflows to be built that automatically process very large numbers of quantitative experiments, integrate results obtained from technical or biological replicates, or perform Systems Biology analysis using the SBT model[2]. Examples of these applications are provided in the unit tests for SanXoT.
General design
All SanXoT workflows contain two initial steps:
- a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. For more details, see the section Preparing the data.
- a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check the section Calibrating the data.
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)[2] to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, then peptide-level data into protein-level data, and finally to integrate the protein-level data.
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. See Advanced applications for some examples developed using this modular structure.
Data structure
Each GIA integration needs two tab-separated text tables as input[2]:
- a data file containing the quantitative data of the lower level. This file contains three columns: identifier (a text string used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).
- a relations table, which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.
Note that SanXoT checks the column number (not the column name). The first row, which contains the header, is discarded, so the user can choose column names freely (as long as exactly one header row is included).
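The two input tables described above can be sketched in a few lines of Python. This is only an illustration of the layout: the identifiers, numbers, header names and helper functions (`to_tsv`, `read_by_position`) are hypothetical and are not part of SanXoT.

```python
import csv
import io

# Hypothetical example rows; identifiers and numbers are made up for illustration.
DATA_ROWS = [
    ("scan_001", 0.12, 35.0),   # identifier, log2-ratio, prior weight
    ("scan_002", -0.05, 80.0),
    ("scan_003", 0.40, 12.5),
]
RELATION_ROWS = [
    ("PEPTIDEA", "scan_001"),   # upper-level identifier, lower-level identifier
    ("PEPTIDEA", "scan_002"),
    ("PEPTIDEB", "scan_003"),
]


def to_tsv(header, rows):
    """Serialize rows as tab-separated text with a single header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_by_position(text, n_columns):
    """Read a table by column position, discarding the header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # the header row is skipped, so its names are free-form
    return [tuple(row[:n_columns]) for row in reader]


data_text = to_tsv(("id", "log2ratio", "weight"), DATA_ROWS)
relations_text = to_tsv(("peptide", "scan"), RELATION_ROWS)
```

Because the tables are read by position, renaming the header fields in this sketch changes nothing in the rows that come back from `read_by_position`.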
Preparing the data
The output of the quantification software must be converted to generate:
- an uncalibrated data file containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling, the third column contains the intensity of the more intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values[1].
- the relations tables of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.
These tables are easy to construct thanks to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.
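For isobaric labeling, one row of the uncalibrated data file could be derived from two reporter-ion intensities roughly as follows. This is a minimal sketch: the function name, the scan identifier and the direction of the ratio (first intensity over second) are assumptions made for illustration, not something prescribed by SanXoT.

```python
import math


def uncalibrated_row(scan_id, intensity_a, intensity_b):
    """Return (identifier, log2-ratio, quality parameter) for one scan.

    The quality parameter is the intensity of the more intense reporter ion.
    The direction of the ratio (a over b) is an assumption of this sketch.
    """
    if intensity_a <= 0 or intensity_b <= 0:
        raise ValueError("both intensities must be positive to take a log-ratio")
    log2_ratio = math.log2(intensity_a / intensity_b)
    quality = max(intensity_a, intensity_b)
    return scan_id, log2_ratio, quality


# Hypothetical scan with reporter intensities 2000 and 1000.
row = uncalibrated_row("scan_001", 2000.0, 1000.0)
```

Scans where either intensity is zero or missing cannot produce a log2-ratio; the sketch rejects them, but how to handle them in a real workflow is left to the parsing step.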
Calibrating the data
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the Klibrate module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variance of each measurement, but are also used as statistical weights to calculate the weighted averages. In addition, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.
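The role of the weights can be illustrated with a plain inverse-variance weighted average. This is a simplified sketch, not the SanXoT implementation: the function name is hypothetical, and the way the general variance of the integration is added to each element's own variance is an assumption based on standard error propagation.

```python
def integrate(values, weights, general_variance=0.0):
    """Inverse-variance weighted mean of lower-level log2-ratios.

    Each weight is treated as 1/variance. The general variance of the
    integration is added to each element's own variance before weighting
    (an assumption of this sketch). Returns the upper-level value and weight.
    """
    effective = [1.0 / (1.0 / w + general_variance) for w in weights]
    total = sum(effective)
    mean = sum(w * x for w, x in zip(effective, values)) / total
    # Error propagation: the variance of the weighted mean is 1/total,
    # so the upper-level weight is the sum of the effective weights.
    return mean, total


# Two hypothetical scans with log2-ratios 1.0 and 3.0 and equal weights.
value, weight = integrate([1.0, 3.0], [2.0, 2.0])
```

Since the returned weight is already an inverse variance, it can be passed directly to the next integration, which is why no recalibration is needed between levels.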
Performing integrations and removing outliers
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers[1] in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.
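The chain from standardized values to probabilities to FDR can be sketched as below. Several simplifications are assumed: the z-value is just the deviation from the upper-level value divided by the element's standard deviation, the probability is two-sided under N(0,1), and the Benjamini-Hochberg procedure stands in for the FDR calculation, which the text does not specify. All function names are hypothetical.

```python
import math


def z_values(values, weights, upper_value):
    """Standardized log2-ratios: deviation from the upper value in SD units."""
    return [(x - upper_value) * math.sqrt(w) for x, w in zip(values, weights)]


def two_sided_p(z):
    """Probability of a deviation at least as extreme as z under N(0,1)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bh_fdr(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical lower-level elements integrated into an upper value of 2.0.
z = z_values([1.0, 3.0], [4.0, 4.0], 2.0)
fdr = bh_fdr([two_sided_p(v) for v in z])
```

Elements whose FDR falls below the user-defined level would be the candidates for removal.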
Outlier removal is performed by the SanXoTSieve module by tagging the outlier elements in the relations table. The entire process is done in three steps:
- a first integration is done with SanXoT, which serves to calculate the variance
- outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve
- a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.
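The three steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the SanXoT or SanXoTSieve code: the general variance is passed in as a number standing in for the result of the first integration, Benjamini-Hochberg is used as a stand-in FDR estimate, and marking outliers with an "out" tag in a third column is a hypothetical representation of the tagged relations table.

```python
import math


def integrate(relations, data, variance, tagged=frozenset()):
    """Weighted mean per upper-level id, skipping tagged (upper, lower) pairs."""
    sums = {}
    for upper, lower in relations:
        if (upper, lower) in tagged:
            continue
        x, w = data[lower]
        w_eff = 1.0 / (1.0 / w + variance)  # element variance plus general variance
        num, den = sums.get(upper, (0.0, 0.0))
        sums[upper] = (num + w_eff * x, den + w_eff)
    return {upper: num / den for upper, (num, den) in sums.items()}


def sieve(relations, data, variance, fdr_level):
    """Tag the most extreme outlier, re-integrate, and repeat until none qualifies."""
    tagged = set()
    while True:
        means = integrate(relations, data, variance, tagged)
        kept = [pair for pair in relations if pair not in tagged]
        if not kept:
            return tagged
        p_values = []
        for upper, lower in kept:
            x, w = data[lower]
            z = (x - means[upper]) / math.sqrt(1.0 / w + variance)
            p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
        n = len(p_values)
        # Benjamini-Hochberg value attached to the smallest p-value.
        fdr = min(p * n / rank for rank, p in enumerate(sorted(p_values), start=1))
        if fdr >= fdr_level:
            return tagged
        worst = min(range(n), key=lambda i: p_values[i])
        tagged.add(kept[worst])


# Hypothetical data: lower-level id -> (log2-ratio, prior weight).
data = {"s1": (0.0, 1.0), "s2": (0.1, 1.0), "s3": (-0.1, 1.0),
        "s4": (0.05, 1.0), "s5": (10.0, 1.0)}
relations = [("pepA", scan) for scan in data]
variance = 0.5  # stands in for the general variance from the first integration

tagged = sieve(relations, data, variance, fdr_level=0.01)
tagged_rows = [(u, l, "out" if (u, l) in tagged else "") for u, l in relations]
final = integrate(relations, data, variance, tagged)  # second integration
```

In this made-up example only the scan with a log2-ratio of 10.0 is tagged, and the second integration averages the four remaining scans with the variance held fixed.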
Interpreting the results of GIA integrations
GIA integrations generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect those values. Each integration also produces additional information that can be very useful for interpreting the results, the most important being the general variance of the integration.
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination[2].
Building the workflows
Typically, workflows contain three main parts:
- Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using Aljamia or other scripts provided by the user.
- Data calibration to generate the (calibrated) data files. This is performed using Klibrate.
- A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running SanXoT (to calculate the general variance), SanXoTSieve (to tag outliers) and SanXoT again (to integrate without the outliers).
Note that some specific GIA integrations require other modules. These include experiment merging and systems biology analysis. See Advanced applications and exploring SanXoT features.
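The overall shape of such a workflow can be written down as an ordered list of module runs. The sketch below only names the modules mentioned in the text and deliberately shows no command-line options; the function name and the level name "all" (standing for the final integration of the protein-level data) are hypothetical.

```python
def plan_workflow(levels):
    """List the module runs for a chain of integrations.

    `levels` is ordered from lowest to highest, e.g. scan, peptide, protein.
    Only module names from the text are used; no command-line options are shown.
    """
    steps = [
        ("Aljamia", "parse quantification output into uncalibrated data and relations tables"),
        ("Klibrate", f"calibrate {levels[0]}-level data to obtain prior weights"),
    ]
    for lower, upper in zip(levels, levels[1:]):
        label = f"{lower} -> {upper}"
        steps.append(("SanXoT", f"{label}: integrate and calculate the general variance"))
        steps.append(("SanXoTSieve", f"{label}: tag outliers in the relations table"))
        steps.append(("SanXoT", f"{label}: integrate again with fixed variance, outliers removed"))
    return steps


# Fundamental workflow: scans -> peptides -> proteins -> final integration.
plan = plan_workflow(["scan", "peptide", "protein", "all"])
for module, description in plan:
    print(f"{module:12s} {description}")
```

A branched workflow would reuse the same three-run pattern for each branch, starting from the output data file of the level where the branch begins.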
Advanced applications
SanXoT also allows GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize ("center") the data to take into account the systematic quantitative error of each experiment before integration.
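One plausible reading of centering is to subtract each experiment's weighted mean from its log2-ratios, so that a systematic offset in one experiment does not bias the merged averages. The sketch below illustrates that reading only; it is an assumption and is not taken from Cardenio, and the function name is hypothetical.

```python
def center(values, weights):
    """Subtract the weighted mean so the experiment's log2-ratios average to zero.

    Assumes the systematic error of an experiment is a constant offset
    in log2-ratio scale (an assumption of this sketch).
    """
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return [x - mean for x in values]


# Hypothetical protein log2-ratios from one experiment, equally weighted.
centered = center([0.5, 1.5], [1.0, 1.0])
```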
Systems biology analysis can be performed using the SBT model[2] by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.
Systems biology information can be downloaded, for example, from DAVID. The SanXoT Software Package includes the Camacho module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.
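A custom relations file for systems biology can be produced from any protein annotation source with a few lines of code. In this sketch the protein accessions, category names, header names and helper functions are all hypothetical; only the two-column layout (category first, protein second) comes from the text.

```python
def category_relations(protein_to_categories):
    """Build sorted (category, protein) rows from a protein -> categories mapping."""
    rows = []
    for protein, categories in protein_to_categories.items():
        for category in categories:
            rows.append((category, protein))
    return sorted(rows)


def to_tsv(rows, header=("category", "protein")):
    """Serialize the rows as tab-separated text with a single header row."""
    lines = ["\t".join(header)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


# Hypothetical annotations: one protein may belong to several categories.
annotations = {
    "P12345": ["glycolysis", "cytosol"],
    "Q67890": ["glycolysis"],
}
rows = category_relations(annotations)
table = to_tsv(rows)
```

A protein that belongs to several categories simply appears in several rows, one per category.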
Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information, see exploring SanXoT features.