Difference between revisions of "QuiXoT"

From PROTEOMICA
Revision as of 16:14, 12 September 2013

QuiXoT
Screenshot of QuiXoT, depicting different spectra and graphs used.
Last release: v.1.4.00
Release date: 20th Aug 2013
Download link: [[{{{link}}}]]
Source code: QuiXoT at GitHub
Licence: Please read Licencing
Requirements


QuiXoT is an open source software created for the quantitation and statistical analysis of quantitative proteomics experiments. It has been developed at the Cardiovascular Proteomics Laboratory of Prof Jesús Vázquez, at the Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain.

It has been developed in Visual C#, hence users must install the .NET Framework 2.0 or higher (not necessary for Windows 7 users), which can be downloaded from this link.

Using QuiXoT

See also the article: DataGrid information in QuiXoT.

Part I: Checking an existing QuiXoT analysis

The QuiXML files

QuiXoT makes use of QuiXML files, an ad hoc XML format created to manage the three levels of information treated: identification, quantitation and statistics. For a list of the fields used in QuiXML files (i.e. the columns appearing in the main window of QuiXoT), see the article DataGrid information in QuiXoT.

After dragging and dropping the QuiXML file, you will have to choose the quantitation method.

If you just want to view an existing QuiXoT analysis, you only need the corresponding QuiXML file. Drag and drop that file onto the main form and select the quantitation method used, which depends on the SIL method (such as 18O, SILAC, etc.) and on the spectrometer (high or low resolution).

The binStack folder

The spectra are saved in a folder called binStack, which contains one or more .bfr files and one index.idx file. You do not need this folder if you just want to check the results of a QuiXoT analysis (such as the statistics, identifications or the quantitative information).

However, you will need it if you want to requantitate a spectrum or see the spectrum itself (for example, to compare the theoretical and experimental isotopic envelopes, which are shown in red and blue, respectively). In that case you should always keep the QuiXML file and its corresponding binStack in the same folder (do not forget to move them together).

As long as the binStack and the QuiXML file are in the same place, you do not need to do anything else to load the spectral information.
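The folder layout described above can be checked before loading a file. The following is a minimal Python sketch and is not part of QuiXoT: the function name check_binstack is hypothetical, and only the folder name binStack, the .bfr extension and the index.idx file name are taken from the text.

```python
from pathlib import Path


def check_binstack(quixml_path):
    """Return a list of problems found with the binStack folder next to a QuiXML file."""
    quixml = Path(quixml_path)
    binstack = quixml.parent / "binStack"
    problems = []
    if not binstack.is_dir():
        problems.append("binStack folder not found next to the QuiXML file")
        return problems
    # The text says the folder holds one index.idx file ...
    if not (binstack / "index.idx").is_file():
        problems.append("index.idx is missing")
    # ... and one or more .bfr files.
    if not any(binstack.glob("*.bfr")):
        problems.append("no .bfr files found")
    return problems
```

An empty list means the layout matches the description. It does not verify that the binStack actually belongs to that particular QuiXML file.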

The configuration files

In the location where you have copied your version of QuiXoT you will find a conf folder containing the configuration files. It contains three kinds of file:

  • the QuantitationMethods.xml file, which contains the parameters of the different methods used. It specifies which labelling is associated with each method and which spectrum type contains the quantitative information (for instance, SILAC quantitation is performed in the full scan if a high resolution spectrometer has been used, but with a low resolution machine it is performed in the zoom scan immediately preceding the MS2 scan). Examples of other parameters that can be set in this file are:
      • width: the tolerance for the high resolution peaks
      • deltaR: the mass difference between 16O and 18O
      • sumSQtolerance_NG: the tolerance accepted for the sum of squares when comparing the theoretical and the experimental spectrum
      • the iTRAQ masstags and their corrections
  • several xml files which contain information such as the weights of each isotope, the composition of each amino acid or their posttranslational modifications. They also include the correspondence between the different residues and their symbols; for instance, "Y" means "tyrosine", while "*" may refer to an oxidation in methionine, or a SILAC label on arginine. Examples of these files are:
      • isotopes.xml
      • aminoacids.xml
      • aminoacids_SILAC.xml
  • several xsd files, which are the XML schemas that define the structure of the QuiXML file for each quantitation method. Some examples of these files are:
      • identifications_schema_18Ohighres.xsd
      • identifications_schema_mascot_SILAC.xsd
Checking spectra by weight

Inspect quantifications with low Vs values: sort the table by Vs and inspect the spectra using the spectrum button. At very low Vs values you will find completely useless spectra (bad fittings, mixtures, high background, etc.). You can either exclude these spectra from the statistics by marking them with numLabel1 = 0, or filter by a minimum Vs value (for instance Vs > 3). Non-quantifiable peptides (i.e. peptides not containing basic C-terminal residues in 18O-labelling or in SILAC) must also be excluded when calculating variances or performing the statistics.

Labelling efficiency for 18O labelling

If you have used 18O-labelling, you can check the labelling efficiency before any other analysis. Plot q_f versus Xs (consult how to create graphs). This plot does not differentiate between good and bad quantitations, so more and less accurate estimations of q_f are plotted together. It is therefore a good idea to remove bad quantitations from the plot by filtering out the data below an arbitrary minimum Vs value (for instance, keeping Vs > 30 in ZoomScan-quantitated spectra). Labelling efficiency must be above 0.8 for the vast majority of peptides. A cloud of points with q_f below 0.7 that tends to curve towards the right (increasing Xs values) is indicative of a poorly labelled experiment.
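The same check can be sketched numerically outside QuiXoT. The Python example below assumes the q_f and Vs columns have been exported to a list of dictionaries; the function name labelling_summary and the four rows of data are invented for illustration, and only the thresholds Vs > 30 and q_f > 0.8 come from the text.

```python
def labelling_summary(rows, min_vs=30.0, min_qf=0.8):
    """Fraction of well-weighted quantitations whose q_f is above min_qf."""
    # Keep only quantitations with enough statistical weight.
    kept = [r for r in rows if r["Vs"] > min_vs]
    if not kept:
        return None
    good = sum(1 for r in kept if r["q_f"] > min_qf)
    return good / len(kept)


# Made-up example rows (not real data).
rows = [
    {"Xs": 0.1, "Vs": 120.0, "q_f": 0.95},
    {"Xs": -0.3, "Vs": 45.0, "q_f": 0.91},
    {"Xs": 1.2, "Vs": 60.0, "q_f": 0.62},
    {"Xs": 0.4, "Vs": 2.0, "q_f": 0.30},  # low weight, filtered out
]
print(labelling_summary(rows))
```

A value close to 1 would agree with the requirement that the vast majority of peptides be well labelled. The plot remains necessary to see whether poorly labelled points curve towards increasing Xs.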

Part II: Analysing an experiment from scratch

Generating the files

Generate the QuiXML file containing the list of identified peptides. You can do this in different ways:

  • if you have identified using SEQUEST (which includes Proteome Discoverer), you may use pRatio on the .msf files (or on the .srf files, if you use an older version of SEQUEST)
  • if you have identified using Mascot, you may convert the .dat results file using MascotToQuiXML
  • if you have used another program, you will need to convert your data into a tab separated text file, and then parse the resulting table using CSVToQuiXML

You will also need the binStack folder, containing the binary files with the spectral information. It can be generated in different ways:

  • if you used Thermo RAW files, you can convert the spectra using RawToBinStack
  • if you used Mascot generic files (mgf), you can convert the spectra using mgfToBinStack
  • if your spectra are stored in a different file format, you should use an external converter to get an mgf file, and then use mgfToBinStack

Performing the first statistics

Enter an initial set of statistical parameters (k and variances) for the null-hypothesis model by using the change_values link. You can find a list of typical values for these parameters. Make an initial estimation of variances by pressing the var calc button. At this step you will have to tell QuiXoT which columns are going to be used as Xs and Vs (this is useful for multichannel labelling approaches such as iTRAQ, whose data contain several Xs and Vs values depending on the labels that are to be compared). Accept the newly calculated variances and perform the statistical analysis by pressing the stats button.

Inspecting spectra and peptides

A high resolution spectrum from an 18O-labelled experiment. Notice the light species (the four peaks on the left) and the heavy species (the 5th to 8th peaks). The theoretical peaks are shown in red, while the original, experimental spectrum is shown in blue. You can see a contaminant (or perhaps another less abundant peptide) on the right side of the spectrum (in blue only, as it does not match a theoretical spectrum in this case).

Check for outliers at the scan and peptide levels by using the graphs button and plotting Ws (or Wp) as X against Xs (or Xp) as Y, to see whether these data are influencing the variance calculation. Sort the data by FDRs (or FDRp) and check the rows with low FDR values (rows below 0.05 are statistically considered outliers). Typically a negligible proportion of outliers is found (less than 1% of the total); this is normal. However, if the number of outliers is too high, it may be indicative of quantification artefacts and/or problems in the labelling protocol.

Artefacts at the scan level are rare and may be produced by:

a) problems in mass calibration (spectra cannot be fitted to the theoretical mass envelope)
b) excessive noise and/or fluctuations in the detector
c) inadequate fitting parameters in the configuration files.

Artefacts at the peptide level are, however, much more frequent when peptides are labelled after digestion (which excludes SILAC). They include:

a) incomplete digestion of one of the samples (this may be easily checked by selecting peptide subpopulations using the st_PartialDig field and the filter tool in Vs versus Xp plots)
b) non-homogeneous methionine oxidation in the samples to be compared (this may be easily checked by filtering by the st_Meth field)
c) partially labelled peptides (with 18O-labelling this is indicated by q_f)

If any of these artefacts are encountered, outliers should be eliminated from variance calculation or statistics by using the filter tool.

A further inspection of proteins showing significant expression changes (low FDRq values) is advisable at this step, since keratins and other external contaminants such as trypsin may not be well balanced between the two samples and may introduce an artefactual variance at the protein level. Remove all the quantifications related to these contaminants from the statistics by applying an appropriate filter (consult applying filters to the data).

Variance calculation

Recalculate variances (var calc button), accept the resulting values and repeat the statistics (stats button). Check the null hypothesis behind the data. Press the graphs button, and select either Zs, Zp or Zq as X values and the sigmoidal normality plot option to check the null distributions at the scan, peptide or protein level, respectively. If everything is fine and the k constant and variances are properly calculated, these data (blue line) should produce a sigmoid corresponding to the normal distribution with a mean of zero and a standard deviation of one (red line). Deviations of the blue curve from the red curve indicate that the null hypothesis is not valid to analyse the data. There are different kinds of deviation:

  • if the blue curve is less steep (shallower slope) than the red curve, the variance has been underestimated.
  • if the blue curve is steeper (higher slope) than the red curve, the variance has been overestimated.
  • if the blue curve agrees with the red curve in the middle but is higher at low values and/or lower at high values, it may be indicative of the presence of outliers.
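The comparison between the blue and red curves can be sketched as follows in Python. This is an illustration under stated assumptions, not QuiXoT's own code: the function names are hypothetical, the empirical curve uses the plotting position (i + 0.5) / n, and the 0.9 and 1.1 cut-offs on the standard deviation of Z are arbitrary.

```python
from statistics import NormalDist, pstdev


def sigmoid_points(z_values):
    """Return (z, empirical CDF, standard normal CDF) triples, sorted by z."""
    zs = sorted(z_values)
    n = len(zs)
    normal = NormalDist(0.0, 1.0)
    # Second element plays the role of the blue curve, third of the red curve.
    return [(z, (i + 0.5) / n, normal.cdf(z)) for i, z in enumerate(zs)]


def variance_hint(z_values):
    """Crude diagnosis from the spread of the standardised values."""
    sd = pstdev(z_values)
    if sd > 1.1:  # arbitrary cut-off
        return "wider than N(0,1): variance probably underestimated"
    if sd < 0.9:  # arbitrary cut-off
        return "narrower than N(0,1): variance probably overestimated"
    return "close to N(0,1)"
```

Z values that are more spread out than a standard normal give a shallower empirical sigmoid, which is why a large standard deviation of Z points to an underestimated variance. A few strong outliers also inflate the standard deviation, so this hint does not replace inspecting the tails of the curve.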

Although the variances can be adjusted manually (using the change_values link) until the experimental and theoretical curves agree, these deviations usually indicate the presence of contaminants, artefacts or outliers that disturb normality and make variance estimation inaccurate. It is therefore advisable to investigate the underlying problem before adjusting the variances manually.

Using the linear normality plot

A more sensitive method to detect deviations from normality is to use the linear normality plot. In this graph a normal distribution produces a straight line, and the points outside the line correspond to outliers.
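One common way to build such a plot is a normal quantile-quantile plot, in which each sorted Z value is paired with the quantile expected from a standard normal distribution. The Python sketch below assumes that construction; the text does not state how QuiXoT draws its linear normality plot, and the function names and the tolerance of 1.0 are hypothetical.

```python
from statistics import NormalDist


def qq_points(z_values):
    """Pair each sorted Z value with its expected standard normal quantile."""
    zs = sorted(z_values)
    n = len(zs)
    normal = NormalDist(0.0, 1.0)
    return [(normal.inv_cdf((i + 0.5) / n), z) for i, z in enumerate(zs)]


def off_line(z_values, tolerance=1.0):
    """Return observed Z values lying farther than `tolerance` from the line y = x."""
    return [z for expected, z in qq_points(z_values) if abs(z - expected) > tolerance]


# Made-up Z values with one obvious outlier.
z = [-1.2, -0.6, -0.2, 0.1, 0.5, 0.9, 6.0]
print(off_line(z))
```

For normally distributed data the pairs fall close to the line y = x, so points far from that line are candidate outliers.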

A note of caution: variances can only be estimated from an experiment when the number of expression changes (or outliers) is low enough not to disturb the underlying normal distribution. As the proportion of proteins with expression changes increases, a point may be reached where there is no way to establish the true variance of the null hypothesis (i.e. the underlying experimental variance of the non-changing proteins). In such cases the only solution is to perform a parallel, null-hypothesis experiment to estimate the true protein variance, and to use this variance to analyse the results of the real experiment. This is a classic problem in statistics and has nothing to do with QuiXoT performance.

Final statistics

Once the null hypothesis at the three levels (scan, peptide and protein) is correctly established (i.e. normality is met and variances are correctly calculated), the final statistical analysis may be performed by pressing the stats button. Protein expression changes are usually considered significant when FDRq is lower than 0.05 (meaning that about 5% of the proteins reported as significantly changing are expected to appear so by chance alone). It is easy to highlight the changing proteins in a Wq vs Xq plot by applying the filter tool to the FDRq field. Note that FDRq < 0.05 is a multiple-hypothesis criterion and is a much more stringent condition than abs(Zq) > 1.96, which corresponds to a 0.05 probability that the protein, considered alone, deviates from the normal distribution.
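The difference between the two criteria can be illustrated with a small Python sketch. It assumes a Benjamini-Hochberg correction of two-sided p-values; the text does not state how QuiXoT computes FDRq, so this is only an analogy, and the Zq values are invented.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided tail probability of a standard normal Z value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up protein-level Z values.
z_q = [2.0, 4.5, 0.3, -2.1, -0.8, 1.1, 0.2, -0.5, 3.9, 0.9]
fdr = bh_adjust([two_sided_p(z) for z in z_q])
by_z = [z for z in z_q if abs(z) > 1.96]
by_fdr = [z for z, f in zip(z_q, fdr) if f < 0.05]
print(by_z)    # proteins passing the single-protein criterion
print(by_fdr)  # proteins passing the multiple-hypothesis criterion
```

In this example four proteins pass abs(Zq) > 1.96 but only two pass the corrected 0.05 threshold, which shows why the multiple-hypothesis criterion is the more stringent of the two.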