SanXoT

From PROTEOMICA
Revision as of 11:34, 5 March 2018 by Mtrevisan

SanXoT v2.13 is the central program of the SanXoT Software Package, developed in the Jesus Vazquez Cardiovascular Proteomics Lab at Centro Nacional de Investigaciones Cardiovasculares. It integrates experimental data into a higher level (for example, from peptide data to protein data) while determining the variance of the integration.

SanXoT needs two input files:

  • the lower level input data file, a tab separated text file containing three columns: the unique identifier of each lower level element (such as "RawFile05.raw-scan19289-charge2" for a scan, "CGLAGCGLLK" for a peptide sequence, or "P01308" for the Uniprot accession number of a protein), the Xi, which corresponds to log2(A/B), and the Vi, which corresponds to the weight of the measurement. These data must be pre-calibrated with a certain weight (see the help of the Klibrate program).
  • the relations file, a tab separated text file containing the higher level identifiers in the first column (such as a peptide sequence, a Uniprot accession number, or a Gene Ontology category) and, in the second column, the lower level identifiers that appear in the abovementioned input data file.
NOTE: you must include a first line header in all your files.
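
As an illustration of the two input formats, the following Python sketch writes a minimal data file and relations file. The file names, the header labels ("idinf", "X", "Vinf", "idsup"), the second scan identifier, and all numeric values are hypothetical placeholders, not names or values required by SanXoT; only the tab separated layout, the column order, and the presence of a header line come from the description above.

```python
import csv
import os
import tempfile


def write_tsv(path, header, rows):
    """Write a tab separated text file whose first line is a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical scan-level data: identifier, Xi = log2(A/B), Vi = weight.
data_rows = [
    ("RawFile05.raw-scan19289-charge2", 0.31, 12.5),
    ("RawFile05.raw-scan19301-charge2", 0.27, 9.8),
]

# Relations: higher level identifier first, lower level identifier second.
relation_rows = [
    ("CGLAGCGLLK", "RawFile05.raw-scan19289-charge2"),
    ("CGLAGCGLLK", "RawFile05.raw-scan19301-charge2"),
]

out_dir = tempfile.mkdtemp()
data_path = os.path.join(out_dir, "datafile.txt")
rels_path = os.path.join(out_dir, "relationsfile.txt")

# The header labels are placeholders; the text only requires that a header exists.
write_tsv(data_path, ("idinf", "X", "Vinf"), data_rows)
write_tsv(rels_path, ("idsup", "idinf"), relation_rows)
```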

And delivers six output files:

  • the output data file for the higher level, which has the same format as the lower level data file, but containing the ids of the higher level in the first column, the ratio Xj in the second column, and the weight Vj in the third column. By default, this file is suffixed as "_higherLevel".
  • two lower level output files, containing three columns each: in both, the first column contains the identifiers of the lower level, the second column contains Xinf - Xsup (i.e. the ratios of the lower level, centered on the higher level element they belong to), and the third column is either the new weight Winf (which incorporates the variance of the integration) or the former, untouched weight Vinf. For example, when integrating from scan to peptide, these files would contain first the scan identifiers, then Xscan - Xpep (the ratio of each scan compared to the peptide it identifies), and finally either Wscan (the weight of the scan, taking into account the variance of the scan distribution) or Vscan. By default, these files are suffixed "_lowerNormW" and "_lowerNormV".
  • a file useful for statistics, containing all the relations between the higher and lower level elements present in the data file, with a copy of their ratios X and weights V, followed by the number of lower elements contained in the upper element (for example, the number of scans that identify the same peptide), the Z (the distance, in sigmas, of the lower level ratio X from the higher level weighted average), and the FDR (the false discovery rate, important to keep track of changes or outliers). By default, this file is suffixed "_outStats".
  • an info file, containing a log of the performed integrations. Its last line is always in the form of "Variance = [double]". This file can be used as input in place of the variance (see -v and -V arguments). By default, this file is suffixed "_infoFile".
  • a graph file, depicting the sigmoid of the Z column which appears in the stats file, compared to the theoretical normal distribution. By default, this file is suffixed "_outGraph".
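
Because the info file always ends with a line of the form "Variance = [double]", the variance can be recovered from it programmatically. The sketch below is a minimal Python example of that idea; the function name read_variance and the sample log text are hypothetical, and the way SanXoT itself reads the value for -V may differ.

```python
import re

# Matches a line such as "Variance = 0.02922" (exponent notation also accepted).
VARIANCE_LINE = re.compile(
    r"^\s*Variance\s*=\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)\s*$"
)


def read_variance(lines):
    """Return the variance from the last matching line, or None if absent."""
    for line in reversed(list(lines)):
        match = VARIANCE_LINE.match(line)
        if match:
            return float(match.group(1))
    return None


# Hypothetical info file content; only the format of the last line
# comes from the description above.
sample_log = [
    "Some log line about the integration",
    "Variance = 0.02922",
]
variance = read_variance(sample_log)
```

To read a real info file, an open file handle can be passed in place of the list, for example read_variance(open(path)), since the function only iterates over lines.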

Usage:

sanxot.py -d[data file] -r[relations file] [OPTIONS]
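
For scripted runs, the usage pattern above can be assembled as an argument list. This is only a sketch: the helper build_sanxot_command is hypothetical, it attaches each value directly to its short option as in the usage line, and it assumes the program can be launched by the name given in the program parameter (on some systems the script may need to be preceded by a Python interpreter).

```python
import subprocess


def build_sanxot_command(data_file, relations_file, variance_seed=None,
                         force_parameters=False, program="sanxot.py"):
    """Assemble an argument list following the usage pattern above."""
    command = [program, "-d" + data_file, "-r" + relations_file]
    if variance_seed is not None:
        # -v: seed used to start calculating the variance
        command.append("-v" + str(variance_seed))
    if force_parameters:
        # -f: use the parameters as provided
        command.append("-f")
    return command


command = build_sanxot_command("datafile.txt", "relationsfile.txt",
                               variance_seed=0.02)

# Uncomment to launch the run on a machine where the program is available:
# subprocess.run(command, check=True)
```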

Arguments:

  -h, --help          Display basic help and exit.
  -H, --advanced-help Display this help and exit.
  -A, --outrandom=filename
                      To use a non-default name for the randomised relations
                      file (only applicable when -R is in use).
  -a, --analysis=string
                      Use a prefix for the output files. If this is not
                      provided, then the prefix will be garnered from the data
                      file.
  -b, --no-verbose    Do not print result summary after executing.
  -C, --confluence    A modified version of the relations file is used, where
                      all the destination higher level elements are "1". If no
                      relations file is provided, the program gets the lower
                      level elements from the first column of the data file.
  -d, --datafile=filename
                      Data file with identifiers of the lower level in the
                      first column, measured values (x) in the second column,
                      and weights (v) in the third column.
  -D, --removeduplicateupper
                      When merging data with relations table, remove duplicate
                      higher level elements (not removed by default).
  -f, --forceparameters
                      Use the parameters as provided, without using the
                      Levenberg-Marquardt algorithm. Negative variances will
                      be reset to zero (see -F if you do not wish this).
  -F, --forcenegativevariance
                      Though the indirect calculation of variance may lead to
                      a negative value, this has no mathematical meaning and
                      may cause a number of artefacts; hence, by default,
                      negative variances are automatically reset to zero.
                      However, for some analyses, it might be important seeing
                      the effect of original variance; for these cases, use
                      this option to override resetting negative variances to
                      zero.
  -g, --no-graph      Do not show the Zij vs rank / N graph.
  -G, --outgraph=filename
                      To use a non-default name for the graph file.
  -J, --includeorphans
                      In the case all the lower elements pointing to a higher
                      level element are excluded, the default behaviour is
                      removing the higher level element altogether. Adding
                      this option, the lower level elements will be integrated
                      in any case.
  -l, --graphlimits=integer
                      To set the +- limits of the Zij graph (default is 6). If
                      you want the limits to be between the minimum and
                      maximum values, you can use -l.
  -L, --infofile=filename
                      To use a non-default name for the info file.
  -m, --maxiterations=integer
                      Maximum number of iterations performed by the Levenberg-
                      Marquardt algorithm to calculate the variance. If
                      unused, then the default value of the algorithm is
                      taken.
  -M, --minseed=float To use a non-default minimum seed. Default is 1e-3.
  -o, --higherlevel=filename
                      To use a non-default higher level output file name.
  -p, --place, --folder=foldername
                      To use a different common folder for the output files.
                      If this is not provided, the folder used will be the
                      same as the input folder.
  -r, --relfile, --relationsfile=filename
                      Relations file, with identifiers of the higher level
                      in the first column, and identifiers of the lower
                      level in the second column.
  -R, --randomise, --randomize
                      A modified version of the relations file is used, where
                      the higher level elements (first column) are replaced by
                      numbers and randomly written in the first column. The
                      numbers range from 1 to the total number of elements.
                      The second column (containing the lower level elements)
                      remains unchanged.
  -s, --no-steps      Do not print result summary and the steps of every
                      Levenberg-Marquardt iteration.
  -t, --graphtitle=string
                      The graph title (default is
                      "Zij graph for sigma^2 = [variance]").
  -T, --minimalgraphticks
                      It will only show the x secondary line for x = 0, and
                      none for the Y axis (useful for publishing).
  -u, --lowernormw=filename
                      To use a non-default lower level output file name,
                      setting W as weight (default suffix is _lowerNormW).
  -U, --lowernormv=filename
                      To use a non-default lower level output file name,
                      setting V as weight (default suffix is _lowerNormV).
  -v, --var, --varianceseed=double
                      Seed used to start calculating the variance.
                      Default is 0.001.
  -V, --varfile=filename
                      Get the variance value from a text file. It must contain
                      a line (not more than once) with the text
                      "Variance = [double]". This suits the info file from
                      another integration (see -L).
  -W, --graphlinewidth=float
                      Use a non-default value for the sigmoid line width.
                      Default is 1.0.
  -w, --varconf=integer
                      Get the confidence limits of the variance by
                      performing n simulations.
  -y, --varconfpercent=float
                      Get the higher and lower limits to calculate the limits
                      of the variance (see -w). Default is 0.05.
  -z, --outstats=filename
                      To use a non-default stats file name.
  --emergencysweep    Use a sweep method instead of the Levenberg-Marquardt
                      algorithm if the number of tries (see -m) is reached.
                      Default number of decimals is 3, for different precision
                      use --sweepdecimals.
  --emergencyvariance In the case the maximum iterations are reached (see -m),
                      force the seed variance as emergency variance.
  --tags=string       To define a tag to distinguish groups to perform the
                      integration. The tag can be used by inclusion, such as
                           --tags="mod"
                      or by exclusion, putting first the "!" symbol, such as
                           --tags="!out"
                      Tags should be included in a third column of the
                      relations file. Note that the tag "!out" for outliers is
                      implicit.
                      Different tags can be combined using logical operators
                      "and" (&), "or" (|), and "not" (!), and parentheses.
                      Some examples:
                           --tags="!out&mod"
                           --tags="!out&(dig0|dig1)"
                           --tags="(!dig0&!dig1)|mod1"
                           --tags="mod1|mod2|mod3"
  --randomseed=float  The seed to be used in case the variance calculation
                      requires a random seed to be calculated (default is 0;
                      see also -m and --randomtimer).
  --randomtimer       When this is included, the hash of the current time is
                      used as seed in the case the variance requires a random
                      seed to be recalculated (see -m). If omitted, the seed
                      used is 0. Note --randomtimer overrides --randomseed.
                      For reproducibility, the hash of the time used is
                      included in the infoFile, so using --randomseed with
                      that value should give the exact same results.
  --sweepdecimals=float
                      The number of decimals up to which the variance will be
                      calculated if the maximum number of tries of the
                      Levenberg-Marquardt algorithm is reached (option -m),
                      and the --emergencysweep option is on. Default is 3.
  --xlabel=string     Use the selected string for the X label. Default is
                      "Zij". To remove the label, use --xlabel=" ".
  --ylabel=string     Use the selected string for the Y label. Default is
                      "Rank/N". To remove the label, use --ylabel=" ".

Examples (use "sanxot.py" if you are not using the standalone version):

  • To calculate the variance starting with a seed = 0.02, using a datafile.txt and a relationsfile.txt, both in C:\temp:

sanxot -dC:\temp\datafile.txt -rrelationsfile.txt -v0.02

  • To get fast results of an integration forcing a variance = 0.02922:

sanxot -dC:\temp\datafile.txt -rrelationsfile.txt -f -v0.02922

  • To get an integration forcing the variance reported in the info file at C:\data\infofile.txt, and saving the resulting graph in C:\data\ instead of C:\temp\:

sanxot -dC:\temp\datafile.txt -rrelationsfile.txt -f -VC:\data\infofile.txt -GC:\data\graphFile.png
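
As a further illustration, the logical expressions accepted by the --tags option can be pictured with a small evaluator. The sketch below is not SanXoT's implementation: the function names are hypothetical, it assumes the tags of a relation are available as a set of strings, and it assumes the usual precedence ("!" binds tighter than "&", which binds tighter than "|"). It does not model the implicit "!out" mentioned in the option description.

```python
import re

# A token is a single operator or parenthesis, or a run of other characters (a tag).
TOKEN = re.compile(r"\s*([()&|!]|[^()&|!\s]+)")


def tokenize(expression):
    """Split an expression such as '!out&(dig0|dig1)' into tokens."""
    expression = expression.strip()
    tokens, position = [], 0
    while position < len(expression):
        match = TOKEN.match(expression, position)
        tokens.append(match.group(1))
        position = match.end()
    return tokens


def matches(expression, tags):
    """Return True if the set `tags` satisfies the logical expression."""
    tokens = tokenize(expression)
    index = 0

    def parse_or():
        nonlocal index
        value = parse_and()
        while index < len(tokens) and tokens[index] == "|":
            index += 1
            right = parse_and()  # always parse, so tokens are consumed
            value = value or right
        return value

    def parse_and():
        nonlocal index
        value = parse_not()
        while index < len(tokens) and tokens[index] == "&":
            index += 1
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        nonlocal index
        if index >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[index]
        index += 1
        if token == "!":
            return not parse_not()
        if token == "(":
            value = parse_or()
            if index >= len(tokens) or tokens[index] != ")":
                raise ValueError("missing closing parenthesis")
            index += 1
            return value
        if token in {")", "&", "|"}:
            raise ValueError("unexpected token: " + token)
        return token in tags  # a plain tag name matches by inclusion

    result = parse_or()
    if index != len(tokens):
        raise ValueError("unexpected token: " + tokens[index])
    return result


# Under these assumptions, a relation tagged "dig1" passes this filter:
selected = matches("!out&(dig0|dig1)", {"dig1"})
```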