RssGsc

Documentation

To learn how to install and use the program, just watch the following videos:
     Install & run
     Usage example
     Option: Number of sets to select
     Options: Rank sum / Fisher / sort
     Option: Ignore gene sets...

You can find a detailed explanation about rank sum statistics applied to gene set collections here.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

FAQ: Frequently Asked Questions

What is RssGsc?

RssGsc is a bioinformatics tool. It identifies biologically meaningful gene sets from a ranked list of genes (experimental data). This is often called "enrichment analysis of gene sets".

Are there any other tools for doing "enrichment analysis"?

Of course. By far the best known is GSEA.

How does RssGsc compare to GSEA?

The main goal while creating RssGsc was "a simple program" to obtain a "small list of meaningful gene sets". This goal is quite different from GSEA's approach, so a comparison is probably unfair. Plus, I'm biased :-)

RssGsc is faster than GSEA. On my computer, running GSEA can take 10 to 20 minutes, while RssGsc takes less than 10 seconds (same data, of course).
RssGsc is simpler than GSEA: there are no parameters to tune. GSEA has dozens of parameters that can (or have to) be tuned, although the defaults should be pretty good.
RssGsc results are easy to understand: it's just a Top10 list, whereas GSEA produces hundreds of numbers, plots, etc.

What kind of data do I need to use RssGsc?

You need:
     A collection of gene sets (some are provided, see 'data' directory).
     A list of genes and their experimental values. This is the outcome of your experiment (e.g. DNA chips, sequencing, etc.).
     A cache file for rank sum statistics (this file is provided, see 'data' directory).

What kind of computer do I need to use RssGsc?

It is platform independent (written in Java), so any computer with Java can run it. You only need Java 1.6 (or higher) installed. Most modern computers have it.

How do I install and run RssGsc?

It's trivial: watch this video.

How do I use RssGsc?

Just load the data and click "Run". Watch this video.

What is the "Number of sets to select" option and how do I use it?

Watch this video.

What are the "Rank sum / Fisher / sort" options and how do I use them?

Watch this video.

What is the "Ignore gene sets..." option and how do I use it?

Watch this video.

Can I use it from the command line?

Yes! Command line arguments are:

    java -jar RssGsc.jar geneSetsFile.gmt rankedGenesFile.rnk rsnrPdfFile.prob numberOfGeneSetsToSelect \
                         [minimumGeneSetSize maximumGeneSetSize useInterestingSets useOverRepresentedTerms]

    Where:
        geneSetsFile.gmt         : A gene sets file (gene set collection) in GMT format
                                   (GMT format is explained below)
        rankedGenesFile.rnk      : A ranked list of genes (from your experiment) in RNK format
                                   (RNK format is explained below)
        rsnrPdfFile.prob         : A probability distribution cache file
                                   (provided in 'data/rank_sum_no_replacement.prob')
        numberOfGeneSetsToSelect : How many gene sets do you want to select?

    Optional arguments (Note: you either provide none or you provide all of them!)
        minimumGeneSetSize       : Filter out gene sets containing fewer than minimumGeneSetSize genes
        maximumGeneSetSize       : Filter out gene sets containing more than maximumGeneSetSize genes
        useInterestingSets       : When using Fisher exact test, only use gene sets with 
                                   one or more genes (used to speed up the search)
        useOverRepresentedTerms  : When using Fisher exact test, only use gene sets with 
                                   p-value less than 0.1 (used to speed up the search)

E.g. you can use the following command line (Unix-like systems):

	java -jar RssGsc.jar ./data/c2.all.v2.5.symbols.gmt \
                         ./data/Diabetes_hgu133a_collapsed_to_symbols.rnk \
                         ./data/rank_sum_no_replacement.prob \
                         10 0 0 false false | less -S
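
If you want to launch RssGsc from a script, the command line above can be assembled programmatically. The following is only a sketch in Python: the helper name build_rssgsc_command is hypothetical (it is not part of RssGsc), and it assumes RssGsc.jar is in the working directory. It mainly illustrates the "none or all" rule for the optional arguments.

```python
import subprocess


def build_rssgsc_command(gmt, rnk, prob, num_sets, optional=None, jar="RssGsc.jar"):
    """Build the argument list for the RssGsc command line (hypothetical helper)."""
    cmd = ["java", "-jar", jar, gmt, rnk, prob, str(num_sets)]
    if optional is not None:
        # Optional arguments must be given all together, in this order:
        # minimumGeneSetSize, maximumGeneSetSize,
        # useInterestingSets, useOverRepresentedTerms
        if len(optional) != 4:
            raise ValueError("provide either none or all four optional arguments")
        min_size, max_size, use_interesting, use_over = optional
        cmd += [
            str(min_size),
            str(max_size),
            str(bool(use_interesting)).lower(),  # "true" / "false"
            str(bool(use_over)).lower(),
        ]
    return cmd


cmd = build_rssgsc_command(
    "./data/c2.all.v2.5.symbols.gmt",
    "./data/Diabetes_hgu133a_collapsed_to_symbols.rnk",
    "./data/rank_sum_no_replacement.prob",
    10,
    optional=(0, 0, False, False),
)
print(" ".join(cmd))

# To actually run it (requires Java and RssGsc.jar):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list (instead of one shell string) avoids quoting problems when file names contain spaces.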

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

FAQ: File formats

What is the format used for gene set collection files?

We use the GMT file format. You can read a detailed explanation here.
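
As a rough illustration, a GMT file can be read with a few lines of Python. This sketch assumes the usual GMT layout (one gene set per line, tab-separated: set name, description, then the genes); the linked explanation is the authoritative reference. The function name parse_gmt and the gene names in the example are made up.

```python
import io


def parse_gmt(handle):
    """Return {gene set name: set of genes} from a GMT-formatted text stream."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Tiny in-memory example (made-up gene sets)
example = io.StringIO(
    "SET_A\tdescription A\tGENE1\tGENE2\tGENE3\n"
    "SET_B\tdescription B\tGENE2\tGENE4\n"
)
gene_sets = parse_gmt(example)
for name, genes in gene_sets.items():
    print(name, len(genes), sorted(genes))
```

For a real file, replace the StringIO object with open("./data/c2.all.v2.5.symbols.gmt").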

What does c1, c2,..., c5 mean?

These are the MSigDB collection names. Each gene set collection represents a different type of information. The following descriptions are quoted from the MSigDB web site:

C1: Positional gene sets. Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene. These gene sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.

C2: Curated gene sets. Gene sets collected from various sources such as on-line pathway databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source.

C3: Motif gene sets. Gene sets that contain genes that share a cis-regulatory motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued in Xie, et al. (2005, Nature 434, 338–345) and represent known or likely regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in a microarray experiment to a conserved, putative cis-regulatory element.

C4: Computational gene sets. Computational gene sets defined by mining large collections of cancer-oriented microarray data.

C5: GO gene sets. Gene sets are named by GO term and contain genes annotated by that term. GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not generally consist of co-regulated genes.

What is the format used for ranked genes files?

We use the RNK file format. You can read a detailed explanation here.
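
For illustration, here is a minimal Python sketch that reads an RNK file. It assumes two tab-separated columns per line (gene name, then a numeric value). Skipping lines that start with '#' and sorting by decreasing value are assumptions of this sketch, not documented behaviour of RssGsc. The function name parse_rnk and the example genes are made up.

```python
import io


def parse_rnk(handle):
    """Return a list of (gene, value) pairs sorted by decreasing value."""
    ranked = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no data
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no value column on this line
        ranked.append((fields[0], float(fields[1])))
    # Assumption: higher value means higher rank
    ranked.sort(key=lambda gene_value: gene_value[1], reverse=True)
    return ranked


# Tiny in-memory example (made-up genes and values)
example = io.StringIO("GENE1\t0.5\nGENE2\t2.25\nGENE3\t-1.0\n")
ranked = parse_rnk(example)
for rank, (gene, value) in enumerate(ranked, start=1):
    print(rank, gene, value)
```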

What is the format used for 'interesting' genes files?

We use the RNK file format. The only difference is that the 'weight' column (second column) is ignored. All genes in the first column are considered 'interesting'.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

FAQ: Etc.

What is the "prob" file?

It is used in order to speed up the algorithm. It takes a long time to calculate the rank sum probability distribution function, so we pre-calculated it and stored the values in this file.
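
To give an idea of why this is expensive, here is a sketch in Python of one way to compute an exact rank sum distribution when ranks are drawn without replacement. It is an assumption that this matches what the cache stores: it is not the code used by RssGsc and says nothing about the layout of the "prob" file. The function names are made up.

```python
from math import comb


def rank_sum_counts(total, size):
    """counts[s] = number of ways to pick `size` distinct ranks from 1..total
    whose sum is s (dynamic programming over the ranks)."""
    max_sum = sum(range(total - size + 1, total + 1))  # the `size` largest ranks
    counts = [[0] * (max_sum + 1) for _ in range(size + 1)]
    counts[0][0] = 1  # one way to pick nothing, with sum 0
    for rank in range(1, total + 1):
        # Go from large k to small k so each rank is used at most once
        for k in range(min(size, rank), 0, -1):
            row, prev = counts[k], counts[k - 1]
            for s in range(rank, max_sum + 1):
                row[s] += prev[s - rank]
    return counts[size]


def rank_sum_p_value(observed_sum, total, size):
    """Probability that a random gene set of `size` genes (out of `total`
    ranked genes) has a rank sum <= observed_sum."""
    counts = rank_sum_counts(total, size)
    return sum(counts[: observed_sum + 1]) / comb(total, size)


# Small example: 2 genes out of 5; only {1,2} and {1,3} have rank sum <= 4
p = rank_sum_p_value(4, total=5, size=2)
print(p)
```

The table grows with both the number of genes and the gene set size, which is why it pays off to compute the values once and store them.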

What is the "all.v2.5.symbols.gmt" file?

It's just all the c*v2.5.symbols.gmt files together (I just did: 'cat c*v2.5.symbols.gmt > all.v2.5.symbols.gmt') so you can test significance on all gene sets at once.
Note 1: I don't know if that makes sense for your particular experiment...
Note 2: It's a big gene set collection, so it will take a while.

What is the "p-value"?

It is the probability that the gene sets were selected "just by chance". Very low p-values mean that the selected gene sets are "meaningful".
Note: When you select some number of gene sets, the first ones may have a significant p-value, but as you keep adding gene sets, the p-value will go up (less significant). So even if the program reports a p-value of 1.0 (not significant at all, i.e. useless), the first selected gene sets may have a significant p-value.
E.g. let's say you selected 20 gene sets and the resulting p-value is 1.0 (useless). It may be that if you select only 10 gene sets, the p-value is 0.0000001 (meaningful).

What is "p-value (corrected)"?

It is a p-value corrected for multiple testing. We use Bonferroni correction, so it is very stringent.
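
As a quick illustration, the Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1.0. In this Python sketch the number of tests is simply the length of the list, which is an assumption: the exact count used by RssGsc is not stated here. The p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: p * (number of tests), capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


raw = [0.001, 0.02, 0.5]  # made-up p-values for three tests
corrected = bonferroni(raw)
for p, q in zip(raw, corrected):
    print(p, "->", q)
```

With many tests the multiplier is large, so only very small raw p-values stay significant. That is what "stringent" means here.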

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Author: Pablo Cingolani (pcingola@users.sourceforge.net)