Tool Usage

To run the motif analysis tool, you must specify the analysis type (matching or enrichment). Motif enrichment depends on the output of motif matching, so run motif matching first and motif enrichment afterwards.
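To make the order of the two steps concrete, the following minimal Python sketch assembles (but does not execute) one command line per step. It uses only flags documented on this page. The helper names are hypothetical, and using the random regions written by the matching step as the enrichment background is an assumption made for illustration.

```python
import shlex


def matching_command(inputs, rand_proportion=10, output_location="match"):
    """Assemble the motif matching command line (step 1)."""
    return [
        "rgt-motifanalysis", "--matching",
        "--rand-proportion", str(rand_proportion),
        "--output-location", output_location,
        *inputs,
    ]


def enrichment_command(background, inputs, matching_location="match"):
    """Assemble the motif enrichment command line (step 2)."""
    return [
        "rgt-motifanalysis", "--enrichment",
        "--matching-location", matching_location,
        background,
        *inputs,
    ]


regions = ["input1.bed"]
step1 = matching_command(regions)
# Assumption: the random regions produced by step 1 serve as the background.
step2 = enrichment_command("match/random_regions.bed", regions)

# Print the commands in the order they would have to be run.
for command in (step1, step2):
    print(shlex.join(command))
```

The commands are only printed here; to run them, each list could be passed to subprocess.run, in this order.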

Below you will find the following nomenclature:
Required Input: Input that the program requires. Required inputs are written between < and > and must be passed in the order in which they are described.
Options: Additional input files, paths, parameters or output options.

 

Motif Matching

Command:

rgt-motifanalysis --matching [options] [input1.bed input2.bb ...]

Inputs

Inputs are genomic regions in BED or BigBed format that you wish to perform motif matching on. If an experimental matrix is provided, the inputs are ignored. If a Gene List is provided, then both the inputs and the experimental matrix are ignored, since the regions will be extracted from the promoters of the organism’s genes.
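The precedence between the three region sources can be summarized in a few lines of Python. This is a sketch of the rule as described above, not the tool's actual code; the function name and return values are hypothetical.

```python
def select_region_source(input_files, input_matrix=None, gene_list=None):
    """Return which source of regions takes effect, following the stated precedence.

    A gene list overrides everything else; an experimental matrix overrides
    the plain input files; otherwise the input files are used.
    """
    if gene_list is not None:
        return ("gene_list", gene_list)
    if input_matrix is not None:
        return ("input_matrix", input_matrix)
    return ("input_files", list(input_files))


# The gene list wins even though inputs and a matrix are also given.
print(select_region_source(["input1.bed"], input_matrix="matrix.txt", gene_list="genes.txt"))
```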

Options

Option Name Type Default Description
--organism String hg19 Organism on which the analysis is performed. All default files, such as genomes, are chosen based on this organism and the data.config file. See the documentation on rgtdata and the data.config file for more information.
--fpr Float 0.0001 False positive rate cutoff for motif matching.
--precision Integer 10000 Score distribution precision for determining false positive rate cutoff.
--pseudocounts Float 0.1 Pseudocounts to be added to raw counts of each PFM.
--rand-proportion Float None If set, a random regions file will be created (e.g., for later enrichment analysis) and matched against (i.e., a corresponding MPBS file will be created too). The number of coordinates will be equal to this value times the size of the input regions. We advise using a value of at least 10.
--norm-threshold Boolean False If this option is used, the thresholds for all PWMs are normalized by motif length. The threshold cutoff of each motif is first evaluated in the regular way from the given fpr. Each threshold is then divided by the length of its motif. The final threshold is the average of all normalized motif thresholds, and this single threshold is applied to all motifs.
--use-only-motifs Path None Only use the motifs listed in this file (one per line).
--input-matrix Path None If an experimental matrix is provided, the input arguments will be ignored.
--gene-list Path None List of genes (one per line) to get the promoter regions from. If a genes file is provided, the input files and experimental matrix will be ignored.
--make-background Boolean False If set, it will perform motif matching on the ‘background regions’, composed of the promoters of all available genes for the chosen organism. It doesn’t require --gene-list.
--promoter-length Integer None Length of the promoter region (in bp) to be extracted from each gene. Only used if --gene-list is specified.
--output-location Path match Path where the output MPBS files will be written. Defaults to ‘match’ in the current directory.
--bigbed Boolean False If this option is used, all bed files will be written as bigbed.
--normalize-bitscore Boolean False In order to print bigbed files the scores need to be normalized between 0 and 1000. Don’t use this option if real bitscores should be printed in the resulting bed file. Without this option, bigbed files will never be created.
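The arithmetic behind --norm-threshold, as described in the table above, can be sketched as follows. The function name and the example numbers are hypothetical; the per-motif thresholds are assumed to have already been derived from the fpr.

```python
def normalized_threshold(thresholds, lengths):
    """Average of the per-motif thresholds after dividing each by its motif length."""
    if not thresholds or len(thresholds) != len(lengths):
        raise ValueError("need one threshold and one length per motif")
    per_motif = [t / n for t, n in zip(thresholds, lengths)]
    return sum(per_motif) / len(per_motif)


# Two hypothetical motifs: threshold 10.0 at length 10, threshold 6.0 at length 12.
# Normalized thresholds are 1.0 and 0.5, so the shared threshold is their mean.
print(normalized_threshold([10.0, 6.0], [10, 12]))
```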

Special Input File Formats

Output

This analysis will populate the specified matching folder with the following files:

  • X_mpbs.[bed or bb]: BED or BigBed file containing the matched motif instances for each input file X (or for the selected genes’ promoter regions, if --gene-list is provided). MPBS stands for Motif-Predicted Binding Sites. If --rand-proportion or --make-background is set, then a random_regions_mpbs.bed or a background_regions_mpbs.bed file, respectively, will also be present.
  • random_regions.[bed or bb]: BED or BigBed file containing the random background regions (this file is generated only if random regions are requested with --rand-proportion), plus the corresponding _mpbs file.
  • background_regions.[bed or bb]: BED or BigBed file containing the genomic regions corresponding to all genes of the target organism (this file is generated only if such a background is requested with --make-background), plus the corresponding _mpbs file.
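The file list above can be turned into a small helper that predicts which files to expect in the matching folder. This is only a sketch: the function name is hypothetical, and it assumes that X is the input file name without its extension.

```python
from pathlib import Path


def expected_matching_outputs(input_files, output_location="match",
                              rand_proportion=None, make_background=False,
                              bigbed=False):
    """List the files that the matching step is described as writing."""
    # bigbed=True stands for --bigbed used together with --normalize-bitscore,
    # since BigBed files are never created without the latter.
    ext = "bb" if bigbed else "bed"
    # Assumption: X in X_mpbs is the input file name without its extension.
    names = [f"{Path(f).stem}_mpbs.{ext}" for f in input_files]
    if rand_proportion is not None:
        names += [f"random_regions.{ext}", f"random_regions_mpbs.{ext}"]
    if make_background:
        names += [f"background_regions.{ext}", f"background_regions_mpbs.{ext}"]
    return [Path(output_location) / name for name in names]


for path in expected_matching_outputs(["regions/input1.bed"], rand_proportion=10):
    print(path)
```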

 

Motif Enrichment

Command:

rgt-motifanalysis --enrichment [options] <background_bed_file> [input1.bed input2.bb ...]

Inputs

A required input is the file with the regions to be used as a “background” in the enrichment statistics. The file must be in either BED or BigBed format, and it must have a corresponding MPBS file as produced from motif matching. If the background file is named “background.bed”, then the MPBS file must be named “background_mpbs.bed” and be located either in the matching location, or in the same directory as the background.
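The naming rule for the background MPBS file can be checked before running the enrichment. The sketch below derives the expected file name and looks in the matching location first, then next to the background file, which is the search order given in the description of --background-file. The function names are hypothetical, and keeping the background file's own extension for the MPBS file is an assumption.

```python
from pathlib import Path


def mpbs_candidates(background_file, matching_location="match"):
    """Return the candidate MPBS paths for a background file, in search order."""
    background = Path(background_file)
    # Assumption: the MPBS file keeps the extension of the background file.
    mpbs_name = f"{background.stem}_mpbs{background.suffix}"
    return [Path(matching_location) / mpbs_name, background.parent / mpbs_name]


def find_mpbs(background_file, matching_location="match"):
    """Return the first candidate that exists on disk, or None."""
    for candidate in mpbs_candidates(background_file, matching_location):
        if candidate.is_file():
            return candidate
    return None


print(mpbs_candidates("data/background.bed"))
```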

Input files can be specified just like for matching, in either BED or BigBed format. If an experimental matrix is provided, input files are ignored.

Options

Option Name Type Default Description
--organism String hg19 Organism considered in the analysis. Check our full documentation for all available options. All default files, such as genomes, will be based on the chosen organism and the data.config file.
--promoter-length Integer 1000 Length of the promoter region (in bp) considered when creating the region-gene association.
--maximum-association-length Integer 50000 Maximum distance between a coordinate and a gene (in bp) in order for the former to be considered associated with the latter.
--multiple-test-alpha Float 0.05 Alpha value for the multiple testing correction.
--background-file Path None Path to the BED file to be used as background. The corresponding MPBS file is expected to have an ‘_mpbs’ suffix appended to the file name. For example, ‘background.bed’ should have a corresponding ‘background_mpbs.bed’. The MPBS file is first searched for in the matching location and, if not found there, in the same directory as the background file.
--use-only-motifs Path None Only use the motifs listed in this file (one per line).
--matching-location Path match Directory where the matching output containing the MPBS files resides. Defaults to ‘match’ in the current directory.
--input-matrix Path None If an experimental matrix is provided, the input arguments will be ignored.
--output-location Path enrichment Path where the enrichment output files will be written. Defaults to ‘enrichment’ in the current directory.
--print-thresh Float 0.05 Only Motif-Predicted Binding Sites (MPBSs) whose factor’s corrected enrichment p-value is less than or equal to this value are printed. Use 1.0 to print all MPBSs.
--bigbed Boolean False If this option is used, all bed files will be written as bigbed.
--no-copy-logos Boolean False If set, the logos to be shown on the enrichment statistics page will NOT be copied to a local directory; instead, the HTML result file will contain absolute paths to the logos in your rgtdata folder.

Special Input File Formats

When gene sets are used, the experimental matrix may have one additional column, which groups regions by gene set. See the example below:

name type file genegroup
# Regions
H1-hESC regions ./Input/regions_H1hesc.bed geneset1
HeLa-S3 regions ./Input/regions_HeLaS3.bed geneset1
HepG2 regions ./Input/regions_HepG2.bed geneset2
K562 regions ./Input/regions_K562.bed geneset2
# Genes
UP_REG genes ./Input/up_regulated_genes.txt geneset1
DW_REG genes ./Input/down_regulated_genes.txt geneset2

In this example, the H1-hESC and HeLa-S3 cell types will be associated with the up-regulated (UP_REG) genes, and HepG2 and K562 will be associated with the down-regulated (DW_REG) genes. Comment lines (starting with #) were added only for clarity and are not required.
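The grouping implied by the genegroup column can be reproduced with a short parser. This is a sketch that assumes whitespace-separated columns, which may differ from the separator the tool actually expects; the function name is hypothetical.

```python
from collections import defaultdict

MATRIX_TEXT = """\
name type file genegroup
# Regions
H1-hESC regions ./Input/regions_H1hesc.bed geneset1
HeLa-S3 regions ./Input/regions_HeLaS3.bed geneset1
HepG2 regions ./Input/regions_HepG2.bed geneset2
K562 regions ./Input/regions_K562.bed geneset2
# Genes
UP_REG genes ./Input/up_regulated_genes.txt geneset1
DW_REG genes ./Input/down_regulated_genes.txt geneset2
"""


def group_by_geneset(lines):
    """Map each genegroup label to the region and gene entries that share it."""
    groups = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split()  # assumption: columns are whitespace-separated
        if fields[:2] == ["name", "type"]:
            continue  # skip the header row
        name, kind, _path, genegroup = fields[:4]
        groups[genegroup].setdefault(kind, []).append(name)
    return dict(groups)


for label, members in group_by_geneset(MATRIX_TEXT.splitlines()).items():
    print(label, members)
```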

Output

This analysis will populate the folder specified with --output-location. This folder will contain subfolders whose names depend on the analysis type. If gene sets are provided, each folder will be named X__Y, where X is the region file name and Y is the name of the gene set file associated with that region. Otherwise, each folder will be named after the region name only. Each of these folders will contain the following files:

  • coord_association.[bed or bb]: File containing the association between gene sets and regions. If no gene sets were provided, this file will associate all regions with all genes of the organism being used.
  • mpbs_ev.[bed or bb]: Contains all Motif-Predicted Binding Sites (MPBSs) that occurred in regions that were matched to genes, if gene sets were given; or in all regions, if no gene sets were provided.
  • mpbs_nev.[bed or bb]: Contains all Motif-Predicted Binding Sites (MPBSs) that occurred in regions that were NOT matched to genes, if gene sets were given. If gene sets were not given, this output is not available.
  • fulltest_statistics.[html and txt]: Contains the enrichment results in HTML and tab-separated text formats. The results include, in order, the motif name, enrichment p-value, corrected enrichment p-value, Fisher’s exact test A value, Fisher’s exact test B value, Fisher’s exact test C value, Fisher’s exact test D value, foreground frequency (A/(A+B)), background frequency (C/(C+D)) and a comma-separated list of the genes that were matched to the regions in which that motif occurred.
  • genetest_statistics.[html and txt]: Same as the above files, but regarding the gene association test. Available only if gene sets are specified in the experimental matrix.
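The A, B, C and D columns of the statistics files can be related to the reported frequencies with a few lines of Python. The sketch below assumes, from the formulas A/(A+B) and C/(C+D), that A and B count foreground regions with and without the motif, and that C and D count the same for the background. The p-value shown is a one-sided Fisher's exact test; whether the tool uses a one-sided or a two-sided test is not stated here, so treat that part as an illustration only. The function names are hypothetical.

```python
from math import comb


def frequencies(a, b, c, d):
    """Foreground frequency A/(A+B) and background frequency C/(C+D)."""
    return a / (a + b), c / (c + d)


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) with all margins held fixed."""
    foreground = a + b      # total foreground regions
    with_motif = a + c      # total regions containing the motif
    total = a + b + c + d
    upper = min(foreground, with_motif)
    favourable = sum(
        comb(foreground, k) * comb(total - foreground, with_motif - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, with_motif)


# Hypothetical counts: 3 of 4 foreground regions and 1 of 4 background regions hit.
print(frequencies(3, 1, 1, 3))
print(fisher_one_sided(3, 1, 1, 3))
```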