Searches for sequence motif instances in the genome. The search uses a PSSM (Position-Specific Scoring Matrix), a normalized, background-corrected version of a PFM (Position Frequency Matrix), which records how often each nucleotide occurs at each position of the motif. The PSSM is matched against every position within the regions of interest in the selected genome. This produces a score for every position at which the PSSM was applied, which can be interpreted as the binding affinity of the cis-acting element for that position.
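As an illustration of the PFM-to-PSSM step and the position-by-position scan, here is a minimal sketch. It is not the tool's actual implementation: the function names, the pseudocount of 1, the uniform background and the forward-strand-only scan are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def pfm_to_pssm(pfm, background=None, pseudocount=1.0):
    """Turn a PFM (base -> list of counts per position) into a log-odds PSSM.

    Assumptions for this sketch: a uniform background unless one is given,
    and a pseudocount added to every cell to avoid log(0).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pssm = []
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount * len(BASES)
        column = {}
        for b in BASES:
            freq = (pfm[b][i] + pseudocount) / total
            # Background correction: log-ratio of motif vs. background frequency.
            column[b] = math.log2(freq / background[b])
        pssm.append(column)
    return pssm

def scan(sequence, pssm):
    """Score every window of len(pssm) on the forward strand.

    Assumes the sequence contains only uppercase A, C, G, T.
    Returns a list of (start, score) pairs.
    """
    m = len(pssm)
    scores = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        score = sum(column[base] for column, base in zip(pssm, window))
        scores.append((start, score))
    return scores

# Toy 4-bp motif resembling "ACGT" (counts are made up for illustration).
pfm = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 9, 0],
    "T": [0, 0, 0, 7],
}
pssm = pfm_to_pssm(pfm)
hits = scan("TTACGTAA", pssm)
best_start, best_score = max(hits, key=lambda h: h[1])
```

In this toy example the highest-scoring window is the one that starts at the embedded "ACGT". A real scan would typically also score the reverse complement and handle ambiguous bases such as N.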
To define a cutoff for accepting or rejecting motif instances based on their affinity score, we use a False Positive Rate (FPR) approach. Under this criterion, the distribution of scores for a particular PSSM is estimated with a dynamic programming algorithm, and the cutoff is then derived from a user-selected p-value. Best results can be achieved with p-values between 10^(-5) and 10^(-3).
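One common way to obtain such a score distribution is to discretize the PSSM scores and convolve the columns one at a time under an independent background model. The sketch below follows that idea. It is an assumption about how the dynamic programming could look, not a description of the tool's own code, and the names `score_distribution`, `cutoff_for_fpr` and `precision` are hypothetical.

```python
from collections import defaultdict

def score_distribution(pssm, background, precision=100):
    """Distribution of discretized PSSM scores under an i.i.d. background.

    pssm is a list of columns (base -> score). Scores are rounded to integers
    after multiplying by `precision`, so the table stays small. The state after
    processing column i maps each reachable partial score to its probability.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, value in column.items():
                nxt[score + round(value * precision)] += prob * background[base]
        dist = dict(nxt)
    return dist

def cutoff_for_fpr(dist, fpr, precision=100):
    """Lowest score whose tail probability P(score >= cutoff) is <= fpr.

    Returns None when even the maximum score is more likely than `fpr`.
    """
    tail = 0.0
    cutoff = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > fpr:
            break
        cutoff = score
    return None if cutoff is None else cutoff / precision

# Toy 2-column PSSM: "A" scores 2.0, every other base scores -1.0.
toy_pssm = [{"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0}] * 2
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = score_distribution(toy_pssm, uniform)
cutoff = cutoff_for_fpr(dist, fpr=0.1)
```

A motif this short cannot reach the p-values recommended above: its best possible score already occurs by chance with probability 1/16, so asking for an FPR of 0.01 returns `None`. Longer motifs give a finer-grained distribution and allow stricter cutoffs.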
Evaluates the enrichment of transcription factors in certain genomic regions. After performing motif matching of transcription factors in certain regions, we can evaluate which transcription factors are more likely to occur in those regions than in “background regions”. There are three types of tests available:
1. Input regions vs. Background regions: In this test, all input regions are compared against background genomic regions. These background regions are either user-provided or random genomic regions generated during “matching” with the same average length distribution as the original input regions. It is highly recommended that the number of background regions be at least 10 times the number of original regions for statistical significance. A Fisher’s exact test is performed for each transcription factor using the following counts:
– A = number of input regions with at least 1 occurrence of that particular transcription factor.
– B = number of input regions with no occurrence of that particular transcription factor.
– C = number of random background regions with at least 1 occurrence of that particular transcription factor.
– D = number of random background regions with no occurrence of that particular transcription factor.
After performing the Fisher’s exact test with the variables above, the results for all transcription factors are corrected for multiple testing with the Benjamini-Hochberg procedure.
2. Gene-associated regions vs. Non-gene-associated regions: In some analyses, we want to check whether a group of regions associated with genes of interest (e.g. up-regulated genes) is enriched for some transcription factors compared with regions that are not associated with those genes. In this case we perform a gene-region association to divide the input regions into these two groups. This association considers promoter-proximal regions, gene bodies and distal regions. After the association, we perform a Fisher’s exact test followed by multiple testing correction, as described for the previous analysis type.
3. Promoter regions of input genes vs. Background regions: If a gene list is provided outside of the context of the “gene association test” (i.e., NOT in the experimental matrix), then a promoter test is performed. We take all provided genes, find their promoter regions in the target organism and create a “target regions” BED file from those. Motif matching is then performed on the target regions and the provided background. In this case, three ways of specifying the background are available: as a normal BED file, randomly generated or, with the --make-background option, as a newly generated background made of the promoter regions of all genes not included in the provided gene list. Note that this is usually a big task, so it might take a long time and require a lot of memory.
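The statistical step shared by the tests above, a Fisher’s exact test on the A/B/C/D counts followed by Benjamini-Hochberg correction, can be sketched with the Python standard library alone. This is a hedged illustration rather than the tool's implementation: the text does not say whether the test is one- or two-sided, so the one-sided (enrichment) alternative used here is an assumption, and the transcription factor names and counts are invented.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a, b: input regions with / without a motif occurrence.
    c, d: background regions with / without a motif occurrence.
    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of seeing at least this many hits in the input set.
    """
    n_input = a + b
    n_with_hit = a + c
    n_total = a + b + c + d
    upper = min(n_input, n_with_hit)
    numerator = sum(
        comb(n_with_hit, k) * comb(n_total - n_with_hit, n_input - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_input)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts (A, B, C, D) per transcription factor.
counts = {
    "TF_1": (40, 60, 100, 900),
    "TF_2": (12, 88, 110, 890),
    "TF_3": (5, 95, 60, 940),
}
names = list(counts)
pvals = [fisher_exact_greater(*counts[name]) for name in names]
corrected = dict(zip(names, benjamini_hochberg(pvals)))
```

Each transcription factor gets its own 2x2 table, and the correction is applied once across all of them. The example uses ten background regions per input region, in line with the recommendation for the first test.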