Searches for sequence motif instances in the genome. The search uses a PSSM (Position-Specific Scoring Matrix), a normalized, background-corrected version of a simple PFM (Position Frequency Matrix), which records how often each nucleotide occurs at each position of the motif. The PSSM is matched against every position within the regions of interest in the selected genome. This yields a score for every position at which the PSSM was applied, which can be interpreted as the binding affinity of the corresponding cis-acting element to that position.
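As an illustration of the PFM-to-PSSM step and the position-by-position matching described above, here is a minimal sketch. It assumes a log2-odds score, a uniform background, and a small pseudocount; none of these choices is confirmed by this document. The function names (`pfm_to_pssm`, `scan`) and the toy counts are hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def pfm_to_pssm(pfm, background=None, pseudocount=0.1):
    """Turn a PFM (base -> list of counts per position) into a log-odds PSSM.

    Assumption: log2(frequency / background) with a pseudocount, which is one
    common normalization; the tool's exact formula may differ.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pssm = []
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount * len(BASES)
        column = {}
        for b in BASES:
            freq = (pfm[b][i] + pseudocount) / total
            column[b] = math.log2(freq / background[b])
        pssm.append(column)
    return pssm

def scan(sequence, pssm):
    """Score every window of the forward strand; returns (start, score) pairs."""
    m = len(pssm)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if any(c not in BASES for c in window):
            continue  # skip windows containing N or other symbols
        score = sum(pssm[i][c] for i, c in enumerate(window))
        hits.append((start, score))
    return hits

# Made-up counts for a 3-bp motif that strongly prefers "ACG".
pfm = {"A": [8, 0, 0], "C": [0, 8, 0], "G": [0, 0, 8], "T": [0, 0, 0]}
pssm = pfm_to_pssm(pfm)
hits = scan("TTACGTT", pssm)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start)  # 2, the window "ACG"
```

A real scan would also score the reverse complement of each window and restrict the search to the regions of interest.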
To decide whether to accept or reject a motif instance based on its affinity score, we define a cutoff using a False Positive Rate (FPR) approach. Under this criterion, the distribution of scores for a particular PSSM is estimated with a dynamic programming algorithm, and the cutoff is then set from a user-selected p-value. Best results can be achieved with p-values between 10^(-5) and 10^(-3).
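The FPR cutoff described above can be sketched as follows. This is not necessarily the implementation used by the tool: it assumes the PSSM scores have already been scaled and rounded to integers so that equal scores can be merged, and it assumes an independent, position-wise background model. The names `score_distribution` and `threshold_for_fpr` are hypothetical.

```python
from collections import defaultdict

def score_distribution(int_pssm, background):
    """Exact distribution of total scores under the background model.

    Dynamic programming over motif positions: after each column, `dist` maps
    every reachable partial score to its probability.
    """
    dist = {0: 1.0}
    for column in int_pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, s in column.items():
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def threshold_for_fpr(int_pssm, background, p_value):
    """Lowest score t such that P(score >= t) <= p_value, or None if no score qualifies."""
    dist = score_distribution(int_pssm, background)
    tail = 0.0
    threshold = None
    for score in sorted(dist, reverse=True):
        if tail + dist[score] > p_value:
            break
        tail += dist[score]
        threshold = score
    return threshold

# Toy integer PSSM: each of the 3 positions rewards A (+2) and penalizes the rest (-1).
toy_pssm = [{"A": 2, "C": -1, "G": -1, "T": -1} for _ in range(3)]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

dist = score_distribution(toy_pssm, uniform)
print(threshold_for_fpr(toy_pssm, uniform, 0.02))  # 6, since P(score >= 6) = 1/64
print(threshold_for_fpr(toy_pssm, uniform, 0.2))   # 3, since P(score >= 3) = 10/64
```

For a motif this short, no score is rare enough to satisfy a p-value such as 10^(-4), so the function returns None; longer motifs have a wider score range and support stricter cutoffs.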
Note that this algorithm is time-intensive. For this reason, we provide multiprocessing options: the user may select the number of processors used to run the motif matching.
Evaluates the enrichment of transcription factors in certain genomic regions. After performing motif matching of transcription factors in certain regions, we can evaluate which transcription factors are more likely to occur in those regions than in “background regions”. Two types of test are available:
1. Input regions vs. Random regions: In this test, all input regions are compared against background genomic regions. These background regions are random genomic regions with the same average length distribution as the original input regions. It is highly recommended that the number of background regions be at least 10 times the number of original regions for statistical significance. A Fisher’s exact test is performed for each transcription factor with the following criteria:
– A = number of input regions with at least 1 occurrence of that particular transcription factor.
– B = number of input regions with no occurrence of that particular transcription factor.
– C = number of random background regions with at least 1 occurrence of that particular transcription factor.
– D = number of random background regions with no occurrence of that particular transcription factor.
After performing the Fisher’s exact test with the variables above, the results for all transcription factors are corrected for multiple testing with the Benjamini-Hochberg procedure.
2. Gene-associated regions vs. Non-gene-associated regions: In some analyses, we want to check whether a group of regions associated with genes of interest (e.g. up-regulated genes) is enriched for some transcription factors compared with regions that are not associated with those genes. In this case we perform a gene-region association to divide the input regions into these two groups. This association considers promoter-proximal regions, gene bodies and distal regions. After the association, we perform a Fisher’s exact test followed by multiple testing correction, as in the previous analysis type.
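Both analysis types rely on the same statistics, so a small worked sketch may help. It builds the 2x2 table from the counts A, B, C and D defined above, computes a Fisher's exact p-value, and applies the Benjamini-Hochberg correction. The test is assumed here to be one-sided (enrichment in the first group); the document does not state which alternative is used. The function names and the example p-values are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Probability, with all margins fixed, of seeing at least `a` regions with a
    motif occurrence in the first group (hypergeometric upper tail).
    """
    group1 = a + b        # regions in the first group (e.g. input regions)
    with_hit = a + c      # regions with at least one occurrence, both groups
    n = a + b + c + d
    upper = min(group1, with_hit)
    numerator = sum(
        comb(group1, k) * comb(n - group1, with_hit - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n, with_hit)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# A=3 and B=1 in the first group, C=1 and D=3 in the background (toy numbers).
p = fisher_exact_greater(3, 1, 1, 3)
print(round(p, 4))  # 0.2429, i.e. 17/70

# One made-up p-value per transcription factor, corrected together.
corrected = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

With realistic region counts the same functions apply unchanged, because Python integers do not overflow when computing the binomial coefficients.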
The program automatically chooses the analysis type based on the input. If only genomic regions are given, only analysis 1 is performed. If both genomic regions and gene lists are given, both analyses are performed.