Tutorial for genomic region test

Basic

The hg19 genome data must already be configured (see How to configure genome data).

The genomic region test requires one RNA sequence and a set of target regions of interest:

  • RNA sequence: A file in FASTA format. For known lncRNAs, you can download their sequences from the UCSC Genome Table Browser.
  • Target regions: A BED file, which contains the regions of interest.
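Before running the test, it can help to sanity-check both input files. The following is a minimal shell sketch and not part of TDF: the function names check_fasta and check_bed are hypothetical, and the checks are deliberately basic. They assume that a valid FASTA file starts with a ">" header line, and that every BED record has at least three whitespace-separated columns with a numeric start smaller than a numeric end.

```shell
#!/bin/sh
# Hypothetical helpers (not part of RGT/TDF) for basic input checks.

# Succeeds if the first line of the file is a FASTA header (starts with ">").
check_fasta() {
    head -n 1 "$1" | grep -q '^>'
}

# Succeeds if every record has at least 3 columns, numeric start/end,
# and start < end. "track", "browser" and "#" lines are skipped.
check_bed() {
    awk '
        /^(track|browser|#)/ { next }
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad = 1; exit }
        END { exit bad + 0 }
    ' "$1"
}

# Usage with the example data:
#   check_fasta terc.fasta && check_bed terc_peaks.bed && echo "inputs look OK"
```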

Example data

To demonstrate the genomic region test, we use TERC as an example. In the example files (download here), you can find the folder “TERC_hg19”. It contains the following files:

  • terc.fasta: RNA sequence of TERC in FASTA format.
  • terc_peaks.bed: Target regions of TERC.
  • Nregions_hg19.bed: The regions of the genome that consist of “N” letters, used as a mask during randomization.

You can use the following command to see all the arguments related to the genomic region test:

rgt-TDF regiontest

Analysis with default settings

Given the RNA sequence and the regions of interest, it is easy to perform the test with the default arguments.

Command:

rgt-TDF regiontest -r terc.fasta -bed terc_peaks.bed -rn TERC -organism hg19 -o genomic_region_test/TERC/

where -r is the lncRNA sequence, -bed is the BED file with the regions of interest, -rn defines the lncRNA name, -organism defines the organism, and -o is the output directory.

The result web pages and graphics are then stored in genomic_region_test/TERC/.

Open genomic_region_test/TERC/index.html to see all the results and graphics (see Demo here).

Result Interface

There are three main pages for each test.

  1. RNA page: Shows statistics and graphics of the candidate DNA binding domains (DBDs).
  2. Target region page: Shows statistics of the DNA binding sites (DBSs) for all target regions, together with rankings.
  3. Parameters page: Shows the parameters used by TDF.

The main page is the RNA page, which displays the main statistics of the candidate DBDs. Here, you can find which DBDs bind significantly to the target regions.

[Figure: screenshot of the RNA page]

The second page shows the statistics of the target regions that potentially form a triple helix with the given RNA. From this page you can identify the region with the highest binding potential and its associated gene.

[Figure: screenshot of the target region page]

The third page lists all parameters used in the test.

[Figure: screenshot of the parameters page]

Advanced options for the genomic region test

We describe a few relevant options of TDF here. See the tool usage for a description of all options.

How to change the number of randomizations?

By default, the randomization is performed 10,000 times. You can change this number with the -n argument.

rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -n 5000 -o genomic_region_test/TERC_5000

How to add a mask to improve the selection of random regions?

By default, the randomization in the genomic region test rearranges the given target regions to random positions across the whole genome. This may be problematic when the target regions relate only to certain segments rather than to the whole genome. Providing a mask that filters out the unrelated segments of the genome avoids this bias.

To use Nregions_hg19.bed as a mask, run the following command:

rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -o genomic_region_test/TERC_f/ -f Nregions_hg19.bed
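To get a feeling for how much of the genome a mask covers, you can sum the lengths of its intervals. The following is a minimal shell sketch, not a TDF feature: the function name bed_total_bases is hypothetical, and the result is only a true base count under the assumption that the intervals in the mask file do not overlap.

```shell
#!/bin/sh
# Hypothetical helper (not part of RGT/TDF): print the total number of
# bases covered by a BED file, assuming non-overlapping intervals.
# "track", "browser" and "#" lines are skipped.
bed_total_bases() {
    awk '
        !/^(track|browser|#)/ && NF >= 3 { total += $3 - $2 }
        END { printf "%.0f\n", total }
    ' "$1"
}

# Usage with the example data:
#   bed_total_bases Nregions_hg19.bed
```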

How to change minimum length and error tolerance of triple helix binding sites

When there are too few predicted binding sites to analyze, you can decrease the minimum length (-l) or increase the percentage error tolerance (-e) as follows:

rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -o genomic_region_test/TERC/ -l 15 -e 20
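If you want to compare several settings, it is convenient to write each run to its own output directory. The sketch below only prints the commands instead of running them. The function name tdf_sweep_commands, the output directory naming scheme, and the values 20 for -l and 10 for -e are illustrative assumptions, not recommended settings; only -l 15 and -e 20 come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one rgt-TDF command per (-l, -e) combination.
# The value grid and the directory naming are illustrative assumptions.
tdf_sweep_commands() {
    for l in 15 20; do
        for e in 10 20; do
            printf 'rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -l %s -e %s -o genomic_region_test/TERC_l%s_e%s/\n' \
                "$l" "$e" "$l" "$e"
        done
    done
}

# Usage: review the printed commands first, then run them, e.g.
#   tdf_sweep_commands
#   tdf_sweep_commands | sh
```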

How to output target regions and DBSs in BED format

Sometimes you may want to export the regions of interest in BED format for further investigation. TDF can output the target regions and the DBSs on them when you add the -obed argument:

rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -o genomic_region_test/TERC/ -obed

The following BED files will appear in the output directory:

  • TERC_target_region_dbs.bed: The target regions that contain at least one DBS.
  • TERC_dbss.bed: All DBSs between lncRNA and target regions.

The prefix TERC in the file names is taken from the name of the output directory.
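A quick first look at these files is to count their records and compare them with the input. The following is a minimal shell sketch, not part of TDF: the function name bed_count is hypothetical, and the usage lines assume that the files sit directly in the output directory under the names listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of RGT/TDF): count the records of a BED
# file, skipping "track", "browser" and "#" lines.
bed_count() {
    awk '!/^(track|browser|#)/ && NF >= 3 { n++ } END { print n + 0 }' "$1"
}

# Usage: how many of the input regions contain at least one DBS,
# and how many DBSs were found in total?
#   bed_count terc_peaks.bed
#   bed_count genomic_region_test/TERC/TERC_target_region_dbs.bed
#   bed_count genomic_region_test/TERC/TERC_dbss.bed
```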

Remove temporary files

TDF generates some temporary files, such as .fa and .txp files. To save storage space, you can have them removed by adding the -rt argument:

rgt-TDF regiontest -r terc.fasta -rn TERC -bed terc_peaks.bed -organism hg19 -o genomic_region_test/TERC_rt/ -rt