Configuration of Genomic Data
When RGT is installed, it automatically creates a folder that stores additional data (default: ~/rgtdata). This data includes chromosome sizes, position frequency matrices (describing transcription factor motifs), HTML scripts, etc. Some tools require data sets that are too large to ship with the installation, such as genomes and genomic annotations. This section describes how to obtain them.
Automatic Data Setup
The easiest way to obtain all data sets required by RGT is to run the setupGenomicData.py Python script inside the installed data directory. The script downloads the files from public servers and takes a few minutes. If you use macOS, make sure the wget command is available (further instructions here).
The following command will install all the necessary human genome (hg19) data sets:
cd ~/rgtdata
python setupGenomicData.py --hg19
The following command will install all the necessary mouse genome (mm9) data sets:
cd ~/rgtdata
python setupGenomicData.py --mm9
The following command will install all available data sets (hg19, hg38, mm9, mm10, zv9, and zv10):
cd ~/rgtdata
python setupGenomicData.py --all
This script has further options that can be viewed with:
python setupGenomicData.py -h
Customize RGT Data Folder
The data.config File
The data.config file contains the default data set names (inside RGT data path) used by RGT tools. It is divided into sections (with labels in brackets), such as [GenomeData] and [MotifData].
For each genome assembly, there are five fields pointing to the relevant files. You can customize these paths yourself. Below is the example for hg19:
|Field Name||Default Value||Description|
|genome||genome_hg19.fa||Sequence of assembly hg19 in FASTA format. This data set is not available upon installation. See instructions above on how to obtain this data set.|
|chromosome_sizes||chrom.sizes.hg19||Chromosome sizes file of assembly hg19.|
|gene_regions||genes_hg19.bed||Gene locations in BED format (from Gencode annotation file in GTF format).|
|annotation||gencode.v19.annotation.gtf||Gene annotation from Gencode version 19 for human in GTF format. This data set is not available upon installation. See instructions above on how to obtain this data set.|
|gene_alias||alias_human.txt||Alias file which allows for translation between multiple different gene IDs.|
You should never modify the data.config file, because every RGT installation overwrites it. Instead, customize the data.config.user file by copying a similar section from the data.config file and modifying it as needed. For example, to use data from the organism you are studying, create a section named after the genome and define all the relevant paths.
For example, here is a customized genome for Arabidopsis thaliana (TAIR10):
[tair10]
genome: path/to/genome_tair10.fa
chromosome_sizes: path/to/chrom.sizes.tair10
gene_regions: path/to/genes_tair10.bed
annotation: path/to/tair10.annotation.gtf
gene_alias: path/to/alias_tair10.txt
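Before running any tool, it can help to check that a custom section defines all five fields and that each path points to an existing file. The sketch below uses Python's standard configparser module, which accepts the "key: value" syntax shown above. The helper name missing_files is hypothetical and not part of RGT, and resolving relative paths against the folder that holds the config file is an assumption made for this example, not documented RGT behaviour.

```python
import configparser
import os

# The five fields the documentation lists for each genome section.
GENOME_FIELDS = ("genome", "chromosome_sizes", "gene_regions",
                 "annotation", "gene_alias")


def missing_files(config_path, section):
    """Return {field: problem} for fields that are undefined or point to no file.

    Hypothetical helper for illustration only; RGT does not provide it.
    """
    # interpolation=None keeps '%' characters in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips unreadable files, so check what it returns.
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)
    if not parser.has_section(section):
        raise KeyError("no section [%s] in %s" % (section, config_path))

    # Assumption: relative paths are resolved against the config file's folder.
    base_dir = os.path.dirname(os.path.abspath(config_path))
    problems = {}
    for field in GENOME_FIELDS:
        value = parser.get(section, field, fallback=None)
        if value is None:
            problems[field] = "field not defined"
            continue
        path = os.path.join(base_dir, os.path.expanduser(value))
        if not os.path.isfile(path):
            problems[field] = "file not found: " + path
    return problems
```

Calling missing_files with the path to your data.config.user file and the section name "tair10" returns an empty dictionary when every field is defined and every file exists.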
The files that should be defined include:
- Genome FASTA file: This file must contain one sequence for each chromosome. Each sequence header must be the chromosome symbol (such as “chr1” for chromosome 1). It can be obtained from several resources, including the UCSC Downloads Website.
- Gene annotation file: A BED file containing the genomic coordinates of each gene for the selected organism. It can be downloaded from, among other places, the UCSC Table Browser.
- Chromosome sizes: It is a tab-separated plain text file with two columns. The first must contain the chromosome alias and the second must contain the length of the chromosome in base pairs. It can be fetched for some organisms using the fetchChromSizes script available at the UCSC Utilities Website.
- GTF annotation file: The gene annotation in GTF format (for example, the Gencode annotation used for hg19).
- Gene ID/Symbol aliases: A tab-delimited file with three columns. Each row describes a gene and its aliases. The first column contains the gene’s ENSEMBL ID. The second column contains the gene’s official symbol (or the user’s symbol of preference). The third column contains an ampersand(&)-separated list of aliases.
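Two of the formats described in this list are simple enough to sketch in Python. The first function derives a two-column chromosome sizes file directly from a genome FASTA file, which can be useful for organisms that fetchChromSizes does not cover. The second reads the three-column alias file into a dictionary. Both function names are hypothetical and not part of RGT. The sketch assumes an uncompressed FASTA file and takes the first word of each header as the chromosome name.

```python
def chrom_sizes_from_fasta(fasta_path, sizes_path):
    """Write one '<chromosome><TAB><length>' line per FASTA sequence."""
    sizes = []  # [name, length] pairs, kept in file order
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header, e.g. "chr1".
                words = line[1:].split()
                sizes.append([words[0] if words else "", 0])
            elif sizes:
                # Sequence line: add its length to the current chromosome.
                sizes[-1][1] += len(line)
    with open(sizes_path, "w") as out:
        for name, length in sizes:
            out.write("%s\t%d\n" % (name, length))
    return [(name, length) for name, length in sizes]


def read_gene_aliases(alias_path):
    """Map each ENSEMBL ID to (symbol, [aliases]) from a three-column file."""
    aliases = {}
    with open(alias_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed rows
            ensembl_id, symbol, alias_list = fields[:3]
            # The third column is an ampersand-separated list of aliases.
            aliases[ensembl_id] = (symbol, alias_list.split("&"))
    return aliases
```

The output of chrom_sizes_from_fasta can be used as the chromosome_sizes file for a custom genome, provided the FASTA headers already follow the “chr1” naming described above.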
The following table describes the data.config path fields:
|Field Name||Default Value||Description|
|pwm_dataset||motifs||Contains the path to the motif position weight matrices (PWM) repositories.|
|logo_dataset||logos||Contains the path to the logo graphs (graphical depiction of PWMs). This data set is not available upon installation. For more information on how to create this data set click here.|
|repositories||jaspar_vertebrates,uniprobe_primary||The PWM repositories that will be used in the analyses. It is a comma-separated list of folders inside <pwm_dataset> (see this option above) folder. For information on how to add additional repositories, click here.|
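To illustrate how the repositories value relates to the pwm_dataset folder, the sketch below splits the comma-separated list and joins each name onto the motif folder. The function name repository_folders is hypothetical, and treating pwm_dataset as a folder directly inside the RGT data folder is an assumption based on the defaults shown in the table, not a description of RGT's internal code.

```python
import os


def repository_folders(data_dir, pwm_dataset="motifs",
                       repositories="jaspar_vertebrates,uniprobe_primary"):
    """Return the folder path for each PWM repository named in the list.

    Hypothetical helper; defaults mirror the table above.
    """
    # Assumption: pwm_dataset is a folder inside the RGT data folder.
    root = os.path.join(os.path.expanduser(data_dir), pwm_dataset)
    # Split the comma-separated list, ignoring stray spaces and empty items.
    names = [name.strip() for name in repositories.split(",") if name.strip()]
    return [os.path.join(root, name) for name in names]
```

Each returned path can then be checked with os.path.isdir to confirm that a repository listed in the configuration is actually present.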
RGT Data Folder Structure
After installation, the RGT data folder will contain the following data.
- Organism Folders: Currently, we provide data for Homo sapiens (hg19, hg38), Mus musculus (mm9, mm10), and Danio rerio (zv9, zv10). Inside these folders, you can find information regarding gene annotation and chromosome sizes.
- fig: Default figures and HTML style files.
- fp_hmms: Contains default hidden Markov model files for the HINT tool.
- motifs: Contains position weight matrices (PWMs) for vertebrates obtained from several different repositories.
Additional folders may be created for tool-specific data.