Tutorial¶

First steps¶

Installing GAMtools¶

The first step in the GAMtools tutorial is to make sure that GAMtools is properly installed. Try to run gamtools --help and make sure that you get the following ouput:

$ gamtools --help
usage: gamtools [-h]
              {call_windows,convert,enrichment,matrix,permute_segregation,process_nps,resolution,select}
              ...

If this command gives you an error message, it is likely that GAMtools has not been installed correctly. Please ensure you have followed the steps outlined in the Installation guide.

Downloading the tutorial data¶

Once GAMtools is working correctly, you need to download some example data to work with during the tutorial. The tutorial data is located on the GAMtools website. Download the tutorial data (e.g. by using wget), extract it and cd into the newly created directory. The directory should contain a folder called fastqs and a file called clean.sh.

$ wget http://gam.tools/tutorial_data.tar.gz
$ tar zxvf tutorial_data.tar.gz
$ cd gamtools_tutorial
$ ls
clean.sh  fastqs/

The fastqs folder contains sequencing data from 100 separate nuclear profiles (NPs):

$ ls fastqs/
NP_001.fq.gz  NP_026.fq.gz  NP_051.fq.gz  NP_076.fq.gz
NP_002.fq.gz  NP_027.fq.gz  NP_052.fq.gz  NP_077.fq.gz
NP_003.fq.gz  NP_028.fq.gz  NP_053.fq.gz  NP_078.fq.gz
NP_004.fq.gz  NP_029.fq.gz  NP_054.fq.gz  NP_079.fq.gz
NP_005.fq.gz  NP_030.fq.gz  NP_055.fq.gz  NP_080.fq.gz
...
NP_025.fq.gz  NP_050.fq.gz  NP_075.fq.gz  NP_100.fq.gz

These files are the primary raw output of a GAM experiment. The first thing we need to do with the sequencing data is to “map” it to a genome. The example data comes from mouse embryonic stem cells, so we need to map it to the mouse genome, which we will do using bowtie2. If you already have bowtie2 and a mouse genome assembly installed and configured on your local machine, you can skip the next step (mouse assembly mm9 is preferred, but any other assembly should work with this tutorial).

Configuring bowtie2¶

If you have not yet installed bowtie2, please follow the installation instructions on the bowtie2 homepage. Once you have bowtie installed, verify that everything is working correctly:

$ bowtie2 --version
/home/rob_000/bowtie2-2.2.9/bowtie2-align-s version 2.2.9
64-bit
Built on Windows8
30 Apr 2016 18:13:39

We next need to provide the sequence of the mouse genome for bowtie to map against. If you wish, you can download and configure the full mouse mm9 “index” from Illumina. However, the 100 sequencing datasets provided as part of the tutorial only contain sequencing data from a small region of chromosome 19, so you can also use a special truncated index containing only the sequence of mouse chromosome 19. This will allow bowtie to run much faster whilst using less RAM, and is perfectly sufficient for completing this tutorial. If you wish to use the tutorial index, download it from the GAMtools website, extract it to the same folder as fastqs and configure bowtie to use the new truncated index:

$ wget http://gam.tools/tutorial_index.tar.gz
$ tar zxvf tutorial_index.tar.gz
$ ls
clean.sh fastqs/ genome/
$ export BOWTIE2_INDEXES=$(pwd)/genome/
$ ls $BOWTIE2_INDEXES
genome.1.bt2  genome.3.bt2  genome.rev.1.bt2  chr19.size
genome.2.bt2  genome.4.bt2  genome.rev.2.bt2

Mapping the sequencing data and calling positive windows¶

The GAMtools command used for mapping NP sequencing data is gamtools process_nps. The process_nps command has a lot of different parameters and options, you can use the --help flag to get a full description of all the available parameters. Further information about the process_nps command can also be found on the process_nps page.

$ gamtools process_nps --help
usage: gamtools process_nps [-h] -g GENOME_FILE [-o OUPUT_DIRECTORY]
                            [-f FITTINGS_DIRECTORY] [-d DETAILS_FILE] [-i]
                            [-b] [-c] [-w WINDOW_SIZE [WINDOW_SIZE ...]] [-m]
                            [-s MATRIX_SIZE [MATRIX_SIZE ...]]
                            [--qc-window-size QC_WINDOW_SIZE]
                            [--additional-qc-files [ADDITIONAL_QC_FILES [ADDITIONAL_QC_FILES ...]]]
                            [-q MINIMUM_MAPQ] [--doit-db-file DEP_FILE]
                            [--doit-backend {sqlite3,json,dbm}]
                            [--doit-verbosity {0,1,2}]
                            [--doit-reporter {json,console,zero,executed-only}]
                            [--doit-process NUM_PROCESS]
                            [--doit-parallel-type {process,thread}]
                            INPUT_FASTQ [INPUT_FASTQ ...]

For now, we can just use the default options. That means that all we need to specifiy is a genome file (using -g/--genome-file) and a list of input fastq files:

$ gamtools process_nps -g genome/chr19.size fastqs/*.fq.gz

This tells GAMtools to use the genome file genome/chr19.size .You will have this file if you downloaded the special truncated index. If you are using your own mouse genome index, you will have to specify your own genome file (which is usually named something like mm9.chrom.sizes). The next argument tells GAMtools to process all of the files with the extension “.fq.gz” in the folder called “fastqs”. When you run the command, GAMtools will start mapping the sequencing data, and you should see an output like this:

$ gamtools process_nps -g genome/chr19.size fastqs/*.fq.gz
-- Creating output directory
.  Mapping fastq:fastqs/NP_025.fq.gz
.  Mapping fastq:fastqs/NP_017.fq.gz
.  Mapping fastq:fastqs/NP_065.fq.gz
.  Mapping fastq:fastqs/NP_014.fq.gz
.  Mapping fastq:fastqs/NP_090.fq.gz
.  Mapping fastq:fastqs/NP_078.fq.gz

GAMtools will then proceed to map all 100 individual sequencing files to the mouse genome. This will take around 5 minutes if you are using the truncated index and a moderately fast computer. If you are using your own full mouse genome index, it may take a little longer. Once it has mapped the files, GAMtools will sort the mapped files, remove PCR duplicates and create an index for fast data retrieval.

The final steps are to compute the number of reads from each NP that overlap each 50kb window in the supplied genome file, and then to use this read coverage count to determine which of the windows was present in the original NP. After performing this “window calling” step, gamtools produces a file called segregation_at_50kb.table. This file contains one row per 50kb window, and one column per NP:

# Show the first 10 rows and first 5 columns of the segregation table
$ head segregation_at_50kb.table | cut -f 1-5
chrom   start   stop    fastqs/NP_027.rmdup.bam       fastqs/NP_020.rmdup.bam
chr19   0       50000   0       0
chr19   50000   100000  0       0
chr19   100000  150000  0       0
chr19   150000  200000  0       0
chr19   200000  250000  0       0
chr19   250000  300000  0       0
chr19   300000  350000  0       0
chr19   350000  400000  0       0
chr19   400000  450000  0       0

For each NP column, 0 indicates that the window was not present in the NP, whereas 1 indicates that the window was present. This table is the crucial and most important output of a GAM experiment - all further downstream analysis will generally be based on the segregation table.

Producing proximity matrices¶

Now that we have produced a segregation table at 50kb resolution, we can use it to calculate a proximity matrix, using the gamtools matrix command. As for the process_nps command, the matrix command has a lot of different options, which can be explored further using the --help flag or on the gamtools matrix page.

$ gamtools matrix --help
usage: gamtools matrix [-h] -r REGION [REGION ...] -s SEGREGATION_FILE
                       [-f {csv.gz,txt,csv,txt.gz,npz}]
                       [-t {cosegregation,linkage,dprime}] [-o OUTPUT_FILE]

optional arguments:
  -h, --help            show this help message and exit
  -r REGION [REGION ...], --regions REGION [REGION ...]
                        Specific genomic regions to calculate matrices for. If
                        one region is specified, a matrix is calculated for
                        that region against itself. If more than one region is
                        specified, a matrix is calculated for each region
                        against the other. Regions are specified using UCSC
                        browser syntax, i.e. "chr4" for the whole of
                        chromosome 4 or "chr4:100000-200000" for a sub-region
                        of the chromosome.
  -s SEGREGATION_FILE, --segregation_file SEGREGATION_FILE
                        A segregation file to use as input
  -f {csv.gz,txt,csv,txt.gz,npz}, --output-format {csv.gz,txt,csv,txt.gz,npz}
                        Output matrix file format (choose from: csv.gz, txt,
                        csv, txt.gz, npz, default is txt.gz)
  -t {cosegregation,linkage,dprime}, --matrix-type {cosegregation,linkage,dprime}
                        Method used to calculate the interaction matrix
                        (choose from: cosegregation, linkage, dprime, default
                        is dprime)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output matrix file. If not specified, new file will
                        have the same name as the segregation file and an
                        extension indicating the genomic region(s) and the
                        matrix method

We can start by asking for the proximity matrix for our region of interest in png format:

$ gamtools matrix -s segregation_at_50kb.table \
> -r chr19:10,000,000-15,000,000 -o my_matrix.png
starting calculation for chr19:10,000,000-15,000,000
region size is: 100 x 100 Calculation took 1.05s
Saving matrix to file my_matrix.png
Done!
$ open my_matrix.png

You should see an image file that looks like this:

Note that the example data for this tutorial only covers this specific region of chromosome 19, so if you specify a larger or different region you will get some strange looking results:

$ gamtools matrix -s segregation_at_50kb.table \
> -r chr19:8,000,000-17,000,000 -o larger_matrix.png
starting calculation for chr19:8,000,000-17,000,000
region size is: 180 x 180 Calculation took 3.47s
Saving matrix to file larger_matrix.png
Done!
$ open larger_matrix.png

By default, GAMtools produces proximity matrices using the normalized linkage disequilibrium (or D’). In this case, it first calculates how many times each pair of windows are found together in the same NP, and then normalizes the matrix according to how many times each window is detected across the collection of NPs. You can create raw, un-normalized co-segregation matrices by specifying the cosegregation option using the -t/--matrix-type flag:

$ gamtools matrix -s segregation_at_50kb.table \
> -r chr19:10,000,000-15,000,000 -o cosegregation_matrix.png \
> -t cosegregation
starting calculation for chr19:10,000,000-15,000,000
region size is: 100 x 100 Calculation took 1.05s
Saving matrix to file cosegregation_matrix.png
Done!
$ open cosegregation_matrix.png

Working at different resolutions¶

If we want to produce a proximity matrix at a resolution other than 50kb, we first need to calculate a segregation table at that resolution. We can generate another segregation table using the process_nps command, specifying the resolution using the -w/--window-sizes flag. For example at 30kb resolution:

$ gamtools process_nps -w 30000 -g genome/chr19.size fastqs/*.fq.gz
-- Creating output directory
-- Mapping fastq:fastqs/NP_025.fq.gz
-- Mapping fastq:fastqs/NP_017.fq.gz
-- Mapping fastq:fastqs/NP_065.fq.gz
-- Mapping fastq:fastqs/NP_014.fq.gz
-- Mapping fastq:fastqs/NP_090.fq.gz
-- Mapping fastq:fastqs/NP_078.fq.gz
...
...
...
.  Getting coverage:30kb windows
.  Calling positive windows:30kb

Notice that all the lines except the last two begin with --, whereas the last two lines begin with .. The -- indicates that GAMtools realized that these tasks have already been completed and therefore do not need to be re-run. When we re-calculate a segregation table at a new resolution, we don’t need to remap all the individual fastq files, we only need to re-compute the read depth over all 30kb windows, and then decide which 30kb windows were positive in each NP.

To create proximity matrices at the new resolution, we need to specify the new segregation table: segregation_at_30kb.table.

$ gamtools matrix -s segregation_at_30kb.table \
> -r chr19:10,000,000-15,000,000 -o 30kb_matrix.png
starting calculation for chr19:10,000,000-15,000,000
region size is: 167 x 167 Calculation took 0.047s
Saving matrix to file 30kb_matrix.png
Done!
$ open 30kb_matrix.png

Performing quality control checks¶

If you are generating your own GAM datasets, you will want to perform some checks to ensure your NPs are of sufficient quality. GAMtools will generate a table of QC parameters automatically for each NP if you use the process_nps command with the -c/--do-qc flag.

Note

Performing quality control requires a number of additional dependencies to be installed. Please ensure that gamtools test runs with no errors before continuing with this section.

Re-running the gamtools process_nps command with the --do-qc flag will instruct GAMtools to run a number of additional tasks. Your output should look something like this:

$ gamtools process_nps --do-qc -g genome/chr19.size fastqs/*.fq.gz
-- Creating output directory
-- Mapping fastq:fastqs/NP_025.fq.gz
-- Mapping fastq:fastqs/NP_017.fq.gz
-- Mapping fastq:fastqs/NP_065.fq.gz
...
...
...
.  Creating QC parameters file with default values
.  Getting mapping stats
.  Getting segregation stats
.  Running fastqc:fastqs/NP_042.fq.gz
.  Running fastqc:fastqs/NP_043.fq.gz
...
.  Running fastqc:fastqs/NP_070.fq.gz
.  Running fastq_screen:fastqs/NP_063.fq.gz
.  Running fastq_screen:fastqs/NP_050.fq.gz
...
.  Running fastq_screen:fastqs/NP_081.fq.gz
.  Getting quality stats
.  Getting contamination stats
.  Merging stats files
.  Finding samples that pass QC
.  Filtering samples based on QC values:50kb

By default, GAMtools generates several QC files, each containing different information about the collection of NPs:

The number of sequenced, mapped, and unique (i.e. excluding PCR duplicates) reads are saved in mapping_stats.txt

Statistics regarding the number and distribution of positive windows are saved in segregation_stats.txt

Statistics regarding the sequencing quality scores and the number of mono- and di-nucleotide repeat containing reads are calculated by fastqc and saved to quality_stats.txt

Statistics regarding the percentage of reads mapping to different genomes (i.e. contaminating reads) are calculated by fastq_screen and saved to contamination_stats.txt

These statistics files are merged together and the resulting table containing all the different QC parameters is saved to merged_stats.txt

Once the merged stats table has been saved, GAMtools will attempt to filter out “poor quality” NPs, and generates a file called samples_passing_qc.txt containing only high-quality NPs. GAMtools filters out NPs which match any rules in the qc_parameters.cfg file, which is created with some default rules if it does not exist. Finally, GAMtools creates new segregation tables that exclude poor-quality NPs. In our case, this file will be called segregation_at_50kb.passed_qc.table. You can use this new segregation table to re-generate the proximity matrices (see Producing proximity matrices).