Home - Pathogen Genomics Training

Before you being this section, navigate to the variant_calling folder. You will use this folder and its contents to learn and practice this section.

Overview

In this section, we will learn how to map sequence reads to a reference sequence and then identify the variants (both single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels)) in the sequence data. We will then use the variants identified to generate a pseudogenome which is created by replacing the bases in reference genome by variants identified at corresponding positions.

Tool(s)

We will use the tool Snippy which is an integrated pipeline that maps reads to a reference genome and finds variants between the reference genome and the sequence reads. It will use as many CPUs as you can give it on a single computer. It is designed with speed in mind, and produces a consistent set of output files in a single folder.

You can snippy from a docker repository using the commands:

docker pull staphb/snippy

Alignment

Snippy requires only three inputs:

reference genome in FASTA format,
sequence read file (s) in FASTQ format,
a folder to put in the results

The following command is used to run snippy:

docker_run staphb/snippy snippy --outdir GPSC46_folder --R1 17150_4#79_1.fastq.gz --R2 
17150_4#79_2.fastq.gz --ref Reference_sequence_GPSC46.fa --cpus 2 --ram 2 --force --quiet

An explanation of these commands are as follows:

docker_run: is a function to start a container. The function includes the following flags: docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data "$@". To understand the docker_run function read the section [Data, Tools and Computational Platforms (Docker)]

staphb/snippy: is the docker image

snippy: is the tool

--outdir: specified the output directory which is GPSC46_folder

--R1: specifies the first read which is 17150_4#79_1.fastq.gz

--R2: specifies the first read which is 17150_4#79_2.fastq.gz

--ref: specifies the reference genome which is Reference_sequence_GPSC46.fa

--cpus: specifies maximum number of CPU cores to use = 2

--ram: specifies that the RAM is kept under 2 GB

--force: force overwrite of existing output folder

--quiet: no screen output

There should now be a new folder called GPSC46_folder. Now lets cd to this folder and list ls -alh the content of the directory to check we have our new files, and also check out their sizes.

A description of the output files are as follows:

ref.fa: FASTA version/copy of the reference

ref.ra.fai: Index for the reference file

snps.aligned.fa: A version of the reference but with (-) at position with depth=0 and N for 0 < depth < --mincov (does not have variants)

snps.bam: The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.

snps.bam.bai: Index for the .bam file

snps.bed: The variants in Browser Extensible Data (BED) format

snps.consensus.fa: A version of the reference genome with which contains all contains all high quality variants

snps.consensus.subs.fa: A version of the reference genome which contains only high quality SNPs (no INDELS)

snps.tab: A simple tab-separated summary of all the variants

snps.csv: A comma-separated version of the .tab file

snps.txt: Tab-separated columnar list of alignment/core-size statistics

snps.filt.vcf: The filtered variant calls from Freebayes

snps.raw.vcf: The unfiltered variant calls from Freebayes

snps.vcf: Multi-sample VCF file with genotype GT tags for all discovered alleles

snps.vcf.gz: Compressed .vcf file

snps.vcf.gz.csi: Index for the .vcf.gz via bcftools index

snps.gff: The variants in GFF3 format

snps.html: A HTML version of the .tab file

snps.log: A log file with the commands run and their outputs

Snippy has done a number of things. You can examine the log file to see exactly what snippy has done. So type:

less snps.log

The output is as shown in the table below:

Press "q" to exit.

You can also use grep to retrieve the relevant lines for each command from the log file.

For example: To see the exact snippy command you entered earlier type the command

grep outdir snps.log

You will get this output

grep bwa snps.log

You will get this output

The command "bwa mem -Y -M -R '@RG\tID:GPSC46_folder\tSM:GPSC46_folder' -t 2 reference/ref.fa /data/17150_4#79_1.fastq.gz /data/17150_4#79_2.fastq.gz | samclip --max 10 --ref reference/ref.fa.fai | samtools sort -n -l 0 -T /tmp --threads 0 -m 1000M | samtools fixmate -m --threads 0 - - | samtools sort -l 0 -T /tmp --threads 0 -m 1000M | samtools markdup -T /tmp --threads 0 -r -s - - > snps.bam" is a combination of several commands combined using pipes (|). A description of this command is as follows:

bwa mem -Y -M -R '@RG\tID:GPSC46_folder\tSM:GPSC46_folder' -t 2 reference/ref.fa /data/17150_4#79_1.fastq.gz /data/17150_4#79_2.fastq.gz: This command uses bwa to map the sequence reads to the reference genome using bwa mem.

samclip --max 10 --ref reference/ref.fa.fai: This command uses samtools to clip/remove unaligned reads.

samtools sort -n -l 0 -T /tmp --threads 0 -m 1000M: This command tells samtools to fill in mate coordinates and insert size fields.

samtools fixmate -m --threads 0 - -: This command tells samtools to fill in mate coordinates and insert size fields.

samtools sort -l 0 -T /tmp --threads 0 -m 1000M: This command tells samtools to sort based on chromosome number and coordinates

samtools markdup -T /tmp --threads 0 -r -s - - : This command uses samtools to remove all the duplicates and also print some basic statistics about the results file.

> snps.bam: snps.bam is the outputfile

To see samtools processing type the command

grep samtools snps.log

You will get this output

To see freebayes processing type the command

grep freebayes snps.log

You will get this output

This command uses freebayes to identify potential variants with coordinates in variant call file (VCF) format

To see bcftools processing type the command

grep bcftools snps.log

You will get this output

The command bcftools view --include 'FMT/GT="1/1" && QUAL>=100 && FMT/DP>=10 && (FMT/AO)/(FMT/DP)>=0' snps.raw.vcf | vt normalize-r reference/ref.fa - | bcftools annotate --remove '^INFO/TYPE,^INFO/DP,^INFO/RO,^INFO/AO,^INFO/AB,^FORMAT/GT,^FORMAT/DP,^FORMAT/RO,^FORMAT/AO,^FORMAT/QR,^FORMAT/QA,^FORMAT/GL' > snps.filt.vcf uses bcftools to filter the variants identified above using fixed thresholds

The command bcftools consensus --sample GPSC46_folder -f reference/ref.fa -o snps.consensus.fa snps.vcf.gz uses bcftools to create a pseudogenome by replacing the bases in reference genome by the filtered substitutions(SNPs) identified above.

Viewing output

snps.vcf

head -n30 snps.vcf

This command prints the first 30 lines of the snps.vcf file.

The first 27 lines here are ‘headers’ and contain information about what has been done to call the variants, and helps you to interpret what different columns mean. All these lines begin with **"#"**

The last 3 lines are individual variants identified, one per line. Variants columns are labelled **#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GPSC46_folder**

You will get this output

snps.tab

We can view a slightly easier summary of these variants in the snps.tab file

head -n5 snps.tab

This command prints the first 5 lines of the snps.vcf file. In this file, we have not provided gene information, so only the first 6 columns are relevant.

You will get this output

Pseudogenomes

Snippy has also created a fasta file with our pseudosequence genome which is made by replacing the reference bases with the snps identified.

head snps.consensus.subs.fa

This command prints the first few lines of the file "snps.consensus.subs.fa".

You will get this output

Practice exercise

Now let us perform variant calling for a second genome.

1. There are two additional fastq files in the directory (15608_3#13_1.fastq.gz and 15608_3#13_2.fastq.gz). Now run snippy on this pair of fastqs and name the output directory as “GPSC46_folder2”.

Write the code:

2. Examine the results of snippy in your new folder using grep, ls and less

Handling multiple genomes - creating multiple sequence alignment

Running a set of isolate sequences (reads or contigs) against the same reference, you can use the snippy-multi script.

Navigate to the multiple folder and explore its contents. To execute the snippy-multi script, you will need:

Paired end reads (single-end reads or assembled contigs)
Tab separated input file (which contains a list of the forward and reverse reads in the following format: ID, names of R1 reads and names of R2 reads)
Reference sequence

Explore the tab separated input file; it should have the following format

Now lets us use snippy-multi to generate a multiple sequence alignment file.

First run this command to generate an output script

docker_run staphb/snippy snippy-multi list.txt --ref Reference_sequence_GPSC46.fa --cpus 2 > 
runme.sh

An explanation of this command is as follows

staphb/snippy: is the tool

list.txt: Tab separated input file (which contains a list of the forward and reverse reads in the following format: ID, names of R1 reads and names of R2 reads)

--ref: specifies the reference genome which is Reference_sequence_GPSC46.fa

--cpus: specifies maximum number of CPU cores to use = 2

> runme.sh: is the output script

You will get this output

You can check that the script makes sense by running the command:

less runme.sh

You will get this output if the script is correct

Now, generate the alignment by running the command:

docker_run staphb/snippy sh ./runme.sh

An explanation of this command is as follows

staphb/snippy: is the tool

sh: is the command name of the Bourne shell, the standard command language interpreter of Unix and many Unix-like operating systems, including Linux

/.runme.sh: the output script in the current directory (/.)

Note: the tool snippy-multi will also run snippy-core to generate the core genome SNP alignment files core.*

Snippy has now created a number of files, including a ‘core SNP alignment’

Run this command to check output

ls -l core.*

An explanation of this command is as follows

ls -l: is the command for long listing all items with core as prefix (core.*)

You will get this output

Viewing output

We have various files that summarise our variants, e.g., run this command to check the content of some these files:

head core.tab

And our multiple sequence alignment containing all genomes is in the file (core.full.aln). View this file:

head core.full.aln

You will get this output

This file “core.full.aln” masks sequences with low confidence in different ways, but for some applications we want everything masked in the same way. Let’s change that so anything uncertain is marked as ’N’ using the snipp-clean_full_aln script that comes with snippy.

docker_run staphb/snippy snippy-clean_full_aln core.full.aln > clean.full.aln

View this file:

head clean.full.aln

You will get this output

Practice exercise

Now let us perform variant calling for Streptococcus agalactiae. Move to the S.agalactiae folder which is in the Variant_calling folder. Run snippy on this pair of fastq files (20280_5#33_1.fastq.gz and 20280_5#33_2.fastq.gz) and name the output directory as “gbs_folder”. There is a multiple folder within the directory - now run snippy on these reads.

Write the code:

Examine the results of snippy in your new folder using grep, ls and less

Generate multiple sequence alignment

Further reading: Snippy