


Overview

Biological sequence information (nucleotides and amino acids) is stored in various file formats for easy access and analysis. Some of these files are in human-readable (text) format and others are in machine-readable (binary) format.


Fasta files

A fasta file is a text file that contains one or more sequences. Though it is not mandatory, fasta files generally have an extension of .fa, .fas, .fsa, .fna or .fasta. Each sequence has a one-line header starting with “>”, followed by the nucleotide (or amino acid) sequence. The sequence can be on a single line or span multiple lines.
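The header/sequence layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a full parser: the function name `parse_fasta` and the example records are made up for this sketch, and it assumes headers are unique.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line    # join multi-line sequences
    return records


# Hypothetical two-record example: one multi-line DNA entry, one protein entry.
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKVL
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

For real data, an established library parser is usually a safer choice than a hand-written one.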


Fastq files

This section is also described in the F2 module.
As sequencing technologies evolved to produce millions of “short reads”, storing the information in multiple files became difficult to manage and also consumed more disk space. To address these difficulties, the fastq (fast Q) format, which was formally described in 2009, came into use. Fastq is now the de facto standard file format for high-throughput sequence data from next-generation sequencers.
Fastq is a text file format (human-readable). It has 4 lines for each sequence entry.

  • Header: starts with “@”
  • Sequence
  • Optional field: starts with “+”
  • Quality scores

Because each quality score corresponds to one base, the 2nd and 4th lines always have the same length.
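The 4-line record structure can be read with a short Python generator. This is a sketch under the assumption that every record occupies exactly four lines (no wrapped sequences); `FastqRecord`, `parse_fastq` and the example reads are names and data invented for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header text without the leading "@"
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per 4-line entry."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical reads; note that a quality line may itself start with "@".
example = [
    "@read1 hypothetical example",
    "ACGTAC",
    "+",
    "IIII5!",
    "@read2",
    "GGCC",
    "+read2",
    "@@@@",
]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.quality)
```

Reading in fixed groups of four lines, rather than searching for “@”, avoids mistaking a quality line that begins with “@” for a header.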


  • 1st line is the sequence identifier, which contains information about the sequencing process and is unique for each read
  • 2nd line is the sequence of bases read during the sequencing process
  • 3rd line starts with a “+” sign and is the separator; it can sometimes contain the same information as the identifier, minus the “@” symbol
  • 4th line contains the quality scores. The quality scores represent the probability of a base being read incorrectly by the sequencer: the higher the quality score, the lower the probability that the base call is incorrect. The quality scores are represented by ASCII characters and can be used to infer the quality of bases. You can read more about the FastQ format here. A widely accepted threshold for the quality score is 20, below which a base is considered to be of low quality and is usually trimmed or removed from the sequence data before downstream analysis. A quality score of 20 means the probability of an incorrect base call is 1 in 100.
  • Note: The FastQ files generated by sequencers are usually gzip-compressed because of their large size. Decompressing them is rarely necessary, since downstream bioinformatics analysis tools can almost always read the compressed files directly. Whether the sequencing run was ‘single end’ or ‘paired-end’ determines how many files you have. For Illumina reads, there is a single ‘fastq.gz’ file for single end sequencing and two files for paired-end sequencing, denoting read1 (_R1.fastq.gz) and read2 (_R2.fastq.gz).
    Most often, paired-end data is stored in two files that have “R1” and “R2” in their file names. The order of the reads in paired-end files corresponds to their pairing order, i.e. entry 1 in the R1 file is sequenced from the same template as entry 1 in the R2 file, and so on. Therefore, the total number of sequences should always be equal in the R1 and R2 files. If an entry is removed from R1 for any reason, its corresponding entry in R2 should also be removed.
    Sometimes paired-end data is saved in interleaved fastq format, storing forward and reverse reads sequentially in a single file.
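As a rough illustration of working with compressed paired-end files, the sketch below reads two gzip-compressed FASTQ files directly with Python's standard `gzip` module and checks that the reads are in the same order. The helper names (`read_names`, `check_pairing`) and the file names in the usage note are hypothetical, and the sketch assumes 4-line records whose read names match once an optional trailing `/1` or `/2` is removed.

```python
import gzip
import itertools
from typing import Iterator


def read_names(path: str) -> Iterator[str]:
    """Yield the read name from every 4th line of a gzip-compressed FASTQ."""
    with gzip.open(path, "rt") as handle:
        for header in itertools.islice(handle, 0, None, 4):
            fields = header[1:].split()        # drop "@", split off the comment
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):    # older naming convention
                name = name[:-2]
            yield name


def check_pairing(r1_path: str, r2_path: str) -> int:
    """Return the number of read pairs, or raise if the files disagree."""
    count = 0
    pairs = itertools.zip_longest(read_names(r1_path), read_names(r2_path))
    for name1, name2 in pairs:
        if name1 is None or name2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        if name1 != name2:
            raise ValueError(f"pair {count + 1} out of sync: {name1} vs {name2}")
        count += 1
    return count
```

A call such as `check_pairing("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (file names assumed) returns the number of pairs without writing any decompressed data to disk.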

    Phred score

    A quality score on the PHRED scale is assigned to each base call. It indicates how confident we can be that the base was sequenced and identified correctly:

    $Q = -10 \log_{10}(p)$

    where p is the probability that the corresponding base call is incorrect.

    Higher Q scores indicate a smaller probability of error.

    Lower Q scores can result in a significant portion of the reads being unusable. They may also lead to increased false-positive variant calls, resulting in inaccurate conclusions.

    A quality score of 20 (Q20) represents an error rate of 1 in 100 (meaning a 100 bp sequencing read is expected to contain one error on average), with a corresponding call accuracy of 99%. At Q30 the error rate is 1 in 1,000 (99.9% call accuracy), so most reads will contain no errors or ambiguities. This is why Q30 is considered a benchmark for quality in next-generation sequencing (NGS).
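The relationship between Q and p can be checked numerically. The two helper functions below are illustrative names, not part of any library; they simply apply the PHRED formula and its inverse.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(p)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 on the Q scale corresponds to a tenfold drop in the error probability.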

    Fastq-sanger holds PHRED scores from 0 to 93, whereas fastq-illumina holds PHRED scores from 0 to 62. Rather than being written as numbers, PHRED scores are encoded as ASCII characters with codes from 33 to 126. Why 33 to 126? Because these codes are the printable, non-space ASCII characters, so each score can be represented by a single visible character. Refer to the table below.

    Based on the base character (the character that represents a PHRED score of zero), the encoding is often referred to as PHRED+33 (ASCII character !) or PHRED+64 (ASCII character @).
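Decoding a quality string amounts to subtracting the offset from each character's ASCII code. The function name `decode_quality` and the example strings are chosen for this sketch only.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string to a list of PHRED scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the offset; wrong encoding assumed?")
    return scores


print(decode_quality("!5?I"))            # PHRED+33 encoding
print(decode_quality("@Th", offset=64))  # PHRED+64 encoding
```

A negative score after subtraction is a useful hint that a PHRED+33 file is being read with the PHRED+64 offset.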

    Phred table


    The table depicts the sequencing error rate that corresponds to each Phred score. On average, the Q score should be above 30 for a read to be accepted as good.


    Sequence Alignment Map (SAM) format

    The sequence alignment map (SAM) format is a tab-delimited text format developed to store read alignments. Every SAM file has two sections:

    • Header section (optional)
    • Alignment section


    Entries in the header section always start with “@” and come before the alignment section. Each header line begins with “@” followed by a two-letter record type code, and the remaining tab-delimited fields follow the “TAG:VALUE” format, where each TAG is also a two-letter code. Commonly used record types and their tags are:

    • HD - The header line; if present, it must be the 1st line
      • SO - Sorting order of alignments (unknown (default), unsorted, queryname and coordinate)
    • SQ - Reference sequence dictionary
      • SN - Reference sequence name
      • LN - Reference sequence length
    • PG - Program
      • ID - The program ID
      • PN - The program name
      • VN - Program version number
      • CL - The command line used to run the program
    • RG - Read Group - “a set of reads that were all the product of a single sequencing run on one lane”
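A header section laid out this way can be split into record types and TAG:VALUE pairs with a few lines of Python. This is a simplified sketch: `parse_sam_header` is a name made up here, the example values (reference name, program, command line) are hypothetical, and any line not starting with “@” is treated as the start of the alignment section.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sam_header(lines: Iterable[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (record_type, {TAG: VALUE}) for each header line."""
    records = []
    for line in lines:
        if not line.startswith("@"):
            break  # the alignment section has begun
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]  # e.g. "HD", "SQ", "PG", "RG"
        if record_type == "CO":
            # comment lines hold free text rather than TAG:VALUE pairs
            records.append((record_type, {"text": "\t".join(fields[1:])}))
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        records.append((record_type, tags))
    return records


# Hypothetical header followed by one alignment line.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@PG\tID:aligner\tPN:aligner\tVN:1.0\tCL:aligner ref.fa reads.fq",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
header_records = parse_sam_header(example_sam)
for record_type, tags in header_records:
    print(record_type, tags)
```

Splitting on the first colon only keeps values that themselves contain colons, such as command lines, intact.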
    In the alignment section, each line has 11 mandatory fields, optionally followed by additional tags. These are:
    • QNAME: Read (query) name
    • FLAG: Bitwise flag recording whether the read is mapped, part of a pair, on the reverse strand, etc.
    • RNAME: Name of the reference sequence that the read aligns to
    • POS: Leftmost position (1-based) at which this alignment maps to the reference
    • MAPQ: Mapping quality (Phred-scaled probability that the mapping position is wrong)
    • CIGAR: Compact Idiosyncratic Gapped Alignment Report
    • RNEXT: Reference sequence name of the mate (the next read in the template)
    • PNEXT: Position of the mate (the next read in the template)
    • TLEN: Template length/insert size (difference in outer coordinates of paired reads)
    • SEQ: The read sequence
    • QUAL: ASCII-encoded Phred quality scores (+33)
    • TAGS: Optional data in TAG:TYPE:VALUE format, with many options, e.g. MD:Z: a string describing mismatching positions
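The field layout of an alignment line can be illustrated by splitting one line on tabs. `SamRecord`, `parse_sam_line` and the example read are hypothetical and kept deliberately minimal; the optional tags are left as raw strings.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its 11 mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


# Hypothetical alignment line with two optional tags.
example_line = (
    "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\t"
    "ACGTACGT\tIIIIIIII\tNM:i:0\tMD:Z:8"
)
record = parse_sam_line(example_line)
print(record.qname, record.flag, record.rname, record.pos, record.tags)
```

In practice a dedicated library is usually preferable for SAM/BAM parsing; the sketch only mirrors the column order listed above.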
    Let us take a close look at FLAG and CIGAR values.
    FLAG is a sum of alignment bit flags. The table below shows what each bit corresponds to.
    Bit     Decimal  Description
    0x1     1        Template having multiple segments in sequencing
    0x2     2        Each segment properly aligned according to the aligner
    0x4     4        Segment unmapped
    0x8     8        Next segment in the template unmapped
    0x10    16       SEQ being reverse complemented
    0x20    32       SEQ of the next segment in the template being reverse complemented
    0x40    64       The first segment in the template
    0x80    128      The last segment in the template
    0x100   256      Secondary alignment
    0x200   512      Not passing quality controls
    0x400   1024     PCR or optical duplicate
    0x800   2048     Supplementary alignment
    For example, FLAG 99 codes for

    64: the first segment in the template

    32: SEQ of the next segment (the mate) is reverse complemented

    2: each segment is properly aligned

    1: paired-end reads

    Because each bit is a distinct power of two, this is the only combination that gives a total of 99.
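Because FLAG is a sum of distinct bits, it can be decoded with a bitwise AND against each bit in turn. The sketch below is one way to do that; `FLAG_BITS` and `decode_flag` are names chosen for illustration, with descriptions shortened from the SAM bit definitions.

```python
from typing import List

# Bit value -> short description, following the SAM bit definitions.
FLAG_BITS = {
    0x1: "template has multiple segments (paired)",
    0x2: "each segment properly aligned",
    0x4: "segment unmapped",
    0x8: "next segment unmapped",
    0x10: "SEQ reverse complemented",
    0x20: "SEQ of next segment reverse complemented",
    0x40: "first segment in the template",
    0x80: "last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing quality controls",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """List the descriptions of every bit that is set in a FLAG value."""
    return [text for bit, text in FLAG_BITS.items() if flag & bit]


for value in (99, 147):
    print(value, decode_flag(value))
```

Flags 99 and 147 are shown together because they typically describe the two mates of one properly aligned pair.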

    Binary Alignment Map (BAM) format

    Binary Alignment Map (BAM) is the compressed, binary representation of a SAM file. It is compressed using the BGZF compression method.

    CRAM files

    CRAM files are also compressed alignment files, designed by the EBI to reduce storage space. CRAM compression is based on the reference the data is aligned to.

    Data is compressed using one of the general purpose compressors (gzip, bzip2). CRAM records are compressed using several different encoding strategies. For example, bases are reference compressed by encoding base differences rather than storing the bases themselves. External reference sequences introduce the only external dependency into the CRAM format. When external reference sequences cannot be conveniently used, the reference sequences can also be embedded within the CRAM files. However, when embedded reference sequences are used, only those reference sequence regions that have reads aligned against them are preserved in the CRAM file.

    Genbank format

    The GenBank format allows systematic storage of information about a sequence. An example of a GenBank file can be accessed here (adapted from FutureLearn). A typical GenBank file has two parts: i) an annotation section, which starts with the line beginning with the word "LOCUS", and ii) the sequence, which starts after the line beginning with "ORIGIN". The end of the GenBank record is indicated by "//".
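The LOCUS / ORIGIN / "//" layout can be illustrated by splitting a record into its two parts. This is a minimal sketch for a single record; the function name `split_genbank` and the tiny example record are invented for illustration and omit most real annotation fields.

```python
from typing import Tuple


def split_genbank(text: str) -> Tuple[str, str]:
    """Return (annotation_text, sequence) for one GenBank record."""
    annotation_lines = []
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break                      # end of the record
        if line.startswith("ORIGIN"):
            in_sequence = True         # sequence lines follow this marker
            continue
        if in_sequence:
            # each sequence line is a position number followed by base blocks
            sequence_parts.extend(line.split()[1:])
        else:
            annotation_lines.append(line)
    return "\n".join(annotation_lines), "".join(sequence_parts).upper()


# Hypothetical, heavily shortened record.
example_record = """LOCUS       EXAMPLE1        20 bp    DNA     linear   UNK
DEFINITION  Hypothetical record for illustration.
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""
annotation, sequence = split_genbank(example_record)
print(annotation)
print(sequence)
```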


    GFF3 format

    GFF3 stands for General Feature Format version 3. It is a tab-delimited file that describes the features (such as genes, CDSs and mRNAs) associated with a DNA or protein sequence. An example can be seen in the figure below.

    The file contains 9 fields:

    • Sequence ID
    • Source : algorithm used to derive the feature, such as Prodigal, Prokka, Genescan etc.
    • Feature type : details what the feature is (CDS, mRNA)
    • Feature start
    • Feature stop
    • Score : a numeric score from the algorithm used, for example an e-value for sequence similarity features
    • Strand
    • Phase : for CDS features, the number of bases (0, 1 or 2) to skip from the beginning of the feature to reach the first base of the next complete codon
    • Attributes : provides additional information about each feature
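The nine columns listed above can be read from a single line as follows. This is a sketch, not a validating parser: `GffFeature`, `parse_gff3_line` and the example feature are hypothetical, "." is treated as a missing value, and attributes are assumed to be `key=value` pairs separated by semicolons.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Parse one tab-delimited GFF3 feature line into its 9 fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    attributes: Dict[str, str] = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = unquote(value)  # values may be URL-escaped
    return GffFeature(
        seqid=cols[0], source=cols[1], feature_type=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )


# Hypothetical CDS feature.
example_feature = (
    "contig1\tProdigal\tCDS\t100\t400\t.\t+\t0\t"
    "ID=cds1;product=hypothetical%20protein"
)
feature = parse_gff3_line(example_feature)
print(feature)
```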

    Further reading

    SAM format:

    CRAM format: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/groups/ena/documents/cram_format_1.0.1.pdf

    GFF3 format: https://learn.gencore.bio.nyu.edu/ngs-file-formats/gff3-format/

    Embl file format: http://scikit-bio.org/docs/0.5.3/generated/skbio.io.format.embl.html