Designing Your Next Generation Sequencing Run


While we’ve standardized the way sequencing runs are ordered across all platforms, customization is required when determining the run parameters to achieve your experimental goals. If you’re new to high throughput sequencing and have questions about how you should design your sequencing run, fill out our free consultation form and we'll get in touch with you to help.

We highly recommend Genohub's NGS Matching Engine for determining the right amount of sequencing capacity on various instruments and for exploring different options. Simply enter your specifications to instantly see services that match your output requirements.


Type of Run – Single Read (SR) or Paired End (PE)

With single read runs, the sequencing instrument reads each fragment from one end only. Paired end runs read from one end and then start another round of reading from the opposite end. Single read runs are faster and cheaper, and are typically sufficient for profiling or counting studies such as RNA-Seq or ChIP-Seq.

Paired end runs give additional positioning information in the genome, making them a good choice for de novo genome assembly and making it easier to resolve structural rearrangements such as deletions, insertions and inversions. Experiments designed to study splice variants, epigenetic modifications (methylation) or SNPs are best served by paired end runs. While paired end runs are more costly and time consuming, you get back twice the amount of data at less than double the cost to sequence.


Read Length

Some sequencing instruments give you flexibility in choosing the number of base pairs (cycles) you can read at one time. The number of cycles corresponds to the output read length. While longer read lengths give you more accurate information on the relative positions of your bases in a genome, they are more expensive than shorter ones. 50 cycles are typically sufficient for simple mapping of reads to a reference genome, and RNA-Seq profiling or counting experiments. Read lengths greater than or equal to 100 are typically chosen for genome or transcriptome studies that require high amounts of output.


Number of Reads

During a DNA sequencing reaction, stretches of sequenced base pairs, or "reads", are generated. Each sequencing platform and instrument yields a different number of reads. Genohub's popular NGS Matching Engine automatically calculates the minimum amount of sequencing capacity on each instrument needed to yield the required number of reads or coverage for your project. The calculations are based on the reads/unit estimates advertised by instrument manufacturers, which are summarized in the table below:

Numbers of Single Reads by Instrument Manufacturer
| Platform | Instrument | Unit | Reads / Unit | Reference |
| --- | --- | --- | --- | --- |
| Illumina | HiSeq X Ten | Lane | 375,000,000 | 1 |
| Illumina | HiSeq 3000/4000 | Lane | 312,500,000 | 1 |
| Illumina | NextSeq 500 High-Output | Run | 400,000,000 | 2 |
| Illumina | NextSeq 500 Mid-Output | Run | 130,000,000 | 2 |
| Illumina | HiSeq High-Output v4 | Lane | 250,000,000 | 3 |
| Illumina | HiSeq High-Output v3 | Lane | 186,048,000 | 3 |
| Illumina | HiSeq Rapid Run | Lane | 150,696,000 | 3 |
| Illumina | HiScanSQ | Lane | 93,024,000 | 3 |
| Illumina | GAIIx | Lane | 42,075,000 | 3 |
| Illumina | MiSeq v3 | Lane | 25,000,000 | 4 |
| Illumina | MiSeq v2 | Lane | 16,000,000 | 3 |
| Illumina | MiSeq | Lane | 5,000,000 | 3 |
| Illumina | MiSeq v2 Micro | Lane | 4,000,000 | 5 |
| Illumina | MiSeq v2 Nano | Lane | 1,000,000 | 5 |
| Ion | Proton I | Chip | 60,000,000 | 6 |
| Ion | PGM 318 | Chip | 4,000,000 | 6 |
| Ion | PGM 316 | Chip | 2,000,000 | 6 |
| Ion | PGM 314 | Chip | 400,000 | 6 |
| PacBio | PacBio RS II | SMRT Cell | 47,000 | 7 |
| PacBio | PacBio RS | SMRT Cell | 22,000 | 7 |
| Roche 454 | GS FLX+ / FLX | 1 PTP | 700,000 | 8 |
| Roche 454 | GS FLX+ / FLX | 1/2 PTP | 350,000 | 8 |
| Roche 454 | GS FLX+ / FLX | 1/4 PTP | 125,000 | 8 |
| Roche 454 | GS FLX+ / FLX | 1/8 PTP | 50,000 | 8 |
| Roche 454 | GS FLX+ / FLX | 1/16 PTP | 20,000 | 8 |
| Roche 454 | GS Junior | 1 PTP | 70,000 | 8 |
| SOLiD | 5500xl W | Lane | 266,666,667 | 9 |
| SOLiD | 5500 W | Lane | 266,666,667 | 9 |
| SOLiD | 5500 | Lane | 81,500,000 | 10 |
| SOLiD | 5500xl | Lane | 81,500,000 | 10 |
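
As a rough illustration of the capacity calculation described above, the sketch below divides a required read count by the reads/unit figure and rounds up to whole units. It is not the Matching Engine's actual logic; the function name `units_needed`, the dictionary and the 12-sample project are assumptions made for the example, and only the three reads/unit values come from the table.

```python
# Reads per unit copied from three rows of the table above (manufacturer estimates).
READS_PER_UNIT = {
    "HiSeq High-Output v4 (lane)": 250_000_000,
    "MiSeq v3 (lane)": 25_000_000,
    "Ion PGM 318 (chip)": 4_000_000,
}


def units_needed(required_reads: int, reads_per_unit: int) -> int:
    """Smallest whole number of units that yields at least required_reads."""
    if required_reads <= 0 or reads_per_unit <= 0:
        raise ValueError("read counts must be positive")
    # Integer ceiling division avoids floating point rounding.
    return -(-required_reads // reads_per_unit)


# Hypothetical project: 12 samples at 30 million reads each.
required = 12 * 30_000_000
for name, per_unit in READS_PER_UNIT.items():
    print(f"{name}: {units_needed(required, per_unit)} unit(s)")
```

Real yields vary from run to run, so treating the manufacturer estimate as a fixed number is itself a simplification.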

Depth of Coverage (DNA)

A sequencing run generates reads that sample a genome randomly and independently [1]. These reads are not distributed equally across an entire genome; some bases are covered by fewer reads and some by more reads than the average coverage. Coverage refers to the average number of times a single base is read during a sequencing run. If the coverage is 100X, this means that on average each base was sequenced 100 times. The more often a base is sequenced, the more reliably it is called, which improves the quality of your data. The Lander / Waterman equation is one method for determining coverage: C = LN/G, where C is coverage, L is read length, N is the number of reads and G is the haploid genome length. Our NGS Matching Engine page takes care of this calculation for you.
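
The Lander / Waterman relationship can be worked through by hand or with a few lines of code. The sketch below computes C = LN/G and the rearranged form N = CG/L. The function names and the 3.1 Gb genome size are assumptions chosen for illustration.

```python
import math


def coverage(read_length: int, n_reads: int, genome_size: int) -> float:
    """Average coverage from the Lander / Waterman equation, C = L * N / G."""
    return read_length * n_reads / genome_size


def reads_for_coverage(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Rearranged form, N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 3.1 Gb haploid genome sequenced with 100 bp reads.
G = 3_100_000_000
print(coverage(100, 930_000_000, G))   # 30.0
print(reads_for_coverage(30, 100, G))  # 930000000
```

Because coverage is an average, some bases will still fall below the target depth, as noted above.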

Requirements for coverage will depend on your type of study and are commonly set by a scientific body or journal. We recommend reading the ENCODE Consortium guidelines.


Estimate of Coverage Requirements by Application Type
| Application Type | Coverage |
| --- | --- |
| DNA-Seq (Re-Sequencing) | 30 - 80X |
| DNA-Seq (De novo assembly) | 100X |
| SNP Analysis / Rearrangement Detection | 10 - 30X |
| Exome | 100 - 200X |
| ChIP-Seq | 10 - 40X |

For more examples see the Sequencing Coverage Guide.


Depth of Coverage (RNA)

Determining the amount of coverage needed for an RNA-Seq experiment is difficult because different transcripts are expressed at different levels, meaning that more reads will be captured from highly expressed genes while fewer reads will be captured from genes expressed at low levels. Transcriptome complexity, alternate expression, 3’ associated biases and the distribution of expression levels make coverage determinations more difficult.

A more useful metric for RNA-Seq is the total number of mapped reads. It is important to distinguish between total reads and mapped reads: not all reads will map onto a reference genome, so the number of usable reads will be less than the number of actual reads. The number of reads that will map depends on the library type, the quality of the sample and how complete your reference genome is. How many reads you need will depend on how sensitive your experiment needs to be for genes expressed at low levels.

Standards are ultimately set by the scientific field and those who are publishing work on related transcriptomes. In general, for large genomes we recommend 25-30 million reads for differential expression studies and ~100-200 million reads to examine rare transcripts, detect splice variants or assemble a de novo transcriptome. See the table below. Standards for RNA-Seq are also provided by the ENCODE Consortium.
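
Since only mapped reads are usable, a target number of mapped reads has to be scaled up to a total read count. The sketch below shows that adjustment. The function name and the 75% mapping rate are assumptions for illustration; the actual rate depends on library type, sample quality and reference completeness.

```python
import math


def total_reads_required(target_mapped_reads: int, mapping_rate: float) -> int:
    """Total reads to order so that about target_mapped_reads are expected to map."""
    if not 0 < mapping_rate <= 1:
        raise ValueError("mapping_rate must be in the interval (0, 1]")
    return math.ceil(target_mapped_reads / mapping_rate)


# Hypothetical example: 30 million mapped reads wanted, 75% of reads assumed to map.
print(total_reads_required(30_000_000, 0.75))  # 40000000
```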


Recommended RNA-Seq Parameters

Optimal sequencing depth for RNA-Seq will vary with the scientific objective of the study, but here are some general recommendations based on sample type and application:

| Sample Type | Reads Needed for Differential Expression (millions) | Reads Needed for Rare Transcript or De Novo Assembly (millions) | Read Length |
| --- | --- | --- | --- |
| Small Genomes (e.g. Bacteria / Fungi) | 5 | 30 - 65 | 50 SR or PE for positional info |
| Intermediate Genomes (e.g. Drosophila / C. elegans) | 10 | 70 - 130 | 50 – 100 SR or PE for positional info |
| Large Genomes (e.g. Human / Mouse) | 15 - 25 | 100 - 200 | >100 SR or PE for positional info |

For more examples see the Sequencing Coverage and Read Depth Guide.

In the end, the goal of expression analysis is to determine the set of all expressed transcripts and their frequencies in a cell at a given time. The number of reads gives us an estimate of the relative expression levels in a cell at a given time. With an accurate measure of transcript length, absolute measurements can be estimated by normalization. One common RNA-Seq measure is reads per kilobase per million reads (RPKM) [2].

RPKM = (10^9 * C) / (N * L) 

C is the number of mappable reads on a feature (transcript, exon, etc.), L is the length of the feature in base pairs, and N is the total number of mappable reads; the factor of 10^9 rescales the result to reads per kilobase per million mapped reads. Since the average transcript length may vary between samples, transcripts per million (TPM) is also used as an expression measure. When the average transcript length is 1 kb, 1 TPM is equal to 1 RPKM, which is approximately 1 transcript per cell. One disadvantage of this approach is that the proportional representation of each gene depends on the expression level of all genes. Highly expressed transcripts take up a large proportion of sequence reads, so small expression changes in these transcripts have an effect on transcripts with low expression. To overcome a high dependence on read counts, a variation of RPKM, fragments per kilobase of exon per million mapped fragments (FPKM) [3], normalizes for sequencing depth and builds on the assumption that the number of reads generated for a transcript is proportional to its abundance and length. This is also an oversimplification, as the length distribution of fragmented RNA needs to be considered. With coverage bias, short transcripts are overly represented while long transcripts are underrepresented.
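
A small worked example may make the two measures more concrete. The sketch below takes the RPKM formula with its 10^9 factor literally, which requires L in base pairs and N as a raw read count, and computes TPM by length-normalizing first and then scaling to one million. The toy counts and lengths are invented, and N is assumed to equal the sum of the listed counts.

```python
def rpkm(counts, lengths_bp):
    """RPKM = 10^9 * C / (N * L), with L in base pairs and N the total mapped reads."""
    n = sum(counts)  # assumption: every mapped read falls on one of these features
    return [1e9 * c / (n * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Normalize each count by feature length, then scale so the values sum to 10^6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical toy data for three transcripts.
counts = [500, 1000, 1500]
lengths_bp = [1000, 2000, 1000]

print([round(v, 1) for v in rpkm(counts, lengths_bp)])
print([round(v, 1) for v in tpm(counts, lengths_bp)])
```

In this toy data the first two transcripts receive the same value under both measures even though one has twice the reads, because it is also twice as long.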

Many improved normalization methods have been developed. Differential expression software packages such as DESeq use scaling factors: for each sample, the scaling factor is the median, across genes, of the ratio of a gene's read count in that sample to the geometric mean of its read counts across all samples. Several other normalization procedures now use scaling factors to account for variable library sizes.
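
The median-of-ratios idea behind such scaling factors can be sketched in plain Python. This is a simplified illustration rather than the DESeq implementation: the function name, the choice to skip genes with a zero count and the toy matrix are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factors, one per sample.

    counts[g][s] is the read count of gene g in sample s.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c == 0 for c in gene):
            continue  # the geometric mean would be zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for s, c in enumerate(gene):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # Median of the log ratios, converted back to a ratio.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [50, 100]]
factors = size_factors(counts)
print([round(f, 4) for f in factors])

# Dividing each count by its sample's factor puts the samples on a common scale.
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

Using the median makes the factor robust to a handful of very highly expressed genes, which is the weakness of total-count scaling described above.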


Replication, Randomization and Multiplexing

Replicates are essential in any biological experiment, and high throughput sequencing is no exception. Samples are subject to variation, which makes biological replicates important for statistical significance and for identifying sources of variation. Despite the desire to cut back on replicates to reduce cost, it’s important to remember that many factors can cause a sequencing run or sample to fail. If you don’t have sufficient replicates, you may have to repeat your sequencing run. In general we recommend at least 4 biological replicates for every experiment.

Randomization is the process of assigning biological samples at random to the different groups within an experiment. This reduces bias by equalizing independent variables that have not been accounted for in the experimental design. Randomization reduces instrument effects, systematic bias and the potential occurrence and effect of confounding factors (operational, procedural and person-related). The two main sources of variation that contribute to confounding factors are 1) library effects that arise during reverse transcription and amplification and 2) unit effects (sequencing lanes [Illumina and SOLiD], chips [Ion], plates [Roche 454]) such as poor base calling or bad sequencing cycles. We recommend randomizing your samples by making sure each sequencing unit contains samples from both control and experimental groups. This can be done by barcoding or indexing your samples to allow for multiplexing.

DNA (or cDNA fragments made from RNA) can be labelled with sample-specific sequences, or barcodes, that allow multiple samples to be included in the same sequencing reaction. Multiplexing allows each read to be assigned to its sample after the sequencing run is complete, and can be used to create balanced, pooled experimental designs. If you have 8 samples that require the sequencing output of 3 Illumina lanes, unit effects can be eliminated by multiplexing all 8 samples into one pool and loading that 8-sample pool into all 3 lanes. All unit (lane) effects will then be the same for each sample. Multiplexing also has the advantage of eliminating phasing issues related to low-plex pools. Low-plex pools can result in no signal in one of the color channels of an index read. Image registration might then fail and no base will be called for that cycle. If a base isn’t called, the samples cannot be demultiplexed.
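
The balanced design in the 8-sample, 3-lane example can be laid out and sized as in the sketch below. The function names are hypothetical. The per-lane yield uses the HiSeq High-Output v4 figure from the reads table, which is an assumption about the instrument, and the per-sample estimate assumes equal pooling so that every sample receives an equal share of each lane.

```python
def pooled_design(samples, n_lanes):
    """Balanced layout: every lane receives the same pool containing all samples."""
    return {f"lane_{i + 1}": list(samples) for i in range(n_lanes)}


def expected_reads_per_sample(n_samples: int, n_lanes: int, reads_per_lane: int) -> int:
    """Expected reads per sample, assuming each sample gets an equal share of every lane."""
    return n_lanes * reads_per_lane // n_samples


samples = [f"sample_{i}" for i in range(1, 9)]  # 8 hypothetical sample names
design = pooled_design(samples, n_lanes=3)
for lane, pool in design.items():
    print(lane, pool)

# 250,000,000 reads per lane is the HiSeq High-Output v4 estimate from the table.
print(expected_reads_per_sample(8, 3, 250_000_000))  # 93750000
```

Because every sample appears in every lane, a poor lane lowers the read count of all samples equally instead of confounding one group.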


Poor Quality Sequencing Run

When designing your sequencing experiment, pay careful attention to the information in this guide to avoid a failed or poor sequencing run. While not completely avoidable, a poor sequencing run will typically contain several of the following types of uninformative reads:

  1. Un-mappable reads
  2. PCR duplicates
  3. Low quality reads
  4. Adapter dimer or sequencing adapter reads
  5. Non-unique mapped reads or poor sequencing diversity
  6. Reads mapping to uninformative sequence (e.g. rRNA)


Library Preparation

While read quality is largely governed by sequencing technology, library preparation can also seriously affect quality and is a major source of coverage variation. Choosing the right library preparation strategy is essential for every sequencing project. Choosing the wrong library preparation approach can cause GC bias, amplification artifacts, unevenness of coverage, poor mapping and uninformative reads. For more information on choosing the right library preparation kit see our NGS Library Preparation Guide.


Custom Sequencing Primer Design

There are two main cases when a user must design and synthesize their own custom Illumina sequencing primers:

  1. A non-standard library preparation protocol changes the sequence where a typical Illumina primer would bind.
  2. The initial part of read 1 contains the same sequence across multiple samples. By extending the Illumina sequencing primer into this constant region, the user can begin sequencing at a variable sequence or one of interest. This is common with amplicon and 16S libraries.

Outside of these cases, the Illumina sequencing primers included in the cluster generation kits are sufficient for standard library sequencing.

Proper design of these primer sequences is critical to the success of the sequencing run. Each of the following factors needs to be considered when designing a custom sequencing primer:

  1. The Tm of the custom primer must match the Tm of the sequencing primer it is designed to replace
  2. The primer should not form any secondary structures or self-anneal
  3. It should be designed so that extension occurs in the 5’ to 3’ direction
  4. Custom primers should be submitted to a service provider at an appropriate volume and concentration (see below)
  5. The primer should be synthesized and HPLC purified to remove short or incomplete sequences
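
The first two checks in the list above can be roughed out in code before a primer is ordered. The sketch below uses a basic composition-based Tm approximation, 64.9 + 41 × (G + C − 16.4) / N, which is only a coarse estimate for oligos longer than about 13 nt, and a crude self-complementarity flag based on the longest stretch shared between the primer and its reverse complement. The function names and both example sequences are hypothetical; they are not Illumina primer sequences, and a dedicated oligo analysis tool should be used for a real design.

```python
def basic_tm(seq: str) -> float:
    """Approximate Tm in deg C from base composition: 64.9 + 41 * (G + C - 16.4) / N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41 * (gc - 16.4) / len(seq)


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_self_complementary_run(seq: str) -> int:
    """Length of the longest stretch shared by seq and its reverse complement.

    A long run suggests the primer could fold on itself or pair with another copy.
    """
    a, b = seq.upper(), reverse_complement(seq)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical sequences, for illustration only.
custom = "ACGTACGTACGTACGTACGT"  # fully self-complementary, a poor primer
print(round(basic_tm(custom), 2))
print(longest_self_complementary_run(custom))
print(longest_self_complementary_run("AAAAAAAAAA"))
```

Comparing `basic_tm` for the custom primer against the same estimate for the primer it replaces gives a first indication of whether the two are matched.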

Instruments most amenable to custom sequencing primers

You can use a custom sequencing primer for Read 1, Read 2, or an Index read on MiSeq and NextSeq platforms. These sequencing primers can be loaded directly into the reagent cartridge. You also have the option to use a combination of custom primers and Illumina primers in the reagent cartridge, which is achieved by spiking in your custom primer. Instructions for loading custom primers on a MiSeq or NextSeq can be found below.

Illumina MiSeq

600 µL of each custom primer should be prepared at a concentration of 0.5 µM. The 600 µL can be loaded into the reagent cartridge at position 18 for custom Read 1, position 19 for a custom Index Read, and position 20 for custom Read 2. For more instructions, follow Illumina’s guide on using custom sequencing primers on the MiSeq:

MiSeq Custom Primer Guide

Illumina NextSeq

2 mL of each custom primer should be prepared at a concentration of 0.3 µM. The 2 mL should be loaded into the reagent cartridge at position 7 for custom Read 1, position 8 for custom Read 2, and position 9 for custom index 1 and 2 (1 mL each if using both index 1 and index 2 primers). For more instructions, follow Illumina’s guide on using custom sequencing primers on the NextSeq:

NextSeq Custom Primer Guide
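
The volumes and concentrations given above for the MiSeq and NextSeq translate into dilution volumes through C1 × V1 = C2 × V2. The sketch below assumes a 100 µM primer stock, which is a common resuspension concentration but an assumption here, and the function name is hypothetical. Follow the Illumina guides for the diluent to use.

```python
def dilution_volumes(final_conc_um: float, final_vol_ul: float, stock_conc_um: float):
    """Return (stock_ul, diluent_ul) from C1 * V1 = C2 * V2."""
    if stock_conc_um < final_conc_um:
        raise ValueError("stock must be at least as concentrated as the final solution")
    stock_ul = final_conc_um * final_vol_ul / stock_conc_um
    return stock_ul, final_vol_ul - stock_ul


STOCK_UM = 100.0  # assumed stock concentration, not specified in this guide

# Targets taken from the text: MiSeq 600 µL at 0.5 µM, NextSeq 2 mL at 0.3 µM.
for platform, conc_um, vol_ul in [("MiSeq", 0.5, 600.0), ("NextSeq", 0.3, 2000.0)]:
    stock, diluent = dilution_volumes(conc_um, vol_ul, STOCK_UM)
    print(f"{platform}: {stock:.1f} µL stock + {diluent:.1f} µL diluent")
```

With a 100 µM stock this works out to about 3 µL of primer for the MiSeq and 6 µL for the NextSeq, with diluent making up the rest of the volume.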


References


  1. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3):231–239.
  2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628.
  3. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515.

Other Resources: