Designing Your Next Generation Sequencing Run

While we’ve standardized the way sequencing runs are ordered across all platforms, customization is required when determining the run parameters to achieve your experimental goals. If you’re new to high throughput sequencing and have questions about how you should design your sequencing run, fill out our free consultation form and we'll get in touch with you to help.

We highly recommend Genohub's NGS Matching Engine for determining the right amount of sequencing capacity on various instruments and for exploring different options. Simply enter your specifications and instantly see services that match your output requirements.

Type of Run – Single Read (SR) or Paired End (PE)

With single-read runs, the sequencing instrument reads each fragment from one end toward the other. Paired-end runs read from one end toward the other, and then start a second round of reading from the opposite end. Single-read runs are faster and cheaper, and are typically sufficient for profiling or counting studies such as RNA-Seq or ChIP-Seq.

Paired-end runs give additional positioning information in the genome, making them a good choice for de novo genome assembly and making it easier to resolve structural rearrangements such as deletions, insertions and inversions. Experiments designed to study splice variants, epigenetic modifications (methylation) and SNP identification are best served by paired-end runs. While paired-end runs can be more costly and time-consuming, you get back twice the amount of data at less than double the cost to sequence.

Read Length

Some sequencing instruments give you flexibility in choosing the number of base pairs (cycles) you can read at one time. The number of cycles corresponds to the output read length. While longer read lengths give you more accurate information on the relative positions of your bases in a genome, they are more expensive than shorter ones. 50 cycles are typically sufficient for simple mapping of reads to a reference genome, and RNA-Seq profiling or counting experiments. Read lengths greater than or equal to 100 are typically chosen for genome or transcriptome studies that require high amounts of output.

Number of Reads

A DNA sequencing reaction generates sequenced stretches of base pairs, or "reads". Each sequencing platform and instrument yields a different number of reads. Genohub's NGS Matching Engine automatically calculates the minimum amount of sequencing capacity on each instrument needed to yield the required number of reads or coverage for your project. The calculations are based on reads-per-unit estimates advertised by instrument manufacturers, which are summarized in the table below:

Numbers of Single Reads by Instrument Manufacturer

Platform	Instrument	Unit	Reads / Unit	Reference
Illumina	HiSeq X Ten	Lane	375,000,000	1
Illumina	HiSeq 3000/4000	Lane	312,500,000	1
Illumina	NextSeq 500 High-Output	Run	400,000,000	2
Illumina	NextSeq 500 Mid-Output	Run	130,000,000	2
Illumina	HiSeq High-Output v4	Lane	250,000,000	3
Illumina	HiSeq High-Output v3	Lane	186,048,000	3
Illumina	HiSeq Rapid Run	Lane	150,696,000	3
Illumina	HiScanSQ	Lane	93,024,000	3
Illumina	GAIIx	Lane	42,075,000	3
Illumina	MiSeq v3	Lane	25,000,000	4
Illumina	MiSeq v2	Lane	16,000,000	3
Illumina	MiSeq	Lane	5,000,000	3
Illumina	MiSeq v2 Micro	Lane	4,000,000	5
Illumina	MiSeq v2 Nano	Lane	1,000,000	5
Ion	Proton I	Chip	60,000,000	6
Ion	PGM 318	Chip	4,000,000	6
Ion	PGM 316	Chip	2,000,000	6
Ion	PGM 314	Chip	400,000	6
PacBio	PacBio RS II	SMRT Cell	47,000	7
PacBio	PacBio RS	SMRT Cell	22,000	7
Roche 454	GS FLX+ / FLX	1 PTP	700,000	8
Roche 454	GS FLX+ / FLX	1/2 PTP	350,000	8
Roche 454	GS FLX+ / FLX	1/4 PTP	125,000	8
Roche 454	GS FLX+ / FLX	1/8 PTP	50,000	8
Roche 454	GS FLX+ / FLX	1/16 PTP	20,000	8
Roche 454	GS Junior	1 PTP	70,000	8
SOLiD	5500xl W	Lane	266,666,667	9
SOLiD	5500 W	Lane	266,666,667	9
SOLiD	5500	Lane	81,500,000	10
SOLiD	5500xl	Lane	81,500,000	10
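
As a rough illustration of how the reads-per-unit figures above translate into sequencing capacity, the sketch below rounds a project's read requirement up to whole units. It is not the NGS Matching Engine's actual logic; the function name `units_needed`, the dictionary keys and the 60 million read target are hypothetical, and only the per-unit yields are copied from the table.

```python
# Reads per unit copied from the table above (a small subset).
READS_PER_UNIT = {
    "HiSeq High-Output v4 (lane)": 250_000_000,
    "MiSeq v3 (lane)": 25_000_000,
    "Ion PGM 318 (chip)": 4_000_000,
}


def units_needed(required_reads: int, reads_per_unit: int) -> int:
    """Smallest whole number of units yielding at least required_reads."""
    if required_reads <= 0 or reads_per_unit <= 0:
        raise ValueError("read counts must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-required_reads // reads_per_unit)


if __name__ == "__main__":
    target = 60_000_000  # hypothetical project requirement
    for name, capacity in READS_PER_UNIT.items():
        print(f"{name}: {units_needed(target, capacity)} unit(s)")
```

Because the yields are manufacturer estimates, it is reasonable to treat the result as a lower bound rather than a guarantee.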

Depth of Coverage (DNA)

A sequencing run generates reads that sample a genome randomly and independently [1]. These reads are not distributed equally across the genome; some bases are covered by fewer reads and some by more reads than the average. Coverage refers to the average number of times a single base is read during a sequencing run. If the coverage is 100X, each base was sequenced 100 times on average. The more often a base is sequenced, the more reliably it can be called, which improves the quality of your data. The Lander/Waterman equation is one method for determining coverage: C = LN/G, where C is coverage, L is read length, N is the number of reads and G is the haploid genome length. Our NGS Matching Engine page takes care of this calculation for you.
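
The Lander/Waterman relation can be worked through in a few lines. The sketch below computes C from L, N and G, and rearranges the same equation to estimate the number of reads needed for a target coverage. The function names are hypothetical, and the 3.1 Gb haploid human genome length used in the example is an approximate value assumed for illustration.

```python
import math


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Average coverage C = L * N / G."""
    return read_length * num_reads / genome_length


def reads_for_coverage(target_coverage: float, read_length: int,
                       genome_length: int) -> int:
    """Rearranged form N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    human_genome = 3_100_000_000  # assumed approximate haploid length in bp
    n = reads_for_coverage(30, 100, human_genome)
    print(f"Reads needed for 30X at 100 bp: {n:,}")
    print(f"Coverage check: {coverage(100, n, human_genome):.1f}X")
```

Since coverage is an average, individual regions will fall above or below the computed value.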

Requirements for coverage will depend on your type of study and are commonly set by a scientific body or journal. We recommend reading the ENCODE Consortium guidelines.

Estimate of Coverage Requirements by Application Type

Application Type	Coverage
DNA-Seq (Re-Sequencing)	30 - 80X
DNA-Seq (De novo assembly)	100X
SNP Analysis / Rearrangement Detection	10 - 30X
Exome	100 - 200X
ChIP-Seq	10 - 40X

For more examples see the Sequencing Coverage Guide.

Depth of Coverage (RNA)

Determining the amount of coverage needed for an RNA-Seq experiment is difficult because different transcripts are expressed at different levels: more reads will be captured from highly expressed genes and fewer from genes expressed at low levels. Transcriptome complexity, alternate expression, 3’ associated biases and the distribution of expression levels make coverage determinations more difficult still. A more useful metric for RNA-Seq is the total number of mapped reads. It is important to distinguish between total reads and mapped reads: not all reads will map onto a reference genome, so the number of usable reads will be less than the total number of reads. The number of reads that will map depends on the library type, the quality of the sample and how complete your reference genome is. How many reads you need depends on how sensitive your experiment needs to be for genes expressed at low levels. Standards are ultimately set by the scientific field and those who are publishing work on related transcriptomes. In general, for large genomes we recommend 25-30 million reads for differential expression studies and ~100-200 million reads for examining rare transcripts, detecting splice variants and assembling a de novo transcriptome. See the table below. Standards for RNA-Seq are also provided by the ENCODE Consortium.

Recommended RNA-Seq Parameters

Optimal sequencing depth for RNA-Seq will vary based on the scientific objective of study but here are some general recommendations based on sample type and application:

Sample Type	Reads Needed for Differential Expression (millions)	Reads Needed for Rare Transcript or De Novo Assembly (millions)	Read Length
Small Genomes (e.g. Bacteria / Fungi)	5	30 - 65	50 SR or PE for positional info
Intermediate Genomes (e.g. Drosophila / C. elegans)	10	70 - 130	50 – 100 SR or PE for positional info
Large Genomes (e.g. Human / Mouse)	15 - 25	100 - 200	>100 SR or PE for positional info

For more examples see the Sequencing Coverage and Read Depth Guide.

In the end, the goal of expression analysis is to determine the set of all expressed transcripts and their frequencies in a cell at a given time. The number of reads gives us an estimate of the relative expression levels in a cell at a given time. With an accurate measure of transcript length, absolute measurements can be estimated by normalization. One common RNA-Seq measure is reads per kilobase per million reads (RPKM) [2].

RPKM = (10^9 * C) / (N * L)

C is the number of mappable reads on a feature (transcript, exon, etc.), L is the length of the feature in base pairs, and N is the total number of mappable reads; the factor of 10^9 rescales the result to reads per kilobase per million mapped reads. Since the average transcript length may vary between samples, transcripts per million (TPM) is also used as an expression measure. When the average transcript length is 1 kb, 1 TPM is equal to 1 RPKM, which is approximately 1 transcript per cell. One disadvantage of this approach is that the proportional representation of each gene depends on the expression level of all genes. Highly expressed transcripts take up a large proportion of sequence reads, so small expression changes in these transcripts have an effect on the measured levels of transcripts with low expression. To overcome a high dependence on read counts, a variation of RPKM, fragments per kilobase of exon per million mapped fragments (FPKM) [3], normalizes for sequencing depth and builds on the assumption that the number of reads generated for a transcript is proportional to its abundance and length. This is also an oversimplification, as the length distribution of fragmented RNA needs to be considered. With coverage bias, short transcripts are overrepresented while long transcripts are underrepresented.
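
A minimal sketch of both measures may help make the normalization concrete. Here L is taken in base pairs and N as a raw read count, the units under which the 10^9 factor in the RPKM formula is consistent. The TPM function uses the commonly cited definition (length-normalized counts rescaled to sum to one million), which this page does not spell out, so treat it as an assumption. Function names and example numbers are hypothetical.

```python
def rpkm(count: float, length_bp: float, total_mapped: float) -> float:
    """RPKM = 10^9 * C / (N * L), with L in bp and N as a raw read count."""
    return 1e9 * count / (total_mapped * length_bp)


def tpm(counts: list[float], lengths_bp: list[float]) -> list[float]:
    """Length-normalize each feature, then rescale so values sum to 10^6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [1e6 * r / total_rate for r in rates]


if __name__ == "__main__":
    # Hypothetical feature: 500 reads on a 2 kb transcript, 25 M mapped reads.
    print(rpkm(500, 2000, 25_000_000))
    # Hypothetical three-transcript sample.
    print(tpm([100, 200, 700], [1000, 2000, 1000]))
```

Note that TPM values always sum to the same total within a sample, whereas the sum of RPKM values differs from sample to sample.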

Many improved normalization methods have been developed. Differential expression software packages such as DESeq use scaling factors: for each sample, the scaling factor is the median, across genes, of the ratio of each gene's read count to that gene's geometric mean across all samples. Several other normalization procedures now use scaling factors to account for variable library sizes.
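
The median-of-ratios idea described above can be sketched in plain Python. This is an illustration of the technique, not DESeq's own code: the function name `size_factors` and the toy count matrix are hypothetical, and genes with a zero count in any sample are simply skipped so that the geometric mean is defined.

```python
import math
from statistics import median


def size_factors(counts: list[list[int]]) -> list[float]:
    """Per-sample scaling factors from a genes x samples count matrix."""
    n_samples = len(counts[0])
    # Keep genes with nonzero counts in every sample.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples
                     for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced about twice as deeply as sample 1.
    toy = [[10, 20], [100, 200], [50, 100], [0, 5]]
    print(size_factors(toy))
```

Dividing each sample's counts by its factor puts the samples on a common scale; using the median makes the factor robust to a few highly expressed genes.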

Replication, Randomization and Multiplexing

Replicates are essential in any biological experiment, and high-throughput sequencing is no exception. Samples are subject to variation, which makes biological replicates important for statistical significance and for identifying sources of variation. Despite the desire to cut back on replicates to reduce cost, it’s important to remember that there are many factors which may cause a sequencing run or sample to fail. If you don’t have sufficient replicates, you may have to repeat your sequencing run. In general we recommend at least 4 biological replicates for every experiment.

Randomization is the process of assigning biological samples at random to different groups within an experiment. This reduces bias by equalizing independent variables that have not been accounted for in the experimental design. Randomization reduces instrument effects, systematic bias and the occurrence and impact of confounding factors (operational, procedural and person confounds). The two main sources of variation that contribute to confounding factors are 1) library effects that arise during reverse transcription and amplification and 2) unit effects (sequencing lanes [Illumina and SOLiD], chips [Ion], plates [Roche 454]) such as poor base calling or bad sequencing cycles. We recommend randomizing your samples by making sure each sequencing unit contains samples from both control and experimental groups. This can be done by barcoding or indexing your samples to allow for multiplexing.

DNA (or cDNA fragments made from RNA) can be labelled with sample-specific sequences, or barcodes, that allow multiple samples to be included in the same sequencing reaction. Multiplexing allows for proper sample identification after the sequencing run is complete. Multiplexing can be used to create balanced, pooled experimental designs. If you have 8 samples that require the sequencing output obtained from 3 Illumina lanes, unit effects can be eliminated by multiplexing all 8 samples and loading the 8-sample pool into all 3 lanes. All unit (lane) effects will then be the same for each sample. Multiplexing more samples also avoids the problems associated with low-multiplex pools. Low-multiplex pools can result in no signal in one of the color channels of an index read. Image registration might then fail, and no base will be called for that cycle. If a base isn’t called, the samples cannot be demultiplexed.
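
The balanced design in the 8-sample, 3-lane example can be sketched as follows. The per-lane yield is the HiSeq High-Output v4 figure from the table earlier in this guide; the sample names and the function `pooled_design` are hypothetical, and the calculation assumes equimolar pooling with perfectly even yield across barcodes, which real runs only approximate.

```python
def pooled_design(samples: list[str], n_lanes: int,
                  reads_per_lane: int) -> dict[str, dict]:
    """Load one pool of all samples into every lane and estimate yield."""
    per_sample_per_lane = reads_per_lane // len(samples)
    lanes = list(range(1, n_lanes + 1))
    return {
        s: {"lanes": lanes, "expected_reads": per_sample_per_lane * n_lanes}
        for s in samples
    }


if __name__ == "__main__":
    samples = [f"S{i}" for i in range(1, 9)]  # hypothetical sample names
    design = pooled_design(samples, n_lanes=3, reads_per_lane=250_000_000)
    for name, info in design.items():
        print(name, info["lanes"], f'{info["expected_reads"]:,}')
```

Because every sample appears in every lane, a bad lane lowers the yield of all samples equally instead of confounding one group.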

Poor Quality Sequencing Run

When designing your sequencing experiment, pay careful attention to the information in this guide to avoid a failed or poor sequencing run. While not completely avoidable, a poor sequencing run will contain several of the following types of uninformative reads:

Un-mappable reads
PCR duplicates
Low quality reads
Adapter dimer or sequencing adapter reads
Non-unique mapped reads or poor sequencing diversity
Reads mapping to uninformative sequence (e.g. rRNA)

Library Preparation

While read quality is largely governed by sequencing technology, library preparation can also seriously affect quality and is a major source of coverage variation. Choosing the right library preparation strategy is essential for every sequencing project. Choosing the wrong library preparation approach can cause GC bias, amplification artifacts, unevenness of coverage, poor mapping and uninformative reads. For more information on choosing the right library preparation kit see our NGS Library Preparation Guide.

Custom Sequencing Primer Design

There are two main cases when a user must design and synthesize their own custom Illumina sequencing primers:

A non-standard library preparation protocol changes the sequence where a typical Illumina primer would bind.
The initial part of read 1 contains the same sequence across multiple samples. By extending the Illumina sequencing primer into this constant region, the user can begin sequencing at a variable sequence or one of interest. This is common with amplicon and 16S libraries.

Outside of these cases, the Illumina sequencing primers included in the cluster generation kits are sufficient for standard library sequencing.

Proper design of these primer sequences is critical to the success of the sequencing run. Each of the following factors needs to be considered when designing a custom sequencing primer:

The Tm of the custom primer must match the Tm of the sequencing primer it is designed to replace
The primer should not form secondary structures or self-anneal
It should be designed so that extension occurs in the 5’ to 3’ direction
Custom primers should be submitted to a service provider at an appropriate volume and concentration (see below)
The primer should be synthesized and HPLC purified to remove short or incomplete sequences
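
The first two checks in the list above can be roughed out in code before ordering a primer. This is only a screening sketch under stated assumptions: the Tm estimate uses the basic GC-content formula for oligos longer than 13 nt (nearest-neighbor models are generally more accurate), the 2 °C tolerance is an arbitrary illustrative threshold, and both example sequences are made up rather than real Illumina primers.

```python
def basic_tm(seq: str) -> float:
    """Approximate Tm in deg C: 64.9 + 41 * (G + C - 16.4) / length."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_self_complement(seq: str, min_len: int = 4) -> int:
    """Longest stretch whose reverse complement also occurs in the primer."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    best = 0
    for i in range(len(seq)):
        for j in range(i + min_len, len(seq) + 1):
            if seq[i:j] in rc:
                best = max(best, j - i)
            else:
                break  # longer extensions from i cannot match either
    return best


def tm_matches(custom: str, reference: str, tolerance: float = 2.0) -> bool:
    """True if the two estimated Tm values differ by at most tolerance."""
    return abs(basic_tm(custom) - basic_tm(reference)) <= tolerance


if __name__ == "__main__":
    reference = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # made-up example
    custom = "CTCTTTCCCTACACGACGCTCTTCCGATCTAGG"     # made-up example
    print(round(basic_tm(reference), 1), round(basic_tm(custom), 1))
    print(tm_matches(custom, reference), longest_self_complement(custom))
```

A long self-complementary stretch suggests possible hairpins or primer dimers and is a cue to redesign or to run a dedicated oligo analysis tool.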

Instruments most amenable to custom sequencing primers

You can use a custom sequencing primer for Read 1, Read 2, or for an Index read on MiSeq and NextSeq platforms. These sequencing primers can be directly loaded into the reagent cartridge. You also have the option to use a combination of custom primers and Illumina primers in the reagent cartridge. This is achieved by spiking in your custom primer. Instructions for loading custom primers on a MiSeq or NextSeq can be found below.

Illumina MiSeq

600 µL of custom primer should be prepared at a concentration of 0.5 µM. 600 µL of custom primer can be loaded into the reagent cartridge at position 18 for custom Read 1, position 19 for a custom Index Read, position 20 for custom Read 2. For more instructions, follow Illumina’s guide on using custom sequencing primers on the MiSeq:

MiSeq Custom Primer Guide

Illumina NextSeq

2 mL of custom primer should be prepared at a concentration of 0.3 µM. 2 mL of each custom primer should be loaded into the reagent cartridge at position 7 for custom Read 1, position 8 for custom Read 2, and position 9 for custom index 1 and 2 (1 mL each if using both index 1 and index 2 primers). For more instructions, follow Illumina’s guide on using custom sequencing primers on the NextSeq:

NextSeq Custom Primer Guide
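
The volumes and concentrations quoted for the MiSeq and NextSeq can be turned into pipetting amounts with the usual C1·V1 = C2·V2 dilution relation. The sketch below assumes a 100 µM primer stock, which is a common resuspension concentration but not something this guide specifies; the function name is hypothetical, and the appropriate diluent should be taken from Illumina's instructions.

```python
def dilution(stock_um: float, final_um: float,
             final_volume_ul: float) -> tuple[float, float]:
    """Return (stock volume, diluent volume) in µL for the target dilution."""
    if final_um > stock_um:
        raise ValueError("final concentration exceeds stock concentration")
    stock_ul = final_um * final_volume_ul / stock_um
    return stock_ul, final_volume_ul - stock_ul


if __name__ == "__main__":
    stock = 100.0  # assumed stock concentration in µM
    # MiSeq: 600 µL at 0.5 µM; NextSeq: 2 mL (2000 µL) at 0.3 µM.
    for platform, conc, vol in [("MiSeq", 0.5, 600.0),
                                ("NextSeq", 0.3, 2000.0)]:
        primer, diluent = dilution(stock, conc, vol)
        print(f"{platform}: {primer:.1f} µL stock + {diluent:.1f} µL diluent")
```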

References

[1] Lander ES, Waterman MS. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3):231–239.
[2] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628.
[3] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515.