While we’ve standardized the way sequencing runs are ordered across all platforms, customization
is
required when determining the run parameters to achieve your experimental goals. If you’re new
to
high throughput sequencing and have questions about how you should design your sequencing run,
fill
out our free consultation form and we'll
get in touch with you to help.
We highly recommend using Genohub's NGS Matching Engine to determine the right amount of
sequencing capacity on various instruments and to explore different options. Simply enter your
specifications and instantly see services that match your output requirements.
Type of Run – Single Read (SR) or Paired End (PE)
With single read runs the sequencing instrument reads from one end of a fragment to the other
end.
Paired end runs read from one end to the other end, and then start another round of reading from
the
opposite end. Single read runs are faster, cheaper and are typically sufficient for profiling or
counting studies such as
RNA-Seq
or ChIP-Seq.
Paired end runs give additional positioning information in the genome, making them a good choice
for de novo genome assembly and making it easier to resolve structural rearrangements such as
deletions, insertions and inversions. Experiments designed to study splice variants, epigenetic
modifications (methylation) and SNP identification are best served by paired-end runs. While
paired end runs are more costly and time consuming, you get back twice the amount of data at less
than double the cost to sequence.
Read Length
Some sequencing instruments give you flexibility in choosing the number of base pairs (cycles)
you
can read at one time. The number of cycles corresponds to the output read length. While longer
read
lengths give you more accurate information on the relative positions of your bases in a genome,
they
are more expensive than shorter ones. 50 cycles are typically sufficient for simple mapping of
reads
to a reference genome, and RNA-Seq profiling or counting experiments. Read lengths greater than
or
equal to 100 are typically chosen for genome or transcriptome studies that require high amounts
of
output.
Number of Reads
A sequencing run generates "reads", the base sequences read from individual fragments. Each
sequencing platform and instrument yields a different number of reads. Genohub's popular
NGS Matching Engine automatically calculates the minimum amount of sequencing capacity on each
instrument to yield the required number of reads or coverage for your project. The calculations
are based on read/unit estimates advertised by instrument manufacturers, which are summarized in
the table below:
Numbers of Single Reads by Instrument Manufacturer
Platform | Instrument | Unit | Reads / Unit | Reference |
Illumina | HiSeq X Ten | Lane | 375,000,000 | 1 |
Illumina | HiSeq 3000/4000 | Lane | 312,500,000 | 1 |
Illumina | NextSeq 500 High-Output | Run | 400,000,000 | 2 |
Illumina | NextSeq 500 Mid-Output | Run | 130,000,000 | 2 |
Illumina | HiSeq High-Output v4 | Lane | 250,000,000 | 3 |
Illumina | HiSeq High-Output v3 | Lane | 186,048,000 | 3 |
Illumina | HiSeq Rapid Run | Lane | 150,696,000 | 3 |
Illumina | HiScanSQ | Lane | 93,024,000 | 3 |
Illumina | GAIIx | Lane | 42,075,000 | 3 |
Illumina | MiSeq v3 | Lane | 25,000,000 | 4 |
Illumina | MiSeq v2 | Lane | 16,000,000 | 3 |
Illumina | MiSeq | Lane | 5,000,000 | 3 |
Illumina | MiSeq v2 Micro | Lane | 4,000,000 | 5 |
Illumina | MiSeq v2 Nano | Lane | 1,000,000 | 5 |
Ion | Proton I | Chip | 60,000,000 | 6 |
Ion | PGM 318 | Chip | 4,000,000 | 6 |
Ion | PGM 316 | Chip | 2,000,000 | 6 |
Ion | PGM 314 | Chip | 400,000 | 6 |
PacBio | PacBio RS II | SMRT Cell | 47,000 | 7 |
PacBio | PacBio RS | SMRT Cell | 22,000 | 7 |
Roche 454 | GS FLX+ / FLX | 1 PTP | 700,000 | 8 |
Roche 454 | GS FLX+ / FLX | 1/2 PTP | 350,000 | 8 |
Roche 454 | GS FLX+ / FLX | 1/4 PTP | 125,000 | 8 |
Roche 454 | GS FLX+ / FLX | 1/8 PTP | 50,000 | 8 |
Roche 454 | GS FLX+ / FLX | 1/16 PTP | 20,000 | 8 |
Roche 454 | GS Junior | 1 PTP | 70,000 | 8 |
SOLiD | 5500xl W | Lane | 266,666,667 | 9 |
SOLiD | 5500 W | Lane | 266,666,667 | 9 |
SOLiD | 5500 | Lane | 81,500,000 | 10 |
SOLiD | 5500xl | Lane | 81,500,000 | 10 |
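As a rough illustration of the capacity calculation described above, the sketch below rounds a
required read count up to whole sequencing units. The reads-per-unit figures are copied from the
table; the function name `units_needed` and the 300 million read target are hypothetical, and this
is not a description of how the NGS Matching Engine itself is implemented.

```python
# Reads per unit for a few instruments, copied from the table above.
READS_PER_UNIT = {
    "HiSeq High-Output v4 (lane)": 250_000_000,
    "MiSeq v3 (lane)": 25_000_000,
    "Ion Proton I (chip)": 60_000_000,
}


def units_needed(required_reads: int, reads_per_unit: int) -> int:
    """Return the smallest whole number of units that yields required_reads."""
    if required_reads <= 0 or reads_per_unit <= 0:
        raise ValueError("read counts must be positive integers")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return -(-required_reads // reads_per_unit)


if __name__ == "__main__":
    target = 300_000_000  # hypothetical project requirement
    for instrument, per_unit in READS_PER_UNIT.items():
        print(f"{instrument}: {units_needed(target, per_unit)} unit(s)")
```

For the assumed 300 million read target this gives 2 HiSeq High-Output v4 lanes, 12 MiSeq v3
lanes or 5 Ion Proton I chips.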
Depth of Coverage (DNA)
A sequencing run generates reads that sample a genome randomly and independently [1]. These reads
are not distributed equally across an entire genome; some bases are covered by fewer reads, some
by more reads than the average coverage. Coverage refers to the average number of times a single
base is read during a sequencing run. If the coverage is 100X, this means that on average each
base was sequenced 100 times. The more frequently a base is sequenced, the more reliably it is
called, resulting in better quality data. The Lander / Waterman equation is one method for
determining coverage:
C = LN / G
where C is coverage, L is read length, N is the number of reads and G is the haploid genome
length. Our NGS Matching Engine page takes care of this calculation for you.
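The Lander / Waterman relationship can be sketched in a few lines. The function names and the
3.1 Gb human genome size used in the example are assumptions made for illustration. N counts
individual reads, so each fragment in a paired-end run contributes two reads.

```python
import math


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Lander/Waterman average coverage: C = L * N / G."""
    return read_length * num_reads / genome_length


def reads_for_coverage(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Rearranged form, N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    genome = 3_100_000_000  # assumed haploid human genome size in bp
    print(reads_for_coverage(30, 100, genome))   # reads needed for 30X at 100 bp
    print(coverage(100, 930_000_000, genome))    # coverage from 930 million reads
```

Under these assumptions, 30X coverage with 100 bp reads requires about 930 million reads.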
Requirements for coverage will depend on your type of study and are commonly set by a scientific
body
or journal. We recommend reading the ENCODE Consortium
guidelines.
Estimate of Coverage Requirements by Application Type
For more examples see the Sequencing
Coverage Guide.
Depth of Coverage (RNA)
Determining the amount of coverage needed for an RNA-Seq experiment is difficult because
different transcripts are expressed at different levels, meaning that more reads will be captured
from highly expressed genes while fewer reads will be captured from genes expressed at low
levels. Transcriptome complexity, alternate expression, 3’ associated biases and the distribution
of expression levels make coverage determinations more difficult. A more useful metric for
RNA-Seq is the total number of mapped reads. It is important to distinguish between total reads
and mapped reads: not all reads will map onto a reference genome, so the number of usable reads
will be less than the number of reads sequenced. The number of reads that will map depends on the
library type, the quality of the sample and how complete your reference genome is.
Determining how many reads you need will depend on how sensitive your experiment needs to be for
genes expressed at low levels. Standards are ultimately set by the scientific field and those who
are publishing work on related transcriptomes. In general, for large genomes we recommend 25-30
million reads for differential expression studies and ~100-200 million reads to examine rare
transcripts, detect splice variants or assemble a de novo transcriptome. See the table below.
Standards for RNA-Seq are also provided by the
ENCODE Consortium.
Recommended RNA-Seq Parameters
Optimal sequencing depth for RNA-Seq will vary based on the scientific objective of the study,
but here are some general recommendations based on sample type and application:
Sample Type | Reads Needed for Differential Expression (millions) | Reads Needed for Rare Transcript or De Novo Assembly (millions) | Read Length |
Small Genomes (e.g. Bacteria / Fungi) | 5 | 30 - 65 | 50 SR or PE for positional info |
Intermediate Genomes (e.g. Drosophila / C. elegans) | 10 | 70 - 130 | 50 – 100 SR or PE for positional info |
Large Genomes (e.g. Human / Mouse) | 15 - 25 | 100 - 200 | >100 SR or PE for positional info |
For more examples see the Sequencing
Coverage and Read Depth Guide.
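Because only mapped reads are usable, a read budget has to be scaled up by the expected mapping
rate. The sketch below shows that adjustment; the function name and the 80% mapping rate are
assumptions for illustration, and the real rate depends on your library type, sample quality and
reference genome.

```python
def total_reads_needed(mapped_reads_needed: float, mapping_rate: float) -> float:
    """Total reads to sequence so that mapped_reads_needed are expected to map.

    mapping_rate is the expected fraction of reads that map (0 < rate <= 1).
    """
    if not 0 < mapping_rate <= 1:
        raise ValueError("mapping_rate must be in the interval (0, 1]")
    return mapped_reads_needed / mapping_rate


if __name__ == "__main__":
    # 25 million mapped reads for differential expression in a large genome,
    # with an assumed (hypothetical) mapping rate of 80%.
    print(f"{total_reads_needed(25_000_000, 0.80):,.0f} total reads")
```

With these assumed numbers, roughly 31 million reads would need to be sequenced to obtain 25
million mapped reads.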
In the end, the goal of expression analysis is to determine the set of all expressed transcripts
and
their frequencies in a cell at a given time. The number of reads gives us an estimate of the
relative expression levels in a cell at a given time. With an accurate measure of transcript
length,
absolute measurements can be estimated by normalization. One common RNA-Seq measure is
reads per kilobase per million reads (RPKM) [2].
RPKM = (10^9 * C) / (N * L)
C is the number of mappable reads on a feature (transcript, exon, etc.), L is the length of the
feature in base pairs, and N is the total number of mappable reads in the experiment. The factor
of 10^9 converts base pairs to kilobases and reads to millions of reads. Since the average
transcript length may vary between samples, transcripts per million (TPM) is also used as an
expression measure. When the average transcript length is 1 kb, 1 TPM is equal to 1 RPKM, which
is approximately 1 transcript per cell. One disadvantage of this approach is that the
proportional representation of each gene depends on the expression level of all other genes.
Highly expressed transcripts take up a large proportion of sequence reads, so small expression
changes in these transcripts have an effect on the values reported for transcripts with low
expression. To overcome a high dependence on read counts, a variation of RPKM, fragments per
kilobase of exon per million mapped fragments (FPKM) [3], counts fragments rather than reads. It
normalizes for sequencing depth and builds on the assumption that the number of reads generated
for a transcript is proportional to abundance and length. This is also an oversimplification, as
the length distribution of fragmented RNA needs to be considered. With coverage bias, short
transcripts are overly represented while long transcripts are underrepresented.
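The two measures can be sketched as follows. This assumes C and N are raw read counts and L is
given in base pairs, which is what the factor of 10^9 in the RPKM formula implies. The function
names and example numbers are hypothetical.

```python
from typing import List, Sequence


def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """RPKM = 10^9 * C / (N * L), with L in bp and N as a raw read count."""
    return 1e9 * read_count / (total_mapped_reads * feature_length_bp)


def tpm(read_counts: Sequence[int], lengths_bp: Sequence[int]) -> List[float]:
    """Transcripts per million for one sample.

    Each count is first divided by its feature length, then the
    length-normalized rates are rescaled so that they sum to one million.
    """
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


if __name__ == "__main__":
    # 1,000 reads on a 2 kb transcript in a library of 10 million mapped reads.
    print(rpkm(1_000, 2_000, 10_000_000))
    print(tpm([100, 200, 700], [1_000, 1_000, 2_000]))
```

Because TPM values always sum to one million within a sample, they are easier to compare across
samples than RPKM values, whose sum differs from sample to sample.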
Many improved normalization methods have been developed. Differential expression software
packages such as DESeq use scaling factors: for each sample, the factor is the median, across
genes, of the ratio of each gene's read count to that gene's geometric mean across all samples.
Several other normalization procedures now use scaling factors to account for variable library
sizes.
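The median-of-ratios idea behind such scaling factors can be illustrated with a small
standard-library sketch. It is a simplified illustration, not the code used by DESeq itself. The
function name is hypothetical, and skipping genes with a zero count in any sample is a
simplifying assumption made here so that the geometric mean is defined.

```python
import math
from statistics import median
from typing import List, Sequence


def size_factors(counts: Sequence[Sequence[int]]) -> List[float]:
    """Median-of-ratios scaling factors.

    counts holds one row per gene and one column per sample. For each gene the
    geometric mean across samples is used as a reference; a sample's factor is
    the median over genes of count / reference.
    """
    n_samples = len(counts[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue  # assumption: ignore genes with a zero count in any sample
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has a nonzero count in every sample")
    return [math.exp(median(r)) for r in log_ratios]


if __name__ == "__main__":
    # Sample 2 was sequenced twice as deeply as sample 1.
    table = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(size_factors(table))
```

Dividing each sample's counts by its factor puts the samples on a common scale before
expression levels are compared.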
Replication, Randomization and Multiplexing
Replicates are essential in any biological experiment, and high throughput sequencing is no
exception. Samples are subject to variation, which makes biological replicates important for
establishing statistical significance and identifying sources of variation. Despite the desire to
cut back on replicates to reduce cost, it’s important to remember that many factors may cause a
sequencing run or sample to fail. If you don’t have sufficient replicates, you may have to repeat
your sequencing run. In general we recommend at least 4 biological replicates for every
experiment.
Randomization is the process of assigning biological samples at random to the different groups
within an experiment. This reduces bias by equalizing independent variables that have not been
accounted for in the experimental design. Randomization reduces instrument effects, systematic
bias and the potential occurrence and effect of confounding factors (operational, procedural and
person-related). The two main sources of variation that contribute to confounding factors are
1) library effects that occur due to reverse transcription and amplification and 2) unit effects
(sequencing lanes [Illumina and SOLiD], chips [Ion], plates [Roche 454]) such as poor base
calling and bad sequencing cycles. We recommend randomizing your samples by making sure each
sequencing unit contains samples from both control and experimental groups. This can be done by
barcoding or indexing your samples to allow for multiplexing.
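One way to follow this recommendation is to shuffle the samples within each group and then deal
them across the sequencing units, so that every unit receives samples from every group. The
sketch below is a minimal illustration. The function name, sample labels and seed are
hypothetical, and it assumes each group has at least as many samples as there are units.

```python
import random
from typing import Dict, List, Optional, Tuple


def assign_to_units(
    samples_by_group: Dict[str, List[str]],
    n_units: int,
    seed: Optional[int] = None,
) -> List[List[Tuple[str, str]]]:
    """Randomly assign samples to units while balancing groups across units."""
    rng = random.Random(seed)
    units: List[List[Tuple[str, str]]] = [[] for _ in range(n_units)]
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)              # random order within the group
        offset = rng.randrange(n_units)    # random starting unit for this group
        for i, sample in enumerate(shuffled):
            # Deal round-robin so each unit gets a share of every group.
            units[(i + offset) % n_units].append((group, sample))
    return units


if __name__ == "__main__":
    design = {
        "control": ["c1", "c2", "c3", "c4"],
        "treated": ["t1", "t2", "t3", "t4"],
    }
    for lane, members in enumerate(assign_to_units(design, n_units=2, seed=7), start=1):
        print(f"lane {lane}: {members}")
```

Recording the seed alongside the experiment makes the assignment reproducible.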
DNA (or cDNA fragments made from RNA) can be labelled with sample-specific sequences, or
barcodes, that allow multiple samples to be included in the same sequencing reaction. The
barcodes allow each read to be assigned to its sample after the sequencing run is complete.
Multiplexing can be used to create balanced, pooled experimental designs. If you have 8 samples
that require the sequencing output obtained from 3 Illumina lanes, unit effects can be eliminated
by multiplexing all 8 samples and loading the same 8-sample pool into all 3 lanes. All unit
(lane) effects will then be the same for each sample. Multiplexing more samples also has the
advantage of eliminating phasing issues related to low-plex pools. Low-plex pools can result in
no signal in one of the color channels of an index read. The image registration might then fail
and no base will be called from that cycle. If a base isn’t called, the samples cannot be
demultiplexed.
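The arithmetic behind a balanced pooled design is straightforward. The sketch below assumes that
samples are pooled in equal amounts and that every read can be assigned to a sample, which is an
idealization. The function names are hypothetical, and the lane yield is the HiSeq High-Output v4
figure from the table earlier in this guide.

```python
def reads_per_sample(n_samples: int, n_lanes: int, reads_per_lane: int) -> int:
    """Expected reads per sample when one pool is loaded on every lane."""
    return n_lanes * reads_per_lane // n_samples


def max_samples_per_lane(reads_per_lane: int, reads_needed_per_sample: int) -> int:
    """Largest pool size that still gives each sample its required reads."""
    return reads_per_lane // reads_needed_per_sample


if __name__ == "__main__":
    lane_yield = 250_000_000  # HiSeq High-Output v4, reads per lane
    print(reads_per_sample(n_samples=8, n_lanes=3, reads_per_lane=lane_yield))
    print(max_samples_per_lane(lane_yield, reads_needed_per_sample=25_000_000))
```

For the 8-sample, 3-lane example this gives about 93.75 million reads per sample. In practice
pooling is never perfectly even, so it is sensible to leave some margin.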
Poor Quality Sequencing Run
When designing your sequencing experiment, pay careful attention to the information in this guide
to avoid a failed or poor sequencing run. While not completely avoidable, a poor sequencing run
will contain several of the following types of uninformative sequencing reads:
- Un-mappable reads
- PCR duplicates
- Low quality reads
- Adapter dimer or sequencing adapter reads
- Non-unique mapped reads or poor sequencing diversity
- Reads mapping to uninformative sequence (e.g. rRNA)
Library Preparation
While read quality is largely governed by sequencing technology, library preparation can also
seriously affect quality and is a major source of coverage variation. Choosing the right library
preparation strategy is essential for every sequencing project. Choosing the wrong library
preparation approach can cause GC bias, amplification artifacts, unevenness of coverage, poor
mapping and uninformative reads. For more information on choosing the right library preparation
kit
see our
NGS Library Preparation
Guide.
Custom Sequencing Primer Design
There are two main cases when a user must design and synthesize their own custom Illumina
sequencing
primers:
- A non-standard library preparation protocol changes the sequence where a typical Illumina
primer
would bind.
- The initial part of read 1 contains the same sequence across multiple samples. By extending
the Illumina sequencing primer into this constant region, the user can begin sequencing at a
variable sequence or one of interest. This is common with amplicon and 16S libraries.
Outside of these cases, the Illumina sequencing primers included in the cluster generation kits
are
sufficient for standard library sequencing.
Proper design of these primer sequences is critical to the success of the sequencing run. Each of
the
following factors needs to be considered when designing a custom sequencing primer:
- The Tm of the custom primer must match the Tm of the sequencing primer it is designed to
replace
- The primer should not form any secondary structures or self-anneal
- It should be designed so that extension occurs in the 5’ to 3’ direction
- Custom primers should be submitted to a service provider at an appropriate volume and
concentration (see below)
- The primer should be synthesized and HPLC purified to remove short or incomplete sequences
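A first-pass screen for the Tm and self-annealing criteria above might look like the sketch
below. It uses a GC-content rule of thumb for melting temperature (64.9 + 41 * (G + C - 16.4) / N,
intended for oligos longer than about 13 nt), which is only an approximation, so final designs
should be checked with a nearest-neighbor calculator under the relevant salt conditions. The
function names, the 2 °C tolerance, the 6-base window and both example sequences are hypothetical
placeholders, not Illumina primer sequences or Illumina recommendations.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tm_basic(seq: str) -> float:
    """Approximate melting temperature in deg C from GC content."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_matches(custom: str, reference: str, tolerance_c: float = 2.0) -> bool:
    """True if the two approximate Tm values differ by at most tolerance_c."""
    return abs(tm_basic(custom) - tm_basic(reference)) <= tolerance_c


def has_self_complement(seq: str, min_len: int = 6) -> bool:
    """True if any window of min_len bases has its reverse complement in seq.

    Such stretches can pair with each other, hinting at hairpins or self-dimers.
    """
    seq = seq.upper()
    for start in range(len(seq) - min_len + 1):
        if reverse_complement(seq[start:start + min_len]) in seq:
            return True
    return False


if __name__ == "__main__":
    reference = "ACCTGATTCGGACTTCAGGCTAACTGGATCCAT"  # hypothetical placeholder
    custom = "ACCTGATTCGGACTTCAGGCTAACTGGTTGCAC"     # hypothetical placeholder
    print(round(tm_basic(reference), 1), round(tm_basic(custom), 1))
    print(tm_matches(custom, reference), has_self_complement(custom))
```

A primer that fails either check is a candidate for redesign, while passing both does not
guarantee that the primer will perform well on the instrument.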
Instruments most amenable to custom sequencing primers
You can use a custom sequencing primer for Read 1, Read 2, or for an Index read on MiSeq and
NextSeq
platforms. These sequencing primers can be directly loaded into the reagent cartridge. You also
have
the option to use a combination of custom primers and Illumina primers in the reagent cartridge.
This is achieved by spiking in your custom primer. Instructions for loading custom primers on a
MiSeq or NextSeq can be found below.
Illumina MiSeq
Prepare 600 µL of custom primer at a concentration of 0.5 µM. Load the 600 µL into the reagent
cartridge at position 18 for custom Read 1, position 19 for a custom Index Read, or position 20
for custom Read 2. For more instructions, follow Illumina’s guide on using custom sequencing
primers on the MiSeq:
MiSeq
Custom Primer Guide
Illumina NextSeq
Prepare 2 mL of each custom primer at a concentration of 0.3 µM. Load 2 mL of each custom primer
into the reagent cartridge at position 7 for custom Read 1, position 8 for custom Read 2, and
position 9 for custom index 1 and 2 (1 mL each if using both index 1 and index 2 primers). For
more instructions, follow Illumina’s guide on using custom sequencing primers on the NextSeq:
NextSeq
Custom Primer Guide
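The volumes and concentrations above translate into a simple dilution (C1 x V1 = C2 x V2). The
sketch below computes how much primer stock is needed. The 100 µM stock concentration and the
function name are assumptions for illustration, and the diluent to use should be taken from the
Illumina guide for your instrument.

```python
def stock_volume_ul(final_conc_um: float, final_volume_ul: float, stock_conc_um: float) -> float:
    """Volume of stock (in µL) needed to reach final_conc_um in final_volume_ul."""
    if stock_conc_um < final_conc_um:
        raise ValueError("stock must be at least as concentrated as the final solution")
    return final_conc_um * final_volume_ul / stock_conc_um


if __name__ == "__main__":
    stock = 100.0  # assumed primer stock concentration in µM
    for platform, conc, volume in [("MiSeq", 0.5, 600.0), ("NextSeq", 0.3, 2000.0)]:
        primer = stock_volume_ul(conc, volume, stock)
        print(f"{platform}: {primer:.1f} µL stock + {volume - primer:.1f} µL diluent")
```

With a 100 µM stock, this works out to about 3 µL of primer for the MiSeq and about 6 µL for the
NextSeq, made up to the final volume with diluent.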
References
1. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical
analysis. Genomics 2(3):231–239.
2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying
mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628.
3. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ,
Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515.
Other Resources: