While we’ve standardized the way sequencing runs are ordered across all platforms, customization is required when determining the run parameters to achieve your experimental goals. If you’re new to high throughput sequencing and have questions about how you should design your sequencing run, fill out our free consultation form and we'll get in touch with you to help.
We highly recommend using Genohub's NGS Matching Engine to determine the right amount of sequencing capacity on various instruments and to explore different options easily. Simply enter your specifications and instantly see services that match your output requirements.
With single read runs the sequencing instrument reads from one end of a fragment to the other end. Paired end runs read from one end to the other end, and then start another round of reading from the opposite end. Single read runs are faster, cheaper and are typically sufficient for profiling or counting studies such as RNA-Seq or ChIP-Seq.
Paired end runs give additional positioning information in the genome, making them a good choice for de novo genome assembly and making it easier to resolve structural rearrangements such as deletions, insertions and inversions. Experiments designed to study splice variants, epigenetic modifications (methylation) and SNP identification are best served by paired end runs. While paired end runs are more costly and time consuming, you get back twice the amount of data at less than double the cost to sequence.
Some sequencing instruments give you flexibility in choosing the number of base pairs (cycles) you can read at one time. The number of cycles corresponds to the output read length. While longer read lengths give you more accurate information on the relative positions of your bases in a genome, they are more expensive than shorter ones. 50 cycles are typically sufficient for simple mapping of reads to a reference genome, and RNA-Seq profiling or counting experiments. Read lengths greater than or equal to 100 are typically chosen for genome or transcriptome studies that require high amounts of output.
During a DNA sequencing reaction, sequences of base pairs, or "reads", are generated. Each sequencing platform and instrument yields a different number of reads. Genohub's popular NGS Matching Engine automatically calculates the minimum amount of sequencing capacity on each instrument needed to yield the required number of reads or coverage for your project. The calculations are based on the reads per unit estimates advertised by instrument manufacturers, which are summarized in the table below:
|Platform||Instrument||Unit||Reads / Unit||Reference|
|Illumina||HiSeq X Ten||Lane||375,000,000||1|
|Illumina||NextSeq 500 High-Output||Run||400,000,000||2|
|Illumina||NextSeq 500 Mid-Output||Run||130,000,000||2|
|Illumina||HiSeq High-Output v4||Lane||250,000,000||3|
|Illumina||HiSeq High-Output v3||Lane||186,048,000||3|
|Illumina||HiSeq Rapid Run||Lane||150,696,000||3|
|Illumina||MiSeq v2 Micro||Lane||4,000,000||5|
|Illumina||MiSeq v2 Nano||Lane||1,000,000||5|
|PacBio||PacBio RS II||SMRT Cell||47,000||7|
|PacBio||PacBio RS||SMRT Cell||22,000||7|
|Roche 454||GS FLX+ / FLX||1 PTP||700,000||8|
|Roche 454||GS FLX+ / FLX||1/2 PTP||350,000||8|
|Roche 454||GS FLX+ / FLX||1/4 PTP||125,000||8|
|Roche 454||GS FLX+ / FLX||1/8 PTP||50,000||8|
|Roche 454||GS FLX+ / FLX||1/16 PTP||20,000||8|
|Roche 454||GS Junior||1 PTP||70,000||8|
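The capacity calculation described above can be sketched as a ceiling division of the required reads by the advertised reads per unit. This is only an illustration of the arithmetic, not the NGS Matching Engine itself; the function name `units_needed` and the 600 million read target are hypothetical, and the yields are copied from a few rows of the table.

```python
# Reads per unit, copied from a few rows of the table above.
READS_PER_UNIT = {
    "HiSeq High-Output v4 (Lane)": 250_000_000,
    "HiSeq Rapid Run (Lane)": 150_696_000,
    "MiSeq v2 Micro (Lane)": 4_000_000,
    "PacBio RS II (SMRT Cell)": 47_000,
}


def units_needed(required_reads: int, reads_per_unit: int) -> int:
    """Smallest whole number of units yielding at least required_reads."""
    if required_reads <= 0 or reads_per_unit <= 0:
        raise ValueError("read counts must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-required_reads // reads_per_unit)


if __name__ == "__main__":
    target = 600_000_000  # hypothetical project requirement
    for unit, reads in READS_PER_UNIT.items():
        print(f"{unit}: {units_needed(target, reads)} unit(s)")
```

Advertised yields are estimates, so in practice it may be worth leaving some headroom above the minimum number of units.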
A sequencing run generates reads that sample a genome randomly and independently. These reads are not distributed equally across an entire genome; some bases are covered by fewer reads, some by more reads than the average coverage. Coverage refers to the average number of times a single base is read during a sequencing run. If the coverage is 100X, this means that on average each base was sequenced 100 times. The more frequently a base is sequenced, the more reliably it is called, resulting in better quality data. The Lander / Waterman equation is one method for determining coverage: C = LN/G, where C is coverage, L is read length, N is the number of reads and G is the haploid genome length. Our NGS Matching Engine page takes care of this calculation for you.
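As a rough illustration of the Lander / Waterman relationship, the sketch below computes C = LN/G and rearranges it to N = CG/L to estimate how many reads a target coverage requires. The function names are hypothetical, and the 3.1 Gb haploid genome length in the example is an assumed round figure for a human-sized genome.

```python
import math


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Average coverage C = L * N / G."""
    return read_length * num_reads / genome_length


def reads_for_coverage(target_coverage: float, read_length: int,
                       genome_length: int) -> int:
    """Reads needed to reach a target coverage: N = C * G / L, rounded up."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    genome = 3_100_000_000  # assumed haploid genome length in bp
    n = reads_for_coverage(30, 100, genome)
    print(n)                          # 930000000
    print(coverage(100, n, genome))   # 30.0
```

For a paired end run, each read of a pair counts as one read of length L in this formula.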
Requirements for coverage will depend on your type of study and are commonly set by a scientific body or journal. We recommend reading the ENCODE Consortium guidelines.
|Application||Recommended Coverage|
|DNA-Seq (Re-Sequencing)||30 - 80X|
|DNA-Seq (De novo assembly)||100X|
|SNP Analysis / Rearrangement Detection||10 - 30X|
|Exome||100 - 200X|
|ChIP-Seq||10 - 40X|
For more examples see the Sequencing Coverage Guide.
Determining the amount of coverage needed for an RNA-Seq experiment is difficult because different transcripts are expressed at different levels: more reads will be captured from highly expressed genes and fewer from genes expressed at low levels. Transcriptome complexity, alternate expression, 3’ associated biases and the distribution of expression levels make coverage determinations more difficult still. A more useful metric for RNA-Seq is the total number of mapped reads. It is important to distinguish between total reads and mapped reads: not all reads will map onto a reference genome, so the number of usable reads will be less than the number of reads sequenced. The number of reads that map depends on the library type, the quality of the sample and how complete your reference genome is. How many reads you need depends on how sensitive your experiment needs to be for genes expressed at low levels. Standards are ultimately set by the scientific field and those who are publishing work on related transcriptomes. In general, for large genomes we recommend 25-30 million reads for differential expression studies and ~100-200 million reads to examine rare transcripts, detect splice variants or assemble a de novo transcriptome. See the table below. Standards for RNA-Seq are also provided by the ENCODE Consortium.
Optimal sequencing depth for RNA-Seq will vary based on the scientific objective of the study, but here are some general recommendations based on sample type and application:
|Sample Type||Reads Needed for Differential Expression (millions)||Reads Needed for Rare Transcript or De Novo Assembly (millions)||Read Length|
|Small Genomes (e.g. Bacteria / Fungi)||5||30 - 65||50 SR or PE for positional info|
|Intermediate Genomes (e.g. Drosophila / C. elegans)||10||70 - 130||50 - 100 SR or PE for positional info|
|Large Genomes (e.g. Human / Mouse)||15 - 25||100 - 200||>100 SR or PE for positional info|
For more examples see the Sequencing Coverage and Read Depth Guide.
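Because only mapped reads are usable, a target expressed in mapped reads has to be scaled up to the number of reads actually sequenced. The sketch below shows that arithmetic. The function name is hypothetical and the 80% mapping rate is an assumed example value, not a figure from this guide; real mapping rates depend on library type, sample quality and the reference genome.

```python
def total_reads_needed(mapped_reads: int, mapped_percent: int) -> int:
    """Total reads to sequence so that mapped_percent of them give mapped_reads."""
    if not 0 < mapped_percent <= 100:
        raise ValueError("mapped_percent must be in the range 1-100")
    # Integer ceiling division: total = ceil(mapped_reads * 100 / mapped_percent).
    return -(-mapped_reads * 100 // mapped_percent)


if __name__ == "__main__":
    # 25 million mapped reads (large genome, differential expression)
    # at an assumed 80% mapping rate.
    print(total_reads_needed(25_000_000, 80))  # 31250000
```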
In the end, the goal of expression analysis is to determine the set of all expressed transcripts and their frequencies in a cell at a given time. The number of reads gives us an estimate of the relative expression levels in a cell at a given time. With an accurate measure of transcript length, absolute measurements can be estimated by normalization. One common RNA-Seq measure is reads per kilobase per million reads (RPKM):
RPKM = (10^9 * C) / (N * L)
C is the number of mappable reads on a feature (transcript, exon, etc.), L is the length of the feature (in base pairs) and N is the total number of mappable reads; the factor of 10^9 converts the result to reads per kilobase per million reads. Since the average transcript length may vary between samples, transcripts per million (TPM) is also used as an expression measure. When the average transcript length is 1 kb, 1 TPM is equal to 1 RPKM, which is approximately 1 transcript per cell. One disadvantage of this approach is that the proportional representation of each gene depends on the expression level of all genes. Highly expressed transcripts take up a large proportion of sequence reads, so small expression changes in these transcripts have an effect on transcripts with low expression. To overcome a high dependence on read counts, a variation of RPKM, fragments per kilobase of exon per million mapped fragments (FPKM), normalizes for sequencing depth and builds on the assumption that the number of reads generated for a transcript is proportional to its abundance and length. This is also an oversimplification, as the length distribution of fragmented RNA needs to be considered. With coverage bias, short transcripts are overrepresented while long transcripts are underrepresented.
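A minimal sketch of these two measures follows. It takes feature lengths in base pairs and the raw total number of mapped reads, which is what the factor of 10^9 in the RPKM formula implies. The function names and example counts are hypothetical, and the TPM function uses the common definition (length-normalized rates rescaled to sum to one million).

```python
def rpkm(feature_reads: int, feature_length_bp: int,
         total_mapped_reads: int) -> float:
    """RPKM = 10^9 * C / (N * L), with L in bp and N as a raw read count."""
    return 1e9 * feature_reads / (total_mapped_reads * feature_length_bp)


def tpm(counts: list, lengths_bp: list) -> list:
    """Transcripts per million: reads per base, rescaled to sum to 10^6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


if __name__ == "__main__":
    # 500 reads on a 2,000 bp transcript out of 20 million mapped reads.
    print(rpkm(500, 2_000, 20_000_000))  # 12.5
    print(tpm([100, 200, 300], [1_000, 2_000, 1_000]))
```

Unlike RPKM, TPM values always sum to the same total within a sample, which makes proportions easier to compare between samples.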
Many improved normalization methods have been developed. Differential expression software packages such as DESeq use scaling factors: for each sample, the scaling factor is the median, across genes, of the ratio of each gene's read count to its geometric mean across all samples. Several other normalization procedures now use scaling factors to account for variable library sizes.
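The median-of-ratios idea can be sketched in a few lines of plain Python. This illustrates the technique only and is not DESeq's own code; the function name `size_factors` and the toy count matrix are hypothetical, and skipping genes that have a zero count in any sample is a simplifying assumption made here so that the geometric mean is defined.

```python
import math
from statistics import median


def size_factors(counts: list) -> list:
    """Per-sample scaling factors; counts[g][s] is the count of gene g in sample s."""
    n_samples = len(counts[0])
    # Keep genes with no zero counts and store the log of their geometric mean.
    usable = [
        (row, sum(math.log(c) for c in row) / len(row))
        for row in counts
        if all(c > 0 for c in row)
    ]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    factors = []
    for s in range(n_samples):
        # Median across genes of log(count / geometric mean) for this sample.
        log_ratios = [math.log(row[s]) - log_gm for row, log_gm in usable]
        factors.append(math.exp(median(log_ratios)))
    return factors


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced twice as deeply as sample 1.
    toy = [[10, 20], [100, 200], [50, 100], [0, 7]]
    print(size_factors(toy))  # second factor is about twice the first
```

Dividing each sample's counts by its factor puts the samples on a common scale before they are compared.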
Replicates are essential in any biological experiment, and high throughput sequencing is no exception. Samples are subject to variation, which makes biological replicates important for statistical significance and for identifying sources of variation. Despite the desire to cut back on replicates to reduce cost, it’s important to remember that there are many factors which may cause a sequencing run or sample to fail. If you don’t have sufficient replicates, you may have to repeat your sequencing run. In general we recommend at least 4 biological replicates for every experiment.
Randomization is the process of assigning biological samples at random to the different groups within an experiment. This reduces bias by equalizing independent variables that have not been accounted for in the experimental design. Randomization reduces instrument effects, systematic bias and the potential for the occurrence and effect of confounding factors (operational, procedural and person confounds). The two main sources of variation that contribute to confounding factors are 1) library effects that occur due to reverse transcription and amplification and 2) unit effects (sequencing lanes [Illumina and SOLiD], chips [Ion], plates [Roche 454]) such as poor base calling or bad sequencing cycles. We recommend randomizing your samples by making sure each sequencing unit contains samples from both control and experimental groups. This can be done by barcoding or indexing your samples to allow for multiplexing.
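One way to follow this recommendation is to shuffle the samples within each group and then deal them out to sequencing units in turn, so that every unit receives samples from every group. The sketch below is a hypothetical illustration of that idea; the function name, group labels and sample names are made up for the example.

```python
import random


def assign_to_lanes(samples_by_group: dict, n_lanes: int, seed=None) -> list:
    """Randomize samples within each group, then deal them across lanes."""
    rng = random.Random(seed)
    lanes = [[] for _ in range(n_lanes)]
    slot = 0  # running counter so groups are spread evenly over lanes
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order within the group
        for sample in shuffled:
            lanes[slot % n_lanes].append((group, sample))
            slot += 1
    return lanes


if __name__ == "__main__":
    design = {
        "control": ["C1", "C2", "C3", "C4"],
        "treated": ["T1", "T2", "T3", "T4"],
    }
    for i, lane in enumerate(assign_to_lanes(design, n_lanes=2, seed=1), 1):
        print(f"Lane {i}: {lane}")
```

Each lane ends up with both groups as long as every group has at least as many samples as there are lanes. Recording the seed makes the assignment reproducible.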
DNA (or cDNA fragments made from RNA) can be labelled with sample specific sequences or barcodes that allow multiple samples to be included in the same sequencing reaction. Multiplexing allows for proper sample identification after the sequencing run is complete. Multiplexing can be used to create balanced, pooled experimental designs. If you have 8 samples that require the sequencing output obtained from 3 Illumina lanes, unit effects can be eliminated by multiplexing all 8 samples and loading the same 8-sample multiplexed pool into every lane. All unit (lane) effects will then be the same for each sample. Multiplexing also has the advantage of eliminating phasing issues related to low multiplex pools. Low multiplexed pools can result in no signal in one of the color channels of an index read. The image registration might fail and no base will be called from that cycle. If a base isn’t called, the samples cannot be demultiplexed.
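For a pooled design like this, the expected output per sample is simple arithmetic, sketched below. The example assumes the pool is perfectly balanced so that every sample receives an equal share of each lane, and it borrows the 250,000,000 reads per lane advertised for HiSeq High-Output v4; the function name is hypothetical.

```python
def reads_per_sample(n_samples: int, n_lanes: int, reads_per_lane: int) -> int:
    """Expected reads per sample when one balanced pool is loaded in every lane."""
    if n_samples <= 0 or n_lanes <= 0:
        raise ValueError("sample and lane counts must be positive")
    # Assumes an even split of each lane's reads among all samples.
    return n_lanes * reads_per_lane // n_samples


if __name__ == "__main__":
    # 8 samples pooled and loaded into each of 3 lanes.
    print(reads_per_sample(8, 3, 250_000_000))  # 93750000
```

Real pools are rarely perfectly even, so the per-sample yield should be treated as an estimate.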
When designing your sequencing experiment, pay careful attention to the information in this guide to avoid a failed or poor sequencing run. While not completely avoidable, a poor sequencing run will have several of the following types of uninformative sequencing reads:
While read quality is largely governed by sequencing technology, library preparation can also seriously affect quality and is a major source of coverage variation. Choosing the right library preparation strategy is essential for every sequencing project. Choosing the wrong library preparation approach can cause GC bias, amplification artifacts, unevenness of coverage, poor mapping and uninformative reads. For more information on choosing the right library preparation kit see our NGS Library Preparation Guide.
There are two main cases when a user must design and synthesize their own custom Illumina sequencing primers:
Outside of these cases, the Illumina sequencing primers included in the cluster generation kits are sufficient for standard library sequencing.
Proper design of these primer sequences is critical to the success of the sequencing run. Each of the following factors needs to be considered when designing a custom sequencing primer:
You can use a custom sequencing primer for Read 1, Read 2, or for an Index read on MiSeq and NextSeq platforms. These sequencing primers can be directly loaded into the reagent cartridge. You also have the option to use a combination of custom primers and Illumina primers in the reagent cartridge. This is achieved by spiking in your custom primer. Instructions for loading custom primers on a MiSeq or NextSeq can be found below.
On the MiSeq, prepare 600 µL of custom primer at a concentration of 0.5 µM. Load the 600 µL of custom primer into the reagent cartridge at position 18 for custom Read 1, position 19 for a custom Index Read or position 20 for custom Read 2. For more instructions, follow Illumina’s guide on using custom sequencing primers on the MiSeq.
On the NextSeq, prepare 2 mL of custom primer at a concentration of 0.3 µM. Load 2 mL of each custom primer into the reagent cartridge at position 7 for custom Read 1, position 8 for custom Read 2, and position 9 for custom index 1 and 2 (1 mL each if using both index 1 and index 2 primers). For more instructions, follow Illumina’s guide on using custom sequencing primers on the NextSeq.
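The volumes needed to reach these working concentrations follow from C1 × V1 = C2 × V2. The sketch below works through that arithmetic. The 100 µM stock concentration is an assumption made for the example and should be replaced with the concentration of your own primer stock; the function name is hypothetical, and the choice of diluent should follow Illumina's guide.

```python
def dilution_volumes(stock_um: float, final_um: float,
                     final_volume_ul: float) -> tuple:
    """Return (stock volume, diluent volume) in µL using C1 * V1 = C2 * V2."""
    if not 0 < final_um <= stock_um:
        raise ValueError("final concentration must be positive and not exceed stock")
    stock_ul = final_um * final_volume_ul / stock_um
    return stock_ul, final_volume_ul - stock_ul


if __name__ == "__main__":
    STOCK_UM = 100.0  # assumed primer stock concentration (µM)
    # MiSeq: 600 µL at 0.5 µM; NextSeq: 2 mL (2000 µL) at 0.3 µM.
    for name, final_um, volume_ul in [("MiSeq", 0.5, 600.0),
                                      ("NextSeq", 0.3, 2000.0)]:
        stock, diluent = dilution_volumes(STOCK_UM, final_um, volume_ul)
        print(f"{name}: {stock:.1f} µL stock + {diluent:.1f} µL diluent")
```

With the assumed 100 µM stock this works out to about 3 µL of stock for the MiSeq preparation and about 6 µL for the NextSeq preparation.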