Whole Exome Sequencing Guide

Kits, services, analysis, costs and best practices


The exome is the sum of all regions in the genome composed of exons. Exons are the DNA regions retained in mature messenger RNA, as opposed to introns, which are transcribed but removed during splicing.

Exome sequencing is a capture based method developed to identify variants in the coding region of genes that affect protein function. The typical workflow required to sequence and analyze an exome is as follows:

  1. Nucleic acid isolation, also known as sample preparation.
  2. Fragment DNA for capture and short read NGS. See DNA Fragmentation Methods.
  3. Construct a library.
  4. Target and capture exons using biotinylated probes.
  5. Amplify captured targets.
  6. Quality control using qPCR.
  7. Sequence using an instrument with a 2x100 or 2x150 read length.
  8. Analyze captured information, call variants.

While exome capture methods using PCR, hybrid capture and molecular inversion probes exist, the most common and efficient strategies are in-solution capture methods. In-solution capture uses pools of biotinylated oligonucleotide probes whose sequences are designed to hybridize to exon regions. After binding to genomic DNA, these probes are pulled down with streptavidin-coated magnetic beads and washed, allowing exon regions to be selectively sequenced. Several commercial kits for exome capture are described below.

While there are approximately 180,000 exons in the human genome, constituting less than 2% of total sequence, the exome contains ~80-90% of known disease-causing variants, making it a cost-effective alternative to whole genome sequencing. When performing exome-seq, users should consider not only average on-target coverage but also the local coverage of particular sites of interest. When choosing between exome and whole genome sequencing (WGS), consider that exome sequencing has the advantage that oligonucleotides are designed to particular genomic regions where typical coverage with WGS is not enough for SNP calling. It is also more affordable, enabling the analysis of more individuals and populations. With WGS, you can detect variants in regions not covered by exome capture, allowing for the identification of structural and non-coding variants associated with disease.

See Genohub's up-to-date list of available Exome sequencing and library prep services

Comparison of Exome Capture Kits

Exome-Seq Kits | Targeted Region | Number of Probes | Probe Type | Genomic DNA input required | Adapter addition | Probe Length (mer) | Probe Design | Price per capture (negotiable) | Designed on build | Hybridization time (hours)
Agilent SureSelect XT2 V6 Exome | 60 Mb | ~758,086 | biotinylated cRNA baits | 100 ng | Ligation | 120 | Non-overlapping, paired-end reads used to fill gaps | $270 | GRCh37 (hg19) | 16
Agilent SureSelect XT2 V5 Exome | 51 Mb | ~655,872 | biotinylated cRNA baits | 100 ng | Ligation | 120 | Non-overlapping, paired-end reads used to fill gaps | $200 | GRCh37 (hg19) | 16
IDT xGEN Exome Panel | 39 Mb | 429,826 | biotinylated DNA baits | 500 ng | Ligation | not described | Non-overlapping | $250 | GRCh37 (hg19) | 4
Illumina Nextera Rapid Capture Expanded Exome | 62 Mb | >340,000 | biotinylated DNA baits | 50 ng | Transposase | 95 | Non-overlapping (adjacent to each other) | $250 | GRCh37 (hg19) | 24-48
Roche Nimblegen SeqCap EZ Exome v3.0 | 64 Mb | >2,100,000 | biotinylated DNA baits | 1 ug | Ligation | 60 - 90 | Overlapping baits | $600 | GRCh37 (hg19) | 72

Exome Kit Descriptions and Protocol Overview

Agilent HaloPlex

The HaloPlex Exome Kit contains 2.5 million probes designed to cover human coding regions. The kit is designed for quickly targeting smaller capture regions. The target size is 37 Mb, covering 21,522 genes and 357,999 exons. The overall workflow takes 1.5 days and requires 250 ng of input DNA.

Protocol Overview:

  1. The procedure begins with a genomic DNA digest using restriction enzymes
  2. Hybridization of HaloPlex probe library to DNA digests. Hybridization results in genomic DNA fragment circularization and incorporation of Illumina indexes and flow cell binding motifs
  3. DNA probe hybrids are captured using streptavidin coated magnetic beads binding to biotinylated probe DNA
  4. Targeted fragments are amplified, producing a sequence ready, target enriched library

Agilent SureSelect

The Agilent SureSelect Exome kits use an in-solution capture method based on long, 120 mer biotinylated cRNA baits that enrich exome regions from genomic DNA fragments.

Protocol Overview:

  1. Starting with genomic DNA, samples are sheared resulting in small DNA fragments
  2. Libraries are prepared with Illumina compatible adapters and indices
  3. Biotinylated cRNA baits are incubated with the library for 16 hours
  4. Targeted regions are selected using magnetic streptavidin beads
  5. Targeted regions are amplified, producing a sequence ready library

Agilent SureSelect QXT

The SureSelect QXT kit combines a transposase based library preparation method with SureSelect’s well characterized target capture system. Enrichment probes are 120 nt long baits that allow the capture of exomes, gene panels or custom targets. The biggest advantage of QXT is the hybridization time, only 90 minutes. The protocol is similar to SureSelect (described above), but only requires 50 ng of starting material and takes less than one day to complete. As of 9/2014, there hasn’t been a publication we know of that has used this kit and it hasn’t been compared head to head as others have in the references below. If you’ve used this kit, we’d love to hear your feedback.

IDT xGEN Exome

The IDT xGEN Exome panel consists of 429,826 biotinylated probes designed to capture 19,396 genes (a 39 Mb region). The probes are individually synthesized and quality controlled as opposed to array synthesized probes, reducing truncations in the pools. While the protocol requires at least 500 ng of a constructed library, hybridization times are relatively short (4 hours) compared to competitor exome panels. The xGEN panel only targets coding sequences (CDS) in the RefSeq database.

Protocol Overview:

  1. Blocking oligos are prepared, combined with a DNA library and dried down
  2. 5' biotinylated capture probes are added and hybridized to the library for 4 hours
  3. Streptavidin beads are prepared and added to the hybridized targets
  4. Unbound DNA is removed by washing
  5. Remaining fragments are amplified, enriching targeted regions

Illumina Nextera Rapid Capture Exome

Nextera exome kits come in two different panels: Nextera Rapid Capture Exome and Nextera Rapid Capture Expanded Exome. The former hybridizes to 45 Mb of targeted coding sequence while the expanded panel delivers 62 Mb of exons, untranslated regions (UTRs) and miRNA. The entire library prep and hybridization capture requires only 50 ng of genomic DNA and takes 5 hours to complete, making it one of the more user-friendly exome capture kits.

Protocol Overview:

  1. Libraries are prepared from as little as 50 ng of genomic material with Nextera's transposase based chemistry
  2. DNA libraries are denatured
  3. Denatured single stranded DNA is hybridized to biotinylated probes designed to regions on the exome
  4. Fragments are enriched with streptavidin beads and eluted
  5. Fragments are amplified, producing sequence ready enriched targets

Illumina TruSeq Exome

The TruSeq Exome Enrichment Kits are an in-solution sequence capture system designed for isolating human exon regions. Kits include 340,427 95 mer probes constructed against the human NCBI37/hg19 reference genome. The probe set is designed to enrich 201,121 exons spanning 20,794 genes of interest. The kit covers 64 Mb of the human genome, with each 95 mer probe targeting libraries that are 300-400 bp. In addition to major exon regions, the kit provides coverage of non-coding DNA in exon flanking regions, including UTRs and promoters.

Protocol Overview:

  1. The workflow begins with creating pooled indexed libraries from up to 6 samples
  2. Sample libraries are denatured into single stranded DNA and hybridized to 95 mer biotin labelled probes
  3. Pools are then enriched with streptavidin magnetic beads and pulled from solution using a magnet
  4. Enriched DNA fragments are eluted from the beads and hybridized for a second enrichment reaction
  5. Fragments are amplified, producing sequence ready enriched targets

Roche Nimblegen SeqCap

SeqCap EZ Exome Library Kits utilize an in-solution based capture method to enable enrichment of the entire human exome. The kit targets 64 Mb of the human genome using 2.1 million long oligonucleotides.

Protocol Overview:

  1. The protocol begins with genomic DNA library preparation
  2. The library is hybridized to the SeqCap oligo pool
  3. Magnetic beads are used to pull down the captured genomic DNA fragments
  4. Enriched DNA fragments are eluted from the beads and hybridized for a second enrichment reaction
  5. Unbound fragments are washed and the enriched fragment pool is amplified, producing sequence ready enriched targets

MYcroarray MYbaits

MYbaits are a fully customizable in-solution DNA capture system using custom biotinylated RNA baits. Baitsets are priced in modules of 20,000 unique bait sequences, which is particularly affordable for designs targeting hundreds or thousands of loci. MYbaits kits are compatible with any type of barcoded NGS library.

Protocol Overview:

  1. A genomic DNA library is heat denatured and hybridized to the RNA baits, typically overnight (or longer for degraded or very rare targets)
  2. After hybridization, the biotinylated baits bound to captured genomic DNA are pulled out of solution using streptavidin-coated magnetic beads
  3. Non-specifically bound DNA is washed away and captured DNA is eluted from the RNA baits using heat
  4. Post-captured DNA is amplified, producing sequence ready enriched targets

Factors Affecting Capture Efficiency

Unlike PCR amplification, where two primers must anneal near a target in the genome and amplify it with specificity, exon capture involves hybridization of a single probe, usually attached to a magnetic bead. After hybridization, you rely on the pulldown of that target, followed by amplification. There is significantly more variability with hybridization capture, which results in on-target rates that are typically no better than 75% (see calculating on-target rates below). Several other factors affect capture efficiency, including:

  1. GC rich regions - UTRs and promoter regions are typically very GC rich, often lowering capture efficiency and increasing variability between these regions and other more balanced ones.
  2. Quality of DNA - Poor quality DNA, typical of extractions from FFPE, can introduce bias as certain regions tend to be more fragmented than others. If capture isn't balanced, this results in bias and complications during downstream SNP calling and other forms of analysis.
  3. Quantity of DNA - Low input amounts of DNA usually require a lot more PCR cycles in order to get enough library for the hybridization of capture probes to be efficient. Higher PCR cycles can result in a significant amount of PCR duplicates, making conclusions from data analysis less informative. While traditionally 1 ug of a reasonably diverse library was required for capture, newer capture protocols now only require 50 ng.
  4. Pseudogenes - Can reduce evenness of coverage.
  5. Fragment or insert size - The inserts required from the kits listed above vary based on the size of the probes being used in capture. It is important fragmentation of DNA be tuned to meet those size requirements for efficient capture.
  6. Repeat elements - Will reduce the evenness with which reads are distributed across the exome, resulting in the need for more sequencing to call de novo SNPs.

It is important to remember that, regardless of the kit you end up choosing, variability in capture efficiency and coverage is inherent in exome sequencing. That being said, many of these variables are controllable. Whenever you improve sequencing depth, breadth and evenness of coverage, de novo variants are a lot easier to call. In the next few sections we highlight how to calculate the amount of sequencing you need for your exome-seq project and go into detail about the capture kits commercially available.

If you're looking for exome-seq services, start your search on our NGS Matching Engine. An experienced Genohub scientist will help you choose the right kit and coverage for your application.

Calculating the Amount of Sequencing You Need for Your Exome Study

Calculate On-Target Rate or Enrichment Efficiency

Enrichment efficiency = Passing Filter (PF) reads mapped to target / Total number of PF reads mapped to reference.

If using the correct blocking oligos, this rate usually ranges from 0.65 – 0.75.
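As a rough illustration of the ratio above, the sketch below counts mapped reads that overlap at least one target interval. The read and target tuples are hypothetical stand-ins for records parsed from an alignment file and a capture BED file, and the overlap rule (any shared base counts as on-target) is an assumption rather than a definition taken from any particular kit.

```python
def on_target_rate(mapped_reads, targets):
    """Fraction of mapped PF reads that overlap at least one target interval.

    Both arguments are iterables of half-open (chrom, start, end) tuples,
    the same coordinate convention used by BED files.
    """
    mapped_reads = list(mapped_reads)
    if not mapped_reads:
        raise ValueError("no mapped reads supplied")

    # Group target intervals by chromosome for lookup.
    by_chrom = {}
    for chrom, start, end in targets:
        by_chrom.setdefault(chrom, []).append((start, end))

    def hits_target(chrom, start, end):
        # Two half-open intervals overlap when each starts before the other ends.
        return any(t_start < end and start < t_end
                   for t_start, t_end in by_chrom.get(chrom, []))

    on_target = sum(1 for read in mapped_reads if hits_target(*read))
    return on_target / len(mapped_reads)


# Hypothetical example data: two exon targets and four mapped reads.
targets = [("chr1", 100, 200), ("chr1", 500, 650)]
reads = [
    ("chr1", 90, 190),    # overlaps the first target
    ("chr1", 600, 700),   # overlaps the second target
    ("chr1", 250, 350),   # intronic, off target
    ("chr2", 100, 200),   # different chromosome, off target
]
rate = on_target_rate(reads, targets)
print(f"on-target rate: {rate:.2f}")
```

The linear scan over targets keeps the example short; for a real exome with hundreds of thousands of intervals, an interval tree or a dedicated tool would be the practical choice.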

Calculate Mean Normalized Coverage

Normalized coverage expresses the coverage at each base position as a fraction of the average coverage over all base positions. Mean normalized coverage indicates how much sequencing is required to yield a given percentage of targeted bases at a particular read depth.

Normalized Coverage = Coverage at each base position / Average Coverage over all base positions

Determine How Much Exome Sequencing Data You Need

First, identify the mean sequencing coverage required.

Mean Sequencing Coverage = Desired Coverage / Normalized Coverage.

Let’s say your desired coverage is 20X and mean normalized coverage is 0.2. The mean sequencing coverage would be 20X / 0.2 = 100X.
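One way to estimate a normalized coverage value from pilot data is sketched below. It applies the normalized coverage formula above to a list of per-base depths, then reports the normalized coverage reached by a chosen fraction of targeted bases. The 90% fraction and the depth values are assumptions made for illustration, not figures from any kit documentation.

```python
import math


def normalized_coverage(depths):
    """Divide the depth at each base position by the average depth."""
    mean_depth = sum(depths) / len(depths)
    if mean_depth == 0:
        raise ValueError("average coverage is zero")
    return [depth / mean_depth for depth in depths]


def normalized_level_at_fraction(depths, fraction):
    """Largest normalized coverage reached by at least `fraction` of bases."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(normalized_coverage(depths), reverse=True)
    # The k-th best-covered base sets the level that k bases meet or exceed.
    k = math.ceil(fraction * len(ranked))
    return ranked[k - 1]


# Hypothetical per-base depths over a tiny target region (average 55X).
depths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
level = normalized_level_at_fraction(depths, 0.9)

desired_coverage = 20
mean_sequencing_coverage = desired_coverage / level
print(f"normalized coverage met by 90% of bases: {level:.3f}")
print(f"mean coverage needed for {desired_coverage}X: {mean_sequencing_coverage:.0f}X")
```

In this toy data set, nine of the ten bases sit at or above 20X when the average is 55X, so the mean sequencing coverage needed for 20X at 90% of bases comes back as the existing average.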

Determine Amount of Sequencing You Need to Meet Your Coverage Requirement

Required Number of PF Mapped Sequence = (Targeted Bases) x (Mean Sequencing Coverage) / Enrichment Efficiency.

Let’s assume you’re using the Roche Nimblegen SeqCap EZ Exome v3.0 kit, which has a targeted region of 64 Mb (see table above). The mean sequencing coverage you require, calculated above is 100X, and let’s say your on-target rate or enrichment efficiency is ~0.70.

So, (64Mb)(100X) / 0.70 = 9.1 Gb of passing filter and mapped sequence.
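The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the two formulas in this section; the function names are illustrative, and the read-pair estimate additionally assumes that every base of each 2x100 read pair passes filter and maps, which real runs will not achieve.

```python
import math


def mean_sequencing_coverage(desired_coverage, normalized_coverage):
    """Mean Sequencing Coverage = Desired Coverage / Normalized Coverage."""
    return desired_coverage / normalized_coverage


def required_pf_mapped_bases(targeted_bases, mean_coverage, enrichment_efficiency):
    """(Targeted Bases) x (Mean Sequencing Coverage) / Enrichment Efficiency."""
    return targeted_bases * mean_coverage / enrichment_efficiency


# Values from the worked example: 20X desired, 0.2 normalized coverage,
# a 64 Mb target and an on-target rate of 0.70.
mean_cov = mean_sequencing_coverage(20, 0.2)
required_bases = required_pf_mapped_bases(64e6, mean_cov, 0.70)

# Optimistic lower bound on read pairs for a 2x100 run (200 bases per pair).
read_pairs = math.ceil(required_bases / (2 * 100))

print(f"mean sequencing coverage: {mean_cov:.0f}X")
print(f"required PF mapped sequence: {required_bases / 1e9:.1f} Gb")
print(f"minimum 2x100 read pairs: {read_pairs:,}")
```

Swapping in the targeted region of another kit from the comparison table, or a different on-target rate, shows how strongly the sequencing requirement depends on enrichment efficiency.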

Recommended Read Length for your Exome Study

While not a hard requirement, we generally recommend paired-end 2x100 read lengths for exome capture sequencing. ~80% of exons are <200 bp in length (Sakharkar, 2004), so a 2x100 read should be ideal for most experiments.

Basic Recommendations for Calling Variants and Analyzing Depth of Coverage

Exome variant analysis recommendations

  1. Trim adapter sequences using Cutadapt
    • Input files: FASTA, FASTQ
    • Output files: same as input
  2. Align reads to a human genome build using BWA, post-process the alignments using SAMtools, then remove duplicate reads and write a BAM file using Picard
    • Input files: FASTA, FASTQ
    • Output files: BAM
  3. Call germline SNPs and indels using GATK HaplotypeCaller to produce a VCF file
    • Input files: BAM
    • Output files: VCF
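The three steps above could be strung together roughly as follows. This sketch only builds the shell command lines and prints them; nothing is executed. The sample name, file names, thread count, the adapter sequence (the common Illumina TruSeq adapter prefix, which would not suit Nextera libraries) and the use of a `picard` wrapper script on the PATH are all assumptions to adapt to your own setup.

```python
import shlex


def exome_variant_commands(sample, fastq1, fastq2, reference, targets_bed,
                           adapter="AGATCGGAAGAGC", threads=4):
    """Return shell command lines for trimming, alignment, duplicate removal
    and germline variant calling. All file names are placeholders."""
    q = shlex.quote
    trim1 = f"{sample}.trim.R1.fastq.gz"
    trim2 = f"{sample}.trim.R2.fastq.gz"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    vcf = f"{sample}.vcf.gz"
    # bwa expects the literal two characters "\t" between read group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    return [
        # 1. Trim adapters from both reads of each pair.
        f"cutadapt -a {adapter} -A {adapter} -o {q(trim1)} -p {q(trim2)} "
        f"{q(fastq1)} {q(fastq2)}",
        # 2a. Align with BWA-MEM and coordinate-sort with SAMtools.
        f"bwa mem -t {threads} -R {q(read_group)} {q(reference)} "
        f"{q(trim1)} {q(trim2)} | samtools sort -o {q(sorted_bam)} -",
        # 2b. Remove duplicate reads with Picard, then index the BAM.
        f"picard MarkDuplicates I={q(sorted_bam)} O={q(dedup_bam)} "
        f"M={q(metrics)} REMOVE_DUPLICATES=true",
        f"samtools index {q(dedup_bam)}",
        # 3. Call germline SNPs and indels, restricted to the capture targets.
        f"gatk HaplotypeCaller -R {q(reference)} -I {q(dedup_bam)} "
        f"-L {q(targets_bed)} -O {q(vcf)}",
    ]


commands = exome_variant_commands(
    "NA12878", "NA12878_R1.fastq.gz", "NA12878_R2.fastq.gz",
    "GRCh37.fa", "exome_targets.bed",
)
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them before running each one, for example with `subprocess.run(command, shell=True, check=True)`. The reference FASTA would also need its BWA index, `.fai` index and sequence dictionary prepared beforehand.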

Exome coverage analysis recommendations

  1. Use an existing exome BED file or use Bedtools to create intervals in the genomic regions you’d like to analyze
  2. Compute read depth of your genomic intervals using GATK DepthOfCoverage
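Once per-base depths are available, summarizing them is straightforward. The sketch below does not use GATK DepthOfCoverage output; it assumes a simpler three-column, tab-delimited layout (chromosome, position, depth), which is what a command such as `samtools depth -a -b targets.bed sample.bam` produces. The 20X and 100X thresholds and the example lines are illustrative.

```python
def summarize_depth(lines, thresholds=(20, 100)):
    """Summarize per-base depth records of the form 'chrom<TAB>pos<TAB>depth'."""
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth records found")

    total = len(depths)
    summary = {"bases": total, "mean_depth": sum(depths) / total}
    for threshold in thresholds:
        covered = sum(1 for depth in depths if depth >= threshold)
        summary[f"fraction_ge_{threshold}x"] = covered / total
    return summary


# Hypothetical records for four target bases, one of them uncovered.
example_lines = [
    "chr1\t101\t120",
    "chr1\t102\t95",
    "chr1\t103\t18",
    "chr1\t104\t0",
]
summary = summarize_depth(example_lines)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Including zero-depth positions matters here: if uncovered target bases are left out of the input, both the mean depth and the fraction of bases above each threshold will look better than they are.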

Off-Target Reads (reads that don't align to the target region)

Exome sequencing tends to produce a large portion of off-target reads. In some cases as much as 40% of the data produced by capture methods does not come from exon regions. The three main types of off-target reads include:

  1. Intron and intergenic reads
  2. Viral and bacterial genome reads
  3. Mitochondrial reads

Poorly designed biotinylated probes (baits), fragmented input DNA, small capture regions, inefficient hybridization conditions, and spurious adapter to adapter annealing are the reasons such a large portion of exome reads can end up being off-target. When calculating the number of reads required to meet mean on target exon coverage, off-target rates need to be considered.

Off-target reads can include functionally important genomic regions such as promoters, conserved non-coding sequences, untranslated regions (UTRs) and microRNA. A recent review (Samuels et al.) suggests data mining and extracting useful information from non-exonic reads is valuable. Tools for analyzing these off-target reads are also described.

Considerations for Whole Exome Sequencing:

1. What sequencing instrument and read length should I choose for exome-seq?

To obtain the coverage typically required for whole human exome sequencing, we recommend the Illumina HiSeq or NextSeq 500 platforms. You should use a minimum read length of 2x100, or 200 bp. While sequencing at a longer read length gives you slight improvements in single nucleotide variant (SNV) calls, the improvements are modest and not necessarily worth the extra time and cost. Similarly for INDELs, the number of variants identified does not increase significantly beyond a 2x100 bp sequencing run. Longer read lengths do increase mean coverage, and the improvement grows with greater sequencing depth. So if you’re trying to obtain 200x coverage, consider a longer read length. If you’re aiming for 100x coverage, unless the costs are roughly the same, we recommend starting with a 2x100 bp sequencing run.

2. How much sequencing coverage do I need for exome sequencing?

For accurate variant identification we recommend at least 100x coverage. In many cases higher coverage will reduce the chance of a false positive SNV call. Higher coverage is also necessary to increase sensitivity and improve the detection of rare variants. If you need advice on the depth of sequencing required for your exome project, fill out our complimentary consultation form and describe your exome sequencing project. A member of the Genohub scientific staff will assist you.

3. How do I calculate the sequencing coverage or depth required for my whole exome sequencing study?

Several factors must be considered when calculating coverage for exome sequencing. These include:

  1. Average on target rate or probe enrichment efficiency
  2. Mean normalized coverage
  3. Desired coverage vs. mean coverage

See the detailed example in calculating exome coverage.

4. Which exome sequencing capture kit should I use for my study?

We’ve summarized the specifications for each capture kit in this table. In general, we’ve found the Roche Nimblegen SeqCap v3.0 + UTR kit contains the largest designed target and coding regions. Nimblegen SeqCap v3.0 and Agilent SureSelect XT2 v5 +UTR exome capture kits also tend to have lower off target enrichment compared to Illumina Nextera Rapid Capture (expanded exome). The Agilent SureSelect XT2 v5 + UTR exome kits generally have been shown to have the highest accuracy in SNV detection and best GC-rich region enrichment.

While we’ve just made some recommendations above, all three kits are frequently used in many facilities and results with each will likely be adequate for most applications. In many cases the amount of DNA you have will be a deciding factor on the kit that you should use. With some kits you can start with as little as 50 ng of input material. Others require at least 1 ug. Most exome capture kits allow you to barcode and pool your samples prior to capture, reducing capture reagent costs. Several older kit versions are available to users who prefer capturing each barcoded library individually.

5. How can I compare the annotation and exome capture design between each kit?

You can download annotated files providing information on genomic regions covered by the capture probes and genes included in these regions:

6. Should I choose Whole Genome Sequencing (WGS) or Whole Exome Sequencing (WES) for my project?

There are advantages and disadvantages to both WGS and WES. With the decreasing cost of WGS, these need to be carefully considered. We've laid out each of these in an earlier blog post: Whole Genome Sequencing (WGS) vs. Whole Exome Sequencing (WES).

Exome Sequencing Service Categories:

A. Standard Exome Sequencing Services - Prices Range from $550 - $800 per sample
  1. 100x sequencing coverage
  2. Appropriate for non-cancer based applications
  3. Transposase or Ligation based library preparation
  4. Options to choose between IDT xGEN and Illumina Nextera Rapid Capture

Search for Standard Whole Exome Sequencing Services

B. Deep or High Coverage Exome Sequencing Service - Prices range from $760 - $1,800 per sample
  1. 200x sequencing coverage
  2. Appropriate for case samples or tumor normal pairs
  3. Ligation based library preparation
  4. Options to choose between Agilent SureSelect, NimbleGen SeqCap, IDT and Illumina Nextera

Search for High Depth Whole Exome Sequencing Services

C. Clinical Grade Whole Human Exome Sequencing - Prices range from $850 - $1,800 per sample
  1. Performed in a CLIA certified facility

Search for Clinical Whole Exome Sequencing Services

Published References Comparing Exome Capture Kits:

  1. Asan et al: Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011 Sep 28;12(9):R95.
  2. Bainbridge, M.N. et al., 2010. Whole exome capture in solution with 3Gbp of data. Genome Biology.
  3. Clark et al: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011 Sep 25. doi: 10.1038/nbt.1975
  4. Kingsmore, S.F. & Saunders, C.J., 2011. Deep Sequencing of Patient Genomes for Disease Diagnosis: When Will It Become Routine? Science Translational Medicine, 3(87), p.1-4. Review of Bainbridge et al and discussion of WGS and targeted or Exome-Seq. They also suggest that an exome costs 5-15 fold less than a WGS.
  5. Maxmen, A., 2011. Exome Sequencing Deciphers Rare Diseases. Cell, 144, p.635-637. A review of the undiagnosed Diseases Program at NIH. Exome-Seq and high-resolution microarrays for genotyping. They mention the team’s first reported discovery of a new disease, which was published in The New England Journal of Medicine.
  6. Natsoulis, G. et al., 2011. A Flexible Approach for Highly Multiplexed Candidate Gene Targeted Resequencing. PloS one, 6(6).
  7. Parla et al: A comparative analysis of exome capture. Genome Biol. 2011 Sep 29;12(9):R97.
  8. Sakharkar et al: Distributions of exons and introns in the human genome. In Silico Biol. 2004; 4(4):387-93.
  9. Samuels, D.C., 2013. Finding the lost treasures in exome sequencing data. Trends in Genetics, 29, p.593-599.
  10. Sulonen et al: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011 Sep 28;12(9):R94.