Metagenomics refers to both a research technique and a research field. As a field, metagenomics can be defined as the genomic analysis of microbial DNA from environmental communities. Metagenomics tools enable the population analysis of unculturable or previously unknown microbes. This is important because only around 1-2% of bacteria can be cultured in the laboratory (1). The ability to identify microbes without a priori knowledge of what a sample contains is opening new doors in disciplines such as microbial ecology, virology, microbiology, environmental sciences and biomedical research. Sequencing-based examination of the metagenome has become a powerful tool for generating novel hypotheses.
Shotgun metagenomic sequencing is a relatively new environmental sequencing approach used to examine thousands of organisms in parallel and comprehensively sample all genes, providing insight into community biodiversity and function. Shotgun sequencing allows for the detection of low abundance members of microbial communities.
There are several steps involved in a sequencing based metagenomics project. These include DNA extraction, library preparation, sequencing, assembly, annotation and statistical analysis.
A reproducible method for extracting DNA from microbial communities is essential for surveying and whole-genome metagenomic analysis. Isolation and extraction must yield high-quality nucleic acid for subsequent library preparation and sequencing. Sampling variation can affect comparisons and abundance measurements. Sample handling introduces several challenges, as some samples must be delivered anaerobically: exposure to oxygen or freezing can change the composition of a given microbial community. For example, freezing, thawing and subsequent bead-beating can affect the cell wall of Gram-positive bacteria and introduce artifacts compared to extraction performed on fresh samples.
Kits frequently used for DNA extraction from environmental samples include:
If the target community is associated with a host, e.g. human or plant, then physical fractionation or selective lysis can be employed to ensure host DNA is kept to a minimum. Host material can also be removed during bioinformatics filtering and mapping. Regardless of the approach used, it’s important to remember that extraction and isolation methods can introduce bias in terms of microbial diversity, yield and fragment lengths. It’s highly recommended that the exact same extraction method be used when comparing samples.
One of the biggest considerations in library preparation of environmental samples for shotgun metagenomic sequencing is amplification. Certain types of samples (water, swabs) yield small amounts of DNA, necessitating amplification during library preparation. PCR can amplify some fragments more than others, confounding abundance and microbial diversity measurements. Often the user does not have a choice when faced with low DNA input. Minimizing variability, constructing libraries together to reduce batch effects and keeping library preparation steps as consistent as possible between samples are good practice. If you’re able to extract enough DNA (~250-500 ng), an amplification-free library preparation method is recommended. The following library preparation kits are frequently used for metagenomics library preparation:
Shotgun metagenomic sequencing is unique in the sense that you’re trying to sequence a large, diverse pool of microbes, each with a different genome size, often mixed with host DNA. Current sequencing technologies offer a wide variety of read lengths and outputs. Illumina sequencing technology offers short reads (2x250 or 2x300 bp) but generates high sequencing depth. Longer reads are preferred because they help overcome short contigs and other difficulties during assembly. However, instruments that offer longer reads, e.g. PacBio and Oxford Nanopore, come with higher error rates, lower sequencing depth and higher costs. PacBio error rates can be reduced using circular consensus sequencing (CCS), which involves repeated sequencing of a circular template and generation of a consensus for the DNA insert. High-quality reads of 500-4000 bp can be generated with >99% (Q20) accuracy.
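The Q20 figure quoted above can be related to accuracy through the standard Phred scale, where a quality score Q corresponds to an error probability p = 10^(-Q/10). The short Python sketch below shows that conversion in both directions; the function names are illustrative and do not come from any vendor tool.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

On this scale Q20 means one expected error per 100 bases, i.e. 99% per-base accuracy, and each additional 10 points reduces the error probability tenfold.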
Setting costs aside and simply comparing a long PacBio read with a short Illumina read, PacBio reads can be expected to improve metagenomics assembly statistics and genome binning of difficult-to-assemble phenotypes. PacBio sequencing is recommended for isolates or in cases when you’re only interested in examining several abundant organisms. Illumina reads are recommended in metagenomics studies where the difference between rare and abundant cells is significant. A compromise many in the field now use is a hybrid of Illumina and PacBio reads. Hybrid assemblies using PacBio CCS reads and HiSeq contigs improve assembly stats, number of contigs and overall contig length. By combining both read types (PacBio and Illumina), you have a higher probability of achieving complete chromosomal closure. Rare microbial species will still have to rely on high-depth Illumina sequencing alone for proper assembly.
Assembly involves the merging of reads from the same genome into a single contiguous sequence (contig). Most available tools build upon a traditional de Bruijn graph approach to genome assembly. One of the biggest challenges in assembly is the generation of chimeras, where two sequences from different genomes or parts of the genome are incorrectly merged due to similar sequence composition. This is often mitigated by performing a binning step: assigning each metagenomic sequence to a taxonomic group and then assembling each bin independently. This helps reduce data complexity and the chance of chimeras.
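To make the de Bruijn graph idea concrete, here is a toy Python sketch, not a real assembler. It assumes error-free reads from a single strand, and the reads, the value of k and the function names are all made up for illustration. Each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is extended for as long as the path is unambiguous.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles (e.g. repeats)
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Hypothetical overlapping reads drawn from the toy sequence ATGGCGTGCAA
    toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
    graph = de_bruijn_graph(toy_reads, k=4)
    print(walk_unambiguous(graph, "ATG"))
```

In this sketch, a (k-1)-mer shared by two different genomes shows up as a node with more than one successor. That is exactly the point where an assembler could join sequences from different organisms into a chimera, and where binning before assembly reduces the ambiguity.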
Once assembled, genes can be predicted and functionally annotated. Genes are typically predicted in one of three ways: 1) de novo gene prediction, 2) protein family classification, 3) fragment recruitment (binning). Functional annotation is performed by classifying predicted metagenomic proteins into protein families using sequence or hidden Markov model (HMM) databases. Frequently used sequence databases for functional annotation include:
HMM databases for metagenomics analysis are usually limited to Pfam, which uses HMMs to model protein domains.
Metagenomics is the study of the functional genomes of microbial communities while 16S sequencing offers a phylogenetic survey on the diversity of a single ribosomal gene, 16S rRNA. We feel it’s necessary to explicitly state this as ‘metagenomics’ and ‘16S rRNA’ are often incorrectly used interchangeably.
The 16S rRNA gene is a taxonomic genomic marker that is common to almost all bacteria and archaea. The marker allows one to examine genetic diversity in microbial communities, specifically what microbes are present in a sample. While some estimates of relative abundance within similar samples can be made, drawing conclusions across different sample types is not recommended due to amplification artifacts introduced during PCR.
16S rRNA sequencing is accomplished by designing primers to the entire 16S locus or by targeting multiple hypervariable domains within the gene. The nine variable regions of the 16S rRNA gene are flanked by conserved stretches in the majority of bacteria. These conserved regions can be used as targets for PCR primers, designed upstream and downstream of the variable domains. The hypervariable regions provide the species-specific signature necessary for identification. After these domains have been amplified, sequencing-related primers are either ligated or added by a second PCR step.
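The primer logic described above can be illustrated with a toy in-silico PCR in Python. This is a sketch under strong assumptions: exact primer matches only (real 16S primers often contain degenerate bases), and the template and primer sequences are invented for illustration and are not real 16S sequences.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_amplicon(template, forward_primer, reverse_primer):
    """Return the region between the two primer sites (primers excluded).

    Returns None if either primer site is not found on the template.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    start += len(forward_primer)
    # The reverse primer anneals to the opposite strand, so on the template
    # strand its binding site appears as the reverse complement.
    end = template.find(reverse_complement(reverse_primer), start)
    if end == -1:
        return None
    return template[start:end]


if __name__ == "__main__":
    # Hypothetical conserved flanks around a hypothetical variable region
    conserved_left = "ACGTACGT"
    variable_region = "TTGACCA"
    conserved_right = "GGCCTTAA"
    template = "CC" + conserved_left + variable_region + conserved_right + "GG"

    forward = conserved_left
    reverse = reverse_complement(conserved_right)
    print(extract_amplicon(template, forward, reverse))
```

Because both primers sit in conserved sequence, the same primer pair works across many taxa, while the extracted region between them carries the variable signature used for identification.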
Despite the wide use of 16S sequencing, several factors limit proper interpretation of the data.
As sequencing costs drop, microbiome research is moving from 16S rRNA gene sequencing to more comprehensive functional representations via whole genome or shotgun metagenomics sequencing.
The short answer is there is no easy way to estimate read depth required for shotgun metagenomics sequencing. Environmental samples have a large distribution of species; each species would have to be accounted for individually. You would need to know the number of total species in the sample, the genome size and relative abundance for each species. In most cases this is not possible when you’re sequencing a sample for the first time.
Let’s assume you were dealing with a simple sample that had 10 bacterial species and wanted 100x coverage depth for de novo assembly. If each of your 10 bacterial species had an estimated genome size of 2 Mb, you’d aim for around 2 Gb of sequencing data per sample.
10 dominant bacterial species * 100x * 2 Mb = 2 Gb
This is a very simplistic calculation. In most metagenomics studies there are thousands or millions of species you need to contend with. If you’re sampling microbes where there is a host involved, then you also need to account for all the reads that will be lost to sequencing host DNA. Many of these reads will be removed by mapping, but they still need to be budgeted for.
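The back-of-the-envelope calculation above can be written as a small Python helper. This is only a sketch under the same simplifying assumptions (equal abundance and equal genome size for every species). The `host_fraction` parameter is a hypothetical addition that inflates the target to cover reads lost to host DNA, and the 50% value in the example is purely illustrative.

```python
def required_bases(n_species, coverage, genome_size_bp, host_fraction=0.0):
    """Estimate total bases to sequence per sample.

    Assumes every species is equally abundant and has the same genome size.
    host_fraction is the assumed share of reads that come from host DNA.
    """
    if not 0.0 <= host_fraction < 1.0:
        raise ValueError("host_fraction must be in [0, 1)")
    microbial_bases = n_species * coverage * genome_size_bp
    return microbial_bases / (1.0 - host_fraction)


if __name__ == "__main__":
    # The example from the text: 10 species * 100x * 2 Mb
    simple = required_bases(10, 100, 2_000_000)
    print(f"{simple / 1e9:.1f} Gb")  # 2.0 Gb

    # Same community, assuming half of all reads are host-derived
    with_host = required_bases(10, 100, 2_000_000, host_fraction=0.5)
    print(f"{with_host / 1e9:.1f} Gb")  # 4.0 Gb
```

In a real community abundances are uneven, so the rarest species of interest, not the average, determines how much sequencing is needed.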
If you need to perform de novo assembly, the best strategy for determining the coverage you need is to just go ahead and sequence a sample on one paired-end Illumina HiSeq lane. You can examine the results and determine whether they are sufficient for assembly. If you already have an assembly and need to measure abundances, you can start with one single-end Illumina MiSeq or NextSeq lane (~50M single reads).
If you’re performing de novo assembly, we recommend you start with a read length of 1x150 or 2x150 bp. While Illumina read lengths go up to 2x300, those are currently reserved for the MiSeq, which may not give you the depth needed on a single lane. If you’re only interested in measuring abundances, a 1x90 bp (1x75 NextSeq run as 1x90 bp) or 1x100 bp read length should be sufficient.
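To relate a read count and read length to sequencing yield, the rough Python sketch below multiplies them out. Pairing the ~50M single reads figure with a 1x90 bp read length is an assumption made for illustration, as is reusing the toy community of 10 species with 2 Mb genomes from the earlier example. The function names are hypothetical.

```python
def run_yield_bases(n_fragments, read_length_bp, paired=False):
    """Total bases produced: fragments * read length (* 2 for paired-end)."""
    reads_per_fragment = 2 if paired else 1
    return n_fragments * read_length_bp * reads_per_fragment


def mean_coverage(total_bases, n_species, genome_size_bp):
    """Average coverage if bases were spread evenly over all genomes."""
    return total_bases / (n_species * genome_size_bp)


if __name__ == "__main__":
    # ~50M single-end reads at 1x90 bp
    bases = run_yield_bases(50_000_000, 90)
    print(f"{bases / 1e9:.1f} Gb")  # 4.5 Gb

    # Spread evenly over a toy community of 10 species with 2 Mb genomes
    print(f"{mean_coverage(bases, 10, 2_000_000):.0f}x")  # 225x
```

The even-spread coverage is an upper bound on what a rare organism will receive, which is why examining the results of a first run is a more reliable guide than the arithmetic alone.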