Circular Consensus (CCS, aka “Reads of insert”) specification

Overview

Circular consensus sequencing (CCS) calculates consensus sequences from multiple “passes” around a single circularized DNA molecule (SMRTbell). CCS uses the Quiver framework to achieve optimal consensus results given the number of passes available.

Inputs

The Circular Consensus module reads basecall data directly from bax.h5 files (formerly bas.h5 files) supplied on the command line or via SMRT Portal. Adapter annotations in the bax.h5 file are used to find the subread intervals within the raw read.

Outputs

The Circular Consensus module emits standard FASTA/Q files and a ccs.h5 file (similar to a bax.h5 file) containing a sequence for each ZMW whose consensus sequence meets the quality filters. The filters ensure that the read made a minimum number of complete passes over the insert sequence (default: one full pass) and that the expected accuracy of the consensus sequence exceeds a threshold (default: 90%).

If a file named “m130524_055855_sherri_c100525322550000001823081109281363_s1_p0.1.bax.h5” is used as input to P_CCS, the workflow will output files named:

  • “m130524_055855_sherri_c100525322550000001823081109281363_s1_p0.1.fasta”
  • “m130524_055855_sherri_c100525322550000001823081109281363_s1_p0.1.fastq”
  • “m130524_055855_sherri_c100525322550000001823081109281363_s1_p0.1.ccs.h5”

in the job output data directory.
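
The naming convention above can be sketched in a few lines of Python. This is an illustration only, not part of the CCS tools: the function name ccs_output_names is hypothetical, and the sketch assumes the output names are formed by replacing the “.bax.h5” suffix, as the example suggests.

```python
import os

def ccs_output_names(bax_path):
    """Return the FASTA, FASTQ and ccs.h5 names derived from a bax.h5 name.

    Hypothetical helper: assumes the ".bax.h5" suffix is simply swapped
    for each output extension, as in the example file names.
    """
    base = os.path.basename(bax_path)
    suffix = ".bax.h5"
    if not base.endswith(suffix):
        raise ValueError("expected a .bax.h5 file: %r" % base)
    stem = base[:-len(suffix)]
    return [stem + ext for ext in (".fasta", ".fastq", ".ccs.h5")]
```

The returned names are relative; in a SMRT Pipe job they would be located in the job output data directory.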

Command-line interface

The command-line interface for invoking CCS within a SMRT Analysis installation is:

ConsensusTools.sh CircularConsensus
      -q <outFastq>          # default <outFastq> = <file>.fastq
      -f <outFasta>          # default <outFasta> = <file>.fasta
      --h5 <outH5>           # default <outH5>    = <file>.ccs.h5
      -n <numWorkers>        # Number of threads to use when processing ZMWs
      -c <chemistryMapping>  # Chemistry mapping xml file
      --minPredictedAccuracy # Requested accuracy threshold
      --minFullPasses        # Requested minimum number of passes
      <file>.bax.h5          # Input bax.h5 file

Instead of a single bax.h5, one can also provide a file of file names (“FOFN”) of individual bax.h5 files using the --fofn flag.

The chemistry_mapping.xml file is produced by SMRT Pipe and contains information about the sequencing chemistry used to generate the data. The tool will automatically select the appropriate Quiver parameters based on this information.

In the invocation shown above, the listed bax.h5 file is used as input, and FASTA, FASTQ, and ccs.h5 output files are produced.
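
For scripted use, the argument list can be assembled programmatically. The following Python sketch is illustrative only: build_ccs_command is a hypothetical helper, it uses only flags listed in the usage text, and it assumes that long options accept the --flag=value form and that -n takes its value as a separate argument. It builds the argument list without running anything.

```python
def build_ccs_command(bax_file, num_workers=None, min_full_passes=None,
                      min_predicted_accuracy=None):
    """Assemble an argument list for ConsensusTools.sh CircularConsensus.

    Hypothetical helper. Options left as None are omitted so that the
    tool's own defaults apply. Values are passed through unchanged; the
    units expected by --minPredictedAccuracy are not assumed here.
    """
    cmd = ["ConsensusTools.sh", "CircularConsensus"]
    if num_workers is not None:
        cmd += ["-n", str(num_workers)]
    if min_full_passes is not None:
        cmd.append("--minFullPasses=%d" % min_full_passes)
    if min_predicted_accuracy is not None:
        cmd.append("--minPredictedAccuracy=%s" % min_predicted_accuracy)
    cmd.append(bax_file)  # the input bax.h5 file comes last
    return cmd
```

The resulting list could then be handed to subprocess.run from the Python standard library on a machine where ConsensusTools.sh is on the PATH.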

Algorithm Description

The PacBio RS produces sequencing data by reading a circular SMRTbell molecule containing the DNA insert of interest, flanked by hairpin adapters. During primary analysis, the raw read (“polymerase read”) is segmented by identifying the locations of adapter sequence. The segments between adapter hits, which correspond to the insert DNA sequence, are excised as “subreads” and are used as the starting point for CCS analysis.

A subread is sometimes also called a “pass” because it represents the sequence read from a single pass of the polymerase across the insert sequence. A subread is called a “full pass” if it is flanked on both ends by adapter sequence. Otherwise it is called a “partial pass.”
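
The segmentation and the full/partial distinction can be illustrated with a short Python sketch. This is not the primary analysis code: the function name, the half-open (start, end) interval representation, and the coordinates in the usage example are assumptions made for illustration.

```python
def split_into_subreads(read_start, read_end, adapters):
    """Split a polymerase read into subreads at adapter hits.

    adapters: non-overlapping (start, end) half-open intervals of adapter
    hits lying within [read_start, read_end).
    Returns a list of (start, end, is_full_pass) tuples; a subread is a
    full pass only if it has an adapter hit on both sides.
    """
    subreads = []
    cursor = read_start
    adapter_on_left = False
    for a_start, a_end in sorted(adapters):
        if a_start > cursor:
            # This segment ends at an adapter, so it is a full pass
            # exactly when it also began at one.
            subreads.append((cursor, a_start, adapter_on_left))
        cursor = a_end
        adapter_on_left = True
    if cursor < read_end:
        # The trailing segment has no adapter on its right: partial pass.
        subreads.append((cursor, read_end, False))
    return subreads
```

For a read spanning positions 0 to 1000 with adapter hits at (100, 145), (400, 445) and (700, 745), the sketch yields two partial passes at the ends of the read and two full passes between the adapters.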

The subread intervals of the raw read are determined using the adapter annotations stored in the bax.h5 file. The subreads are loaded into the Quiver consensus calling framework, which iteratively refines the consensus sequence using a PacBio-specific error model and the rich per-base QVs (InsertionQV, DeletionQV, MergeQV) contained in the bax.h5 file. For more details on the Quiver method, see our HGAP publication:

http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html

A note regarding subreads and --minFullPasses

The --minFullPasses flag allows control over how many (full pass) subreads are required in order for a consensus read to be output.

  • --minFullPasses=0 allows consensus to be produced for molecules with as few as one partial subread
  • --minFullPasses=1 allows consensus to be produced for molecules with as few as one full subread
  • --minFullPasses=2 requires at least two full subreads... and so on.

The consensus for a single subread is the subread itself, just as the mean of a one-item list is that item.

Note that --minFullPasses is just one among many filters available. Using --minFullPasses={0,1} may not result in any single-subread consensus reads if the --minPredictedAccuracy filter is set higher than the average single-pass accuracy of the sequencing chemistry.
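
The interaction between the two filters can be sketched as a simple predicate. This is a minimal illustration, not the tool's implementation: the function name is hypothetical, accuracy is expressed as a fraction between 0 and 1, the comparison is assumed to be inclusive, and the defaults mirror those stated in the Outputs section (one full pass, 90% accuracy).

```python
def passes_filters(num_full_passes, predicted_accuracy,
                   min_full_passes=1, min_predicted_accuracy=0.90):
    """Return True if a consensus read would be kept by both filters.

    Hypothetical helper: predicted_accuracy is a fraction (0.90 = 90%),
    and both thresholds are treated as inclusive lower bounds.
    """
    return (num_full_passes >= min_full_passes
            and predicted_accuracy >= min_predicted_accuracy)

# A single-subread "consensus" with a hypothetical predicted accuracy of
# 0.87 is rejected even with min_full_passes=0, because the accuracy
# filter still applies.
single_pass_kept = passes_filters(1, 0.87, min_full_passes=0)
```

Lowering min_predicted_accuracy in this sketch is what lets such single-subread reads through, which matches the behavior described above.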

SMRT Portal interface

Users of SMRT Portal can interface with CCS via the RS_ReadsOfInsert.1.xml protocol. Additionally, SMRT Portal provides a protocol called RS_ReadsOfInsert_Mapping.1.xml, which performs a subsequent mapping step.

SMRT Portal parameters

The following parameters are exposed in SMRT Portal:

  • minFullPasses, allowing the user to specify the minimum number of complete (full-pass) subreads required to produce an output consensus read.
  • minPredictedAccuracy, allowing the user to provide a minimum estimated accuracy, as a percentage, for a consensus read to be output.
  • minLength and maxLength, for specifying the minimum and maximum length of output CCS reads.

SMRT Portal reports

The reports generated in SMRT Portal for a CCS analysis include:

  • a summary table of mean insert length, mean number of passes, and mean estimated consensus accuracy, for each movie
  • a histogram of read lengths
  • a histogram of estimated consensus accuracy (“read quality”)
  • a histogram of the number of passes (subreads) per CCS read.

SMRT Pipe interface

Users of SMRT Pipe can interface with CCS via the P_CCS module.