RS_Long_Amplicon_Analysis Protocol

Use this protocol to determine phased consensus sequences for pooled amplicon data.

The protocol allows accurate allelic phasing and variant calling in large genomic intervals.
It supports the analysis of novel haplotypes in biomedical loci of interest, such as the HLA region in the human genome.
You can pool up to 5 distinct amplicons. Reads are clustered into high-level groups, then each group is phased and a consensus is called using the Quiver algorithm.
If the sample is barcoded, reads can optionally be split by barcode.

The protocol includes four main steps:

Coarse clustering: Group reads from different amplicons into different clusters. The protocol detects read-to-read overlaps, builds an overlap graph, then clusters that graph to produce the final clusters.
Phasing: Load the reads for each cluster into the Quiver consensus software and find an initial consensus. Recursively split reads from different haplotypes or other PCR products based on high-scoring mutations proposed by Quiver.
Consensus: Generate a final consensus for each haplotype or PCR product using Quiver.
Post-Processing Filters: Detect and remove PCR artifacts. Chimeric sequences are identified using the UCHIME algorithm, and other PCR artifacts are identified by overall consensus quality.
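
The coarse clustering step can be illustrated with a small sketch. This is not the protocol's implementation: it assumes that two reads "overlap" when they share a few k-mers, and it substitutes simple connected components (union-find) for the Markov clustering the protocol uses. The function names and the k and min_shared thresholds are hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def coarse_cluster(reads, k=8, min_shared=3):
    """Group read indices into connected components of a toy overlap graph.

    An edge joins two reads that share at least min_shared k-mers.
    This stands in for real overlap detection and Markov clustering.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(read, k) for read in reads]
    for a, b in combinations(range(len(reads)), 2):
        if len(kmer_sets[a] & kmer_sets[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Each returned list holds the indices of reads that would then be phased together. A real overlap graph carries weighted edges derived from alignments, which connected components ignore.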

Barcode Parameters (Barcode Module v1)

Paired: A pair of barcode sequences that always occur together, with a different barcode on each end of the insert sequence.
Symmetric: A single barcode sequence that occurs at both ends of the insert sequence.

Barcode FASTA File: Path to the FASTA file containing the barcodes to use.
Minimum Barcode Score: The minimum score for calling a barcode. This must be between 0 and 2 times the length of the barcode.
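
The allowed range for Minimum Barcode Score can be checked before a job is submitted. The sketch below is illustrative only: the function name is hypothetical, and treating both bounds as inclusive is an assumption, since the text says only "between 0 and 2 times the length of the barcode".

```python
def validate_min_barcode_score(min_score, barcode_length):
    """Return min_score if it lies in 0..2*barcode_length, else raise ValueError.

    Inclusive bounds are an assumption, not something the help text states.
    """
    upper = 2 * barcode_length
    if not 0 <= min_score <= upper:
        raise ValueError(
            f"Minimum Barcode Score {min_score} is outside 0..{upper} "
            f"for a barcode of length {barcode_length}"
        )
    return min_score
```

For a 16-base barcode, for example, the upper bound is 32, so a score of 22 passes and a score of 33 is rejected.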

Amplicon Parameters (LongAmpliconAnalysis v1)

Minimum Subread Length: Subreads shorter than this value are filtered out and excluded from analysis. (This value should be shorter than your shortest amplicon.)
Maximum Number of Subreads: The maximum number of subreads to use for read clustering and consensus. This should be at least 200 times the number of distinct sequences expected. (The number of distinct sequences is the amplicon count times the organism's ploidy.)
Ignore Primer Sequences When Clustering: Specify the lengths of the primers to ignore when finely clustering subreads. (This is useful when mutations at the ends of your amplicons, possibly introduced by degenerate primers, cause excessive splitting.)
Trim Ends of Sequences: Optionally specify the number of bases to trim from the ends of consensus sequences.
Provide Only The Most Supported Sequences: The number of best-supported sequences to report from each coarse cluster. (0 disables the filter.)
Coarse Cluster Subreads by Gene Family: Specify whether or not to perform Markov clustering of subreads into rough gene families. If this is not set, all subreads are grouped into one gene family.
Phase Alleles: Specify whether or not to separate highly-similar alleles using phasing analysis.
Split Results from Each Barcode into Independent Output Files: Specify whether or not to split the results from each barcode into two separate zip files containing FASTA and FASTQ files.
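
Two of the parameters above reduce to simple arithmetic and filtering, sketched below. The function names are hypothetical and this is not the protocol's own code; it only restates the rules given for Maximum Number of Subreads and Minimum Subread Length.

```python
def min_recommended_subreads(amplicon_count, ploidy, per_sequence=200):
    """Lower bound for Maximum Number of Subreads.

    Distinct sequences expected = amplicon count * organism ploidy,
    and the help text asks for at least 200 subreads per distinct sequence.
    """
    return per_sequence * amplicon_count * ploidy


def filter_subreads(subreads, min_length):
    """Drop subreads shorter than min_length and keep the rest in order."""
    return [s for s in subreads if len(s) >= min_length]
```

With 5 pooled amplicons from a diploid organism, min_recommended_subreads(5, 2) gives 2000, so Maximum Number of Subreads should be set to 2000 or more.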

SMRT® Portal Help