RS_Long_Amplicon_Analysis Protocol
Use this protocol to determine phased consensus sequences for pooled
amplicon data.
- Allows for accurate allelic phasing and variant calling in large
genomic intervals.
- Supports the analysis of novel haplotypes in biomedical loci of
interest, such as the HLA region in the human genome.
Up to 5 distinct amplicons can be pooled.
Reads are clustered into high-level groups, then each group is phased
and consensus is called using the Quiver algorithm.
If the sample is barcoded, reads can optionally be split by barcode.
The protocol includes four main steps:
- Coarse clustering: Group reads from different amplicons into different
clusters; detect read-read overlaps; build an overlap graph, then
cluster the overlap graph to break the graph into the final clusters.
- Phasing: Load the reads for each cluster into the Quiver consensus
software and find an initial consensus. Recursively split reads from
different haplotypes or other PCR products based on high-scoring mutations
proposed by Quiver.
- Consensus: Generate a final consensus for each haplotype or PCR product
using Quiver.
- Post-Processing Filters: Detect and remove PCR artifacts. Chimeric
sequences are identified using the UCHIME algorithm, and other PCR
artifacts are identified by overall consensus quality.
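
The coarse clustering step described above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the protocol's implementation: shared k-mer counting stands in for real read-read overlap detection, connected components stand in for the graph clustering the protocol performs, and the function names, thresholds, and read sequences are all hypothetical.

```python
from collections import defaultdict

def shared_kmers(a, b, k=8):
    """Count distinct k-mers shared by two sequences (a crude overlap proxy)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)

def coarse_clusters(reads, k=8, min_shared=5):
    """Group reads by connected components of a read-read overlap graph."""
    # Build the overlap graph: an edge joins two reads sharing enough k-mers.
    graph = defaultdict(set)
    names = list(reads)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_kmers(reads[a], reads[b], k) >= min_shared:
                graph[a].add(b)
                graph[b].add(a)
    # Break the graph into clusters with a depth-first traversal.
    seen, clusters = set(), []
    for name in names:
        if name in seen:
            continue
        stack, component = [name], []
        seen.add(name)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters

# Made-up reads: read1/read2 mimic one amplicon, read3/read4 another.
reads = {
    "read1": "ACGTACGGTCATTGCAAGCTTAGGCTA",
    "read2": "ACGTACGGTCATTGCAAGCTTAGGCTT",
    "read3": "TTGACCGGATATCCGGTTAACCGGTAT",
    "read4": "TTGACCGGATATCCGGTTAACCGGTAA",
}
print(coarse_clusters(reads))
# prints [['read1', 'read2'], ['read3', 'read4']]
```

Reads from the same amplicon end up in one cluster because they are linked in the graph, while reads from different amplicons share no edges. Each resulting cluster would then go on to the phasing and consensus steps.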
Barcode Parameters (Barcode Module v1)
- My Library has DNA Barcodes That Are:
  - Paired: A pair of barcode sequences that always occur together, and
  are different on each end.
  - Symmetric: A single barcode sequence that occurs at both ends of
  the insert sequence.
- Barcode FASTA File: Path to the FASTA file containing the barcodes
to use.
- Minimum Barcode Score:
The minimum score for calling a barcode. This must be between 0 and
2 times the length of the barcode.
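
The allowed range for Minimum Barcode Score follows directly from the barcode length, so it can be checked before a job is submitted. The sketch below is a hypothetical helper, not part of the Barcode Module: it assumes all barcodes in the FASTA file have the same length, and the barcode names and sequences are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dict."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def valid_score_range(barcodes):
    """Return (0, 2 * barcode length); assumes equal-length barcodes."""
    lengths = {len(seq) for seq in barcodes.values()}
    if len(lengths) != 1:
        raise ValueError("barcodes have differing lengths: %s" % sorted(lengths))
    return 0, 2 * lengths.pop()

def check_min_barcode_score(score, barcodes):
    """Raise ValueError if the score is outside the allowed range."""
    low, high = valid_score_range(barcodes)
    if not low <= score <= high:
        raise ValueError("score %s is outside the range %d to %d" % (score, low, high))
    return score

# Made-up 16-base barcodes, given inline instead of read from a file.
fasta_text = """>bc_example_1
TCAGACGATGCGTCAT
>bc_example_2
CTATACATGACTCTGC
"""
barcodes = read_fasta(fasta_text.splitlines())
print(valid_score_range(barcodes))  # prints (0, 32)
```

For 16-base barcodes the score must therefore fall between 0 and 32. To check a real barcode file, pass an open file handle to `read_fasta` in place of the inline text.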
Amplicon Parameters (LongAmpliconAnalysis v1)
- Minimum Subread Length: Subreads shorter than this value are filtered
out and excluded from analysis. (This value should be shorter than your
shortest amplicon.)
- Maximum Number of Subreads:
The maximum number of subreads to use for read clustering and consensus.
This should be at least 200 times the number of distinct sequences
expected. (This is amplicon count times organism ploidy.)
- Ignore Primer Sequences When Clustering:
Specify the lengths of the primers to ignore when finely clustering
subreads. (This is useful when you have excessive splitting caused
by mutations at the end of your amplicons, possibly caused by degenerate
primers.)
- Trim Ends of Sequences:
Optionally specify the number of bases to trim from the ends of consensus
sequences.
- Provide Only The Most Supported Sequences: The number of best-supported
sequences to report from each coarse cluster. (0 disables the filter.)
- Coarse Cluster Subreads by Gene Family: Specify whether or not to
perform Markov clustering of subreads into rough gene families. If this
is not set, all subreads are grouped into one gene family.
- Phase Alleles: Specify whether or not to separate highly similar
alleles using phasing analysis.
- Split Results from Each Barcode into Independent Output Files: Specify
whether or not to split the results from each barcode into two separate
zip files containing FASTA and FASTQ files.
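
The sizing guidelines for Minimum Subread Length and Maximum Number of Subreads are simple arithmetic, so a quick check can catch settings that conflict with them. The following is a minimal sketch with hypothetical function names and made-up amplicon lengths. It only restates the rules given above: at most 5 pooled amplicons, at least 200 subreads per expected distinct sequence, and a minimum subread length shorter than the shortest amplicon.

```python
MAX_POOLED_AMPLICONS = 5      # pooling limit stated in the protocol description
SUBREADS_PER_SEQUENCE = 200   # guideline multiplier from the parameter text

def recommended_min_subreads(amplicon_count, ploidy):
    """Lower bound for 'Maximum Number of Subreads': 200 x amplicons x ploidy."""
    distinct_sequences = amplicon_count * ploidy
    return SUBREADS_PER_SEQUENCE * distinct_sequences

def check_settings(max_subreads, min_subread_length, amplicon_lengths, ploidy):
    """Return a list of warnings for settings that conflict with the guidelines."""
    warnings = []
    if len(amplicon_lengths) > MAX_POOLED_AMPLICONS:
        warnings.append("more than 5 distinct amplicons are pooled")
    floor = recommended_min_subreads(len(amplicon_lengths), ploidy)
    if max_subreads < floor:
        warnings.append(
            "maximum subreads %d is below the recommended %d" % (max_subreads, floor)
        )
    if min_subread_length >= min(amplicon_lengths):
        warnings.append(
            "minimum subread length is not shorter than the shortest amplicon"
        )
    return warnings

# Made-up example: five pooled amplicons (lengths in bases), diploid organism.
amplicons = [3100, 3400, 4200, 5000, 5600]
print(recommended_min_subreads(len(amplicons), ploidy=2))  # prints 2000
print(check_settings(1500, 3000, amplicons, ploidy=2))
```

With five amplicons from a diploid organism, 10 distinct sequences are expected, so Maximum Number of Subreads should be at least 2000. The second call reports that 1500 is too low.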