RS_HGAP_Assembly.3 Protocol
Use this protocol to perform high-quality de novo assembly using a single PacBio library prep. (The protocol is optimized for speed and is faster than RS_HGAP_Assembly.2 during assembly.)
This protocol incorporates a 10-fold speed improvement for microbial assembly.
It consists of pre-assembly, de novo assembly with PacBio's AssembleUnitig, and assembly polishing with Quiver. (PacBio's AssembleUnitig module replaces the most time-consuming step in Celera® Assembler.)
Filtering Parameters (PreAssembler Filter v1)
- Minimum Subread Length: Subreads shorter than this value (in base pairs) are filtered out and excluded from analysis.
- Minimum Polymerase Read Quality: Polymerase reads with lower quality than this value are filtered out and excluded from analysis.
- Minimum Polymerase Read Length: Polymerase reads shorter than this value (in base pairs) are filtered out and excluded from analysis.
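
The three filters above can be sketched as follows. This is a minimal illustration, not the PreAssembler Filter implementation: the `PolymeraseRead` class, the `filter_subreads` function, the 0.0 to 1.0 quality scale, and the threshold values in the usage example are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PolymeraseRead:
    """Hypothetical record for one polymerase read (not a PacBio data type)."""
    name: str
    length: int                  # polymerase read length, in base pairs
    quality: float               # read quality, assumed to be on a 0.0-1.0 scale
    subread_lengths: List[int]   # length of each subread, in base pairs


def filter_subreads(
    reads: List[PolymeraseRead],
    min_subread_length: int,
    min_read_quality: float,
    min_read_length: int,
) -> List[Tuple[str, int, int]]:
    """Return (read name, subread index, subread length) for subreads that pass."""
    kept = []
    for read in reads:
        # Polymerase-read filters: drop the whole read, and all of its subreads.
        if read.quality < min_read_quality or read.length < min_read_length:
            continue
        # Subread filter: drop only the individual subreads that are too short.
        for index, sub_len in enumerate(read.subread_lengths):
            if sub_len >= min_subread_length:
                kept.append((read.name, index, sub_len))
    return kept


if __name__ == "__main__":
    reads = [
        PolymeraseRead("r1", 5000, 0.85, [2000, 300, 2500]),
        PolymeraseRead("r2", 4000, 0.60, [4000]),   # fails the quality filter
        PolymeraseRead("r3", 80, 0.90, [80]),       # fails the read-length filter
    ]
    # Thresholds are illustrative values only, not documented defaults.
    print(filter_subreads(reads, 500, 0.80, 100))
```

In this sketch the polymerase-read filters are applied first, so a long subread from a low-quality polymerase read is still excluded.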
Assembly Parameters (PreAssembler v2)
- Compute Minimum Seed Read Length: Specify whether or not to compute the minimum seed read length automatically. The computed value is the length cutoff at which the longest subreads provide at least 30X coverage of the target genome, based on the genome size you specified.
- Minimum Seed Read Length: The minimum length of reads (in base pairs) to use as seeds for pre-assembly.
- Number of Seed Read Chunks: The number of pieces to split the data files into while running PreAssembler.
- Alignment Candidates Per Chunk: The number of alignments to consider for each read for a particular chunk.
- Total Alignment Candidates: The number of potential alignments BLASR should consider across all chunks for a particular read.
- Minimum Coverage for Correction: The minimum coverage required to keep correcting a read. If the coverage falls below this threshold, the read is broken at that point. Relaxing this constraint yields longer corrected reads, but risks allowing more chimeras through to assembly.
- BLASR Options (Advanced): The -bestn and -nCandidates values should be roughly equal to the expected seed read coverage.
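
The automatic seed length computation described above (the cutoff at which the longest subreads reach 30X coverage of the genome) can be sketched as follows. This is an illustration of the idea only, under stated assumptions: the function name `compute_min_seed_length` is hypothetical, and returning `None` when the data cannot reach the target coverage is a choice made for this sketch, not documented PreAssembler behavior.

```python
from typing import Iterable, Optional


def compute_min_seed_length(
    subread_lengths: Iterable[int],
    genome_size: int,
    target_coverage: int = 30,
) -> Optional[int]:
    """Return the largest length cutoff such that subreads at least that long
    total at least target_coverage * genome_size bases, or None if the data
    set is too small to reach the target."""
    required_bases = target_coverage * genome_size
    total = 0
    # Accumulate bases from the longest subread downward.
    for length in sorted(subread_lengths, reverse=True):
        total += length
        if total >= required_bases:
            # This subread is the shortest one needed to reach the target.
            return length
    return None


if __name__ == "__main__":
    lengths = [10000, 8000, 6000, 4000, 2000]
    # Toy numbers: a 500 bp "genome" needs 15,000 bases for 30X coverage.
    print(compute_min_seed_length(lengths, genome_size=500))
```

With the toy input, the two longest subreads (10,000 + 8,000 bases) are enough to pass 15,000 bases, so the cutoff is the length of the second one.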
Assembly Parameters (AssembleUnitig v1)
- Genome Size (Bp): The approximate genome size, in base pairs.
- Target Coverage: The fold coverage to target when picking the minimum fragment length for assembly. (This value is typically 15 to 25.)
- Overlapper Error Rate: Trimming and assembly overlaps with an error rate above this limit are not detected.
- Overlapper Min Length: Overlaps shorter than this length (in base pairs) are not computed.
- Overlapper K-Mer: The length of the seeds (in base pairs) used by the seed-and-extend algorithm.
- Pre-defined Spec File: The path to an existing specification file used to run the assembly program. (The P_CeleraAssembler module auto-generates the specification file based on the input data and selected parameters, but you can also provide an explicit specification file.)
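
To make the three overlapper parameters above more concrete, here is a small sketch of how a seed length and the two overlap thresholds interact. It is not the AssembleUnitig or Celera Assembler overlapper: `kmer_seeds`, `Overlap`, and `usable_overlaps` are hypothetical names, exact k-mer matching is a simplification, and the error rate is assumed to be expressed as a fraction.

```python
from typing import Dict, List, NamedTuple, Tuple


def kmer_seeds(read_a: str, read_b: str, k: int) -> List[Tuple[int, int]]:
    """Return (position in read_a, position in read_b) for every exact shared
    k-mer. A seed-and-extend overlapper would extend alignments from these."""
    index: Dict[str, List[int]] = {}
    for i in range(len(read_a) - k + 1):
        index.setdefault(read_a[i:i + k], []).append(i)
    seeds = []
    for j in range(len(read_b) - k + 1):
        for i in index.get(read_b[j:j + k], []):
            seeds.append((i, j))
    return seeds


class Overlap(NamedTuple):
    """Hypothetical candidate overlap between two reads."""
    read_a: str
    read_b: str
    length: int         # overlap length, in base pairs
    error_rate: float   # assumed to be a fraction, e.g. 0.06 for 6%


def usable_overlaps(
    overlaps: List[Overlap], max_error_rate: float, min_length: int
) -> List[Overlap]:
    """Keep overlaps that are long enough and within the error limit."""
    return [
        o for o in overlaps
        if o.length >= min_length and o.error_rate <= max_error_rate
    ]


if __name__ == "__main__":
    # A longer k gives fewer, more specific seeds.
    print(kmer_seeds("ACGTACGGA", "TTACGGAC", 4))
    print(kmer_seeds("ACGTACGGA", "TTACGGAC", 6))
    candidates = [
        Overlap("a", "b", 1200, 0.03),
        Overlap("a", "c", 30, 0.01),     # too short
        Overlap("b", "c", 900, 0.12),    # too many errors
    ]
    # Thresholds are illustrative values only.
    print(usable_overlaps(candidates, max_error_rate=0.06, min_length=40))
```

The trade-off shown by the two `kmer_seeds` calls is the usual one: shorter seeds find more candidate overlaps but cost more to check, while longer seeds can miss overlaps between reads with errors.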
Mapping Parameters (BLASR v1)
- Maximum Divergence (%): The maximum allowed divergence of a read from the reference sequence.
- Minimum Anchor Size: The minimum size of the read (in base pairs) that must match against the reference sequence.
- Write Output As a BAM File: Specify whether or not to output a BAM representation of the cmp.h5 file.
- Write BED Coverage File: Specify whether or not to output a BED representation of the depth of coverage summary.
- Place Repeats Randomly: If BLASR maps a read to more than one location with equal probability, it randomly selects one of those locations as the best location. If this option is not set, BLASR defaults to the first location on the list of matches.
- Advanced Pbalign Options: Allows you to pass non-standard parameters to the underlying pbalign.py script. Use this option with care!
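
The effect of Place Repeats Randomly can be sketched as a tie-breaking rule. The code below is an illustration, not BLASR's implementation: the `Hit` type, the `pick_best_hit` function, and the "higher score is better" convention are assumptions made for this example.

```python
import random
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical candidate mapping location for one read."""
    position: int   # start coordinate on the reference
    score: float    # assumed convention for this sketch: higher is better


def pick_best_hit(
    hits: List[Hit],
    place_repeats_randomly: bool,
    rng: Optional[random.Random] = None,
) -> Optional[Hit]:
    """Choose the reported location among equally good hits."""
    if not hits:
        return None
    best_score = max(h.score for h in hits)
    tied = [h for h in hits if h.score == best_score]
    if place_repeats_randomly and len(tied) > 1:
        # Option set: pick one of the tied locations at random.
        rng = rng or random.Random()
        return rng.choice(tied)
    # Option not set: report the first location on the list of matches.
    return tied[0]


if __name__ == "__main__":
    hits = [Hit(100, 50.0), Hit(900, 50.0), Hit(400, 20.0)]
    print(pick_best_hit(hits, place_repeats_randomly=False))
    print(pick_best_hit(hits, place_repeats_randomly=True))
```

Without random placement, every read from a two-copy repeat lands on the first copy, which piles coverage onto one copy and leaves the other empty. Random placement spreads those reads across both copies.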
Consensus Parameters (AssemblyPolishing v1 Quiver)
- Use Only Unambiguously Mapped Reads: Specify whether or not to filter out reads whose Map QV is less than 10. This reduces coverage in repeat regions that are shorter than the read length. You might want to uncheck this option in de novo assembly projects, but we recommend that you leave the option checked for variant calling applications. (Map QV is a single Phred-scaled QV per aligned read; it reflects the mapper's degree of certainty that a read aligned to this part of the reference and not some other. Unambiguously mapped reads have a high Map QV; reads that are equally likely to have come from two parts of the reference have a low Map QV.)
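
Because Map QV is Phred-scaled, the threshold of 10 corresponds to a 10% chance that the read is mapped to the wrong place, using the standard Phred relation QV = -10 * log10(P). The sketch below shows that conversion and the filter; the function names and the (name, Map QV) tuple layout are hypothetical and chosen for this example.

```python
import math
from typing import List, Tuple


def mapqv_to_error_probability(map_qv: float) -> float:
    """Phred scale: probability that the mapping location is wrong."""
    return 10 ** (-map_qv / 10)


def error_probability_to_mapqv(probability: float) -> float:
    """Inverse of the conversion above."""
    return -10 * math.log10(probability)


def keep_unambiguous(
    reads: List[Tuple[str, int]], min_map_qv: int = 10
) -> List[Tuple[str, int]]:
    """Filter out reads whose Map QV is less than min_map_qv.

    Each read is a hypothetical (name, Map QV) pair.
    """
    return [read for read in reads if read[1] >= min_map_qv]


if __name__ == "__main__":
    for qv in (3, 10, 30):
        print(qv, round(mapqv_to_error_probability(qv), 4))
    reads = [("unique_hit", 254), ("two_copy_repeat", 3), ("borderline", 10)]
    print(keep_unambiguous(reads))
```

A read that fits two locations equally well has roughly a 50% chance of being misplaced, which is a Map QV of about 3, so it falls below the threshold and is removed when the option is checked.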