pbalign is a tool for aligning PacBio reads to reference sequences. It is part of the PacBio Bioinformatics tools, and will be bundled in the 2.1 release of SMRTanalysis. You may also follow the instructions below to install pbalign.
Note: the pseudo namespace pbtools has been removed in version 0.2.0.
Note: program name has been changed from `pbalign.py` in version 0.1.0 to `pbalign` in version 0.2.0.
Note: please install this software on an isolated machine that does not have SMRTanalysis installed.
pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specified filtering criteria, and converts the output to either SAM format or PacBio Compare HDF5 format (e.g., .cmp.h5). The output Compare HDF5 file will be compatible with Quiver if the --forQuiver option is specified.
pbalign is available through the pbalign script from the pbalign package. To use pbalign, the following PacBio software is required,
The following software is optionally required if the --forQuiver option will be used to make the output Compare HDF5 file compatible with Quiver.
- pbh5tools.cmph5tools, a PacBio Bioinformatics tool that manipulates Compare HDF5 files.
- h5repack, an HDF5 tool to compress and repack HDF5 files.
The default aligner that pbalign uses is blasr. If you want to use bowtie2 as the aligner, then the bowtie2 package also needs to be installed.
If you are within PacBio, these requirements are already installed within the cluster environment.
Otherwise, you will need to install them yourself.
pbalign distinguishes input and output file formats by file extensions.
The input PacBio reads can be in FASTA, Base HDF5, Pulse HDF5, Circular Consensus Sequence (CCS) HDF5, or file of file names (FOFN) format. The supported input file extensions are as follows.
The input reference sequences can be in a FASTA file or a reference deposit directory created by referenceUploader (a PacBio tool for uploading references to the server and data preprocessing).
The output file can be either a SAM file or a Compare HDF5 file. The output Compare HDF5 file cannot be consumed by Quiver directly unless the --forQuiver option is specified. The supported output file extensions are as follows.
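As a rough illustration of extension-based format detection, the sketch below maps the extensions named in pbalign's help text and examples (fasta, bas.h5, bax.h5, pls.h5, ccs.h5, fofn, sam, cmp.h5) to formats. The helper name `guess_format` and both tables are assumptions made for illustration; they are not pbalign's actual implementation, and the real list of supported extensions may be longer.

```python
# Hypothetical lookup tables: the extensions are the ones mentioned in
# pbalign's help text and examples, not taken from pbalign's source code.
INPUT_EXTENSIONS = {
    ".fasta": "FASTA",
    ".bas.h5": "Base HDF5",
    ".bax.h5": "Base HDF5",
    ".pls.h5": "Pulse HDF5",
    ".ccs.h5": "CCS HDF5",
    ".fofn": "FOFN",
}
OUTPUT_EXTENSIONS = {
    ".sam": "SAM",
    ".cmp.h5": "Compare HDF5",
}


def guess_format(file_name, table):
    """Return the format whose extension ends file_name, or None."""
    for extension, file_format in table.items():
        if file_name.endswith(extension):
            return file_format
    return None


print(guess_format("example.cmp.h5", OUTPUT_EXTENSIONS))
print(guess_format("your_movie.bas.h5", INPUT_EXTENSIONS))
```

Matching on the full compound suffix (for example `.cmp.h5` rather than `.h5`) is what lets the different HDF5 flavours be told apart in this sketch.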
To install Python 2.7, please visit
http://www.python.org/
, or if you have root permission on Ubuntu, execute
sudo apt-get install python
To install pip, please visit
https://pypi.python.org/pypi/pip
, or if you have root permission on Ubuntu, execute
sudo apt-get install python-pip
To install virtualenv, please visit
https://pypi.python.org/pypi/virtualenv
, or execute
pip install virtualenv
To set up a new virtualenv, do
$ cd; virtualenv -p python2.7 --no-site-packages my_env
, and activate the virtualenv using
$ source ~/my_env/bin/activate
To install git, please visit
http://git-scm.com/.
To install blasr, please execute
$ git clone https://github.com/PacificBiosciences/blasr
, and follow instructions at
https://github.com/PacificBiosciences/blasr/blob/master/README.md
Before installing pbcore, you may need to install numpy and h5py from
http://www.numpy.org/
https://code.google.com/p/h5py/
, or if you have root permission on Ubuntu, do
$ pip install numpy
$ sudo apt-get install libhdf5-serial-dev
$ pip install h5py
To install pbcore, execute
$ pip install git+https://github.com/PacificBiosciences/pbcore
To install pbh5tools, execute
$ pip install git+https://github.com/PacificBiosciences/pbh5tools
To install HDF5 tools, visit
http://www.hdfgroup.org/products/hdf5_tools/
, or if you have root permission on Ubuntu, do
$ sudo apt-get install hdf5-tools
To uninstall pbalign, execute
$ pip uninstall pbalign
To install pbalign, execute
$ pip install git+https://github.com/PacificBiosciences/pbalign
, or to download the whole pbalign package with examples
$ git clone https://github.com/PacificBiosciences/pbalign.git
$ cd pbalign
$ pip install .
Example (1.1)
$ pbalign tests/data/example_read.fasta \
tests/data/example_ref.fasta \
example.sam
Example (1.2)
$ pbalign tests/data/example_read.fasta \
tests/data/example_ref.fasta \
example.cmp.h5
Example (1.3) - with optional arguments
$ pbalign --maxHits 10 --hitPolicy all \
tests/data/example_read.fasta \
tests/data/example_ref.fasta \
example.sam
Example (2.1) - Import pre-defined options from a config File
$ pbalign --configFile=tests/data/1.config \
tests/data/example_read.fasta \
tests/data/example_ref.fasta \
example.sam
Example (2.2) - Pass options through to aligner
$ pbalign --algorithmOptions='-nCandidates 10 -sdpTupleSize 12' \
tests/data/example_read.fasta \
tests/data/example_ref.fasta \
example.sam
Example (2.3) - Create a cmp.h5 file with the --forQuiver option
# The output cmp.h5 file will be loaded with quality values (pulses)
# from the input bas/bax.h5 file, sorted and repacked, and can therefore
# be consumed by Quiver directly. (Note that in order to use the
# --forQuiver option, cmph5tools and h5repack are required.)
$ pbalign --forQuiver your_movie.bas.h5 your_reference.fasta out.cmp.h5
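Because --forQuiver depends on external tools, it can help to check that they are on the PATH before starting a long alignment. The following is a minimal sketch, assuming Python 3.3 or later (for `shutil.which`); the helper name `find_missing` is hypothetical, and the executable names are taken from the note above and may differ on your installation.

```python
import shutil


def find_missing(tool_names):
    """Return the tool names that cannot be found on the PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Executable names are assumptions based on the note above.
missing = find_missing(["cmph5tools", "h5repack"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All --forQuiver prerequisites were found.")
```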
Example (3.1)
$ python
>>> from pbalign.pbalignrunner import PBAlignRunner
>>> # Specify arguments in a list.
>>> args = ['--maxHits', '20', 'tests/data/example_read.fasta',\
... 'tests/data/example_ref.fasta', 'example.sam']
>>> # Create a PBAlignRunner object.
>>> a = PBAlignRunner(args)
>>> # Execute.
>>> exitCode = a.start()
>>> # Show all files used.
>>> print a.fileNames
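Since PBAlignRunner takes the same argument list as the command line, the list can be assembled programmatically. The sketch below shows one way to do so; `build_pbalign_args` is a hypothetical helper, not part of pbalign, and it only builds the list without running anything. The option names used are the ones listed in the usage text.

```python
def build_pbalign_args(reads, reference, output, options=None):
    """Build a pbalign-style argument list (hypothetical helper).

    options maps option names without the leading dashes to values.
    True emits a bare switch; False and None are skipped.
    """
    args = []
    for name, value in sorted((options or {}).items()):
        flag = "--" + name
        if value is True:
            args.append(flag)  # boolean switch, e.g. --forQuiver
        elif value is not False and value is not None:
            args.extend([flag, str(value)])
    # Positional arguments come last: input, reference, output.
    args.extend([reads, reference, output])
    return args


args = build_pbalign_args(
    "tests/data/example_read.fasta",
    "tests/data/example_ref.fasta",
    "example.sam",
    {"maxHits": 20, "hitPolicy": "all"},
)
print(args)
```

The resulting list could be passed to PBAlignRunner as in Example (3.1), or prefixed with the program name to form a command line.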
usage: pbalign [-h] [--verbose] [--version] [--profile] [--debug]
[--regionTable REGIONTABLE] [--configFile CONFIGFILE]
[--algorithm {blasr,bowtie}] [--maxHits MAXHITS]
[--minAnchorSize MINANCHORSIZE]
[--useccs {useccs,useccsall,useccsdenovo}]
[--noSplitSubreads] [--nproc NPROC]
[--algorithmOptions ALGORITHMOPTIONS]
[--maxDivergence MAXDIVERGENCE] [--minAccuracy MINACCURACY]
[--minLength MINLENGTH]
[--scoreFunction {alignerscore,editdist,blasrscore}]
[--scoreCutoff SCORECUTOFF]
[--hitPolicy {randombest,allbest,random,all}] [--forQuiver]
[--seed SEED] [--tmpDir TMPDIR]
inputFileName referencePath outputFileName
Mapping PacBio sequences to references using an algorithm
selected from a selection of supported command-line alignment
algorithms. Input can be a fasta, pls.h5, bas.h5 or ccs.h5
file or a fofn (file of file names). Output is in either
cmp.h5 or sam format.
positional arguments:
inputFileName The input file can be a fasta, pls.h5, bas.h5, ccs.h5
file or a fofn.
referencePath Either a reference fasta file or a reference repository.
outputFileName The output cmp.h5 or sam file.
optional arguments:
-h, --help show this help message and exit
--verbose, -v Set the verbosity level
--version show program's version number and exit
--profile Print runtime profile at exit
--debug Run within a debugger session
--regionTable REGIONTABLE
Specify a region table for filtering reads.
--configFile CONFIGFILE
Specify a set of user-defined argument values.
--algorithm {blasr,bowtie}
Select an algorithm from ('blasr', 'bowtie').
Default algorithm is blasr.
--maxHits MAXHITS The maximum number of matches of each read to the
reference sequence that will be evaluated. Default
value is 10.
--minAnchorSize MINANCHORSIZE
The minimum anchor size defines the length of the read
that must match against the reference sequence. Default
value is 12.
--useccs {useccs,useccsall,useccsdenovo}
Map the ccsSequence to the genome first, then align
subreads to the interval that the CCS reads mapped to.
useccs: only maps subreads that span the length of
the template.
useccsall: maps all subreads.
useccsdenovo: maps ccs only.
--noSplitSubreads Do not split reads into subreads even if subread
regions are available.
Default value is False.
--nproc NPROC Number of threads. Default value is 8.
--algorithmOptions ALGORITHMOPTIONS
Pass alignment options through.
--maxDivergence MAXDIVERGENCE
The maximum allowed percentage divergence of a read
from the reference sequence. Default value is 30.
--minAccuracy MINACCURACY
The minimum percentage accuracy of alignments that
will be evaluated. Default value is 70.
--minLength MINLENGTH
The minimum aligned read length of alignments that
will be evaluated. Default value is 50.
--scoreFunction {alignerscore,editdist,blasrscore}
Specify a score function for evaluating alignments.
alignerscore : aligner's score in the SAM tag 'as'.
editdist : edit distance between read and reference.
blasrscore : blasr's default score function.
Default value is alignerscore.
--scoreCutoff SCORECUTOFF
The worst score to output an alignment.
--hitPolicy {randombest,allbest,random,all}
Specify a policy for how to treat multiple hits.
random : selects a random hit.
all : selects all hits.
allbest : selects all the best score hits.
randombest: selects a random hit from all best
alignment score hits.
Default value is randombest.
--forQuiver The output cmp.h5 file will be sorted, loaded
with pulse information, and repacked, so that it
can be consumed by quiver directly. This requires
the input file to be in PacBio bas/pls.h5 format.
Default value is False.
--seed SEED Initialize the random number generator with a non-zero
integer. Zero means that current system time is used.
Default value is 1.
--tmpDir TMPDIR Specify a directory for saving temporary files.
Default is /scratch.
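To make the four --hitPolicy choices concrete, here is a minimal sketch of how they might select among the hits of a single read. It is an illustration of the descriptions above, not pbalign's implementation: the function name `select_hits`, the (name, score) tuple layout, and the assumption that a lower score is better are all hypothetical.

```python
import random


def select_hits(hits, policy="randombest", seed=1, lower_is_better=True):
    """Pick hits for one read according to a hit policy.

    hits is a list of (name, score) tuples. Which direction of score is
    "better" is an assumption controlled by lower_is_better.
    """
    if not hits:
        return []
    # A seed of 0 is mapped to None so that Python picks a
    # system-derived seed; any other value gives repeatable choices.
    rng = random.Random(seed if seed != 0 else None)
    if policy == "all":
        return list(hits)
    if policy == "random":
        return [rng.choice(hits)]
    scores = [score for _, score in hits]
    best = min(scores) if lower_is_better else max(scores)
    best_hits = [hit for hit in hits if hit[1] == best]
    if policy == "allbest":
        return best_hits
    if policy == "randombest":
        return [rng.choice(best_hits)]
    raise ValueError("unknown hit policy: %r" % policy)


hits = [("hit_a", -100), ("hit_b", -250), ("hit_c", -250)]
for policy in ("all", "allbest", "random", "randombest"):
    print(policy, select_hits(hits, policy))
```

With a fixed non-zero seed the random policies return the same hit on every run, which is the reproducibility that the --seed option describes.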