CloudPhylo: Phylogeny reconstruction on large-scale datasets
Manual
CloudPhylo is a tool built on Apache Spark that is capable of processing large-scale datasets for phylogeny reconstruction.
Install from source code
Use SBT to compile CloudPhylo:
sbt assembly
Example
spark-submit --master local[4] \
  --class cbb.cloudphylo.SparkRunner \
  cloudphylo-assembly-1.0.jar \
  --charset aa \
  --in sample \
  -k 6
Check the Spark documentation if you are not familiar with the spark-submit utility.
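The -k option in the example sets the k-mer size. For readers new to the term, a k-mer is a substring of length k, and a sequence of length n contains n - k + 1 overlapping k-mers. The sketch below only illustrates that definition on a made-up toy sequence; the helper name kmers is hypothetical and this is not CloudPhylo's implementation.

```shell
#!/bin/sh
# Print every overlapping k-mer of a sequence, one per line.
# $1 = sequence, $2 = k
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# Toy amino acid sequence (invented for illustration), k = 6 as in the example.
kmers MKVLAAGIVG 6
```

For this 10-residue toy sequence and k = 6, the sketch prints five k-mers, starting with MKVLAA and ending with AAGIVG.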
Input
Directory containing all sequences in FASTA format. Use the file extension .faa for amino acid sequences and .fna for DNA sequences.
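Before submitting a job, it can help to confirm that the input directory holds files with the expected extension. The following is a minimal sketch of such a check; the helper name count_fasta, the file names, and the toy sequences are all hypothetical, and the check itself is not part of CloudPhylo.

```shell
#!/bin/sh
# Count the files with a given extension directly inside a directory.
# $1 = directory, $2 = extension without the dot (faa or fna)
count_fasta() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Build a throwaway input directory with two toy amino acid FASTA files.
demo_dir=$(mktemp -d)
printf '>seq1\nMKVLAAGIVGLLLAQ\n' > "$demo_dir/a.faa"
printf '>seq1\nMKILSAGLVGALLAQ\n' > "$demo_dir/b.faa"

echo "amino acid files: $(count_fasta "$demo_dir" faa)"
echo "DNA files: $(count_fasta "$demo_dir" fna)"
```

A directory that reports zero files for the extension matching the chosen charset most likely has misnamed inputs.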
Use CloudPhylo in QOMO Platform
CloudPhylo is also deployed on the QOMO platform. If you do not have access to a private Spark cluster, you may prefer this out-of-the-box solution for running CloudPhylo.
Use CloudPhylo via Virtualbox image
Download the prebuilt image from the download link and import the VM into your own VirtualBox installation (login username: cloudphylo; password: 123456). Then run:
spark-1.6.1-bin-hadoop2.6/bin/spark-submit cloudphylo-assembly-1.0.jar -i sample/ -k 6 -c aa
Use precompiled JAR
Check the release page.
Command line options
-i <fasta input> | --in <fasta input>
Directory containing sequence fasta files
-o <value> | --out <value>
Output file prefix
-c <charset> | --charset <charset>
dna for DNA or aa for amino acid
-k <value> | --kmer-size <value>
k-mer size
-C <value> | --output-cv <value>
[optional] Output CV File
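The long-form options above can be combined into a single invocation. The sketch below only assembles and prints the command line as a dry run, so it can be inspected before being run on a machine with Spark installed. The helper name cloudphylo_cmd and the output prefix result are hypothetical; the master URL, class name, and JAR name are taken from the example earlier in this manual.

```shell
#!/bin/sh
# Print a spark-submit command line for CloudPhylo without executing it.
# $1 = input directory, $2 = output prefix, $3 = charset (dna or aa), $4 = k-mer size
cloudphylo_cmd() {
    # Reject any charset other than the two documented values.
    case "$3" in
        dna|aa) ;;
        *) echo "charset must be dna or aa" >&2; return 1 ;;
    esac
    printf 'spark-submit --master local[4] --class cbb.cloudphylo.SparkRunner cloudphylo-assembly-1.0.jar --in %s --out %s --charset %s --kmer-size %s\n' \
        "$1" "$2" "$3" "$4"
}

# Dry run with the sample directory and a hypothetical output prefix.
cloudphylo_cmd sample result aa 6
```

The optional -C | --output-cv option is left out because this manual does not describe the value it expects.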
Copyright
Organization: BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences