CloudPhylo: Phylogeny reconstruction on large-scale datasets

Manual

CloudPhylo is a tool built on Apache Spark that is capable of processing large-scale datasets for phylogeny reconstruction.

Install from source code

Use SBT to compile CloudPhylo:

sbt assembly

Example

spark-submit --master "local[4]" \
             --class cbb.cloudphylo.SparkRunner \
             cloudphylo-assembly-1.0.jar \
             --charset aa \
             --in sample \
             -k 6

See the Spark documentation if you are not familiar with the spark-submit utility.
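The -k option in the example sets the k-mer size. As a general illustration of what a k-mer is, the sketch below counts every overlapping substring of length k in one sequence. This is not taken from the CloudPhylo source; the function name count_kmers is hypothetical, and how CloudPhylo uses the counts internally is not described in this manual.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count all overlapping substrings of length k in a sequence.

    Illustrative helper only; not part of CloudPhylo.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over the sequence, one position at a time.
    # A sequence shorter than k yields an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

A sequence of length n contains n - k + 1 overlapping k-mers, so a larger k gives fewer but more specific substrings.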

Input

A directory containing all sequences in FASTA format. Use the file extension .faa for amino acid sequences and .fna for DNA sequences.
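Before submitting a job, it can help to confirm that the input directory matches the charset you plan to pass. The following is a minimal sketch, assuming a run handles a single sequence type (this manual does not say whether mixed directories are accepted). The helper name detect_charset is hypothetical and not part of CloudPhylo.

```python
from pathlib import Path

# Extension-to-charset mapping taken from this manual:
# .faa = amino acid ("aa"), .fna = DNA ("dna").
EXT_TO_CHARSET = {".faa": "aa", ".fna": "dna"}


def detect_charset(directory):
    """Return "aa" or "dna" for a directory of FASTA files (hypothetical helper)."""
    found = set()
    for path in Path(directory).iterdir():
        if path.is_file() and path.suffix in EXT_TO_CHARSET:
            found.add(EXT_TO_CHARSET[path.suffix])
    if not found:
        raise ValueError(f"no .faa or .fna files found in {directory}")
    if len(found) > 1:
        # Assumption: one run handles a single sequence type.
        raise ValueError(f"{directory} mixes .faa and .fna files")
    return found.pop()
```

The returned string can be passed directly as the value of the -c / --charset option.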

Use CloudPhylo in QOMO Platform

CloudPhylo is also deployed on the QOMO platform. If you do not have access to a private Spark cluster, you may prefer this out-of-the-box solution for running CloudPhylo.

Use CloudPhylo via VirtualBox image

Download the prebuilt image from the download link and import the VM into your own VirtualBox installation (login username: cloudphylo; password: 123456).

spark-1.6.1-bin-hadoop2.6/bin/spark-submit cloudphylo-assembly-1.0.jar -i sample/ -k 6 -c aa

Use precompiled JAR

Check the release page.

Command line options


  -i <fasta input> | --in <fasta input>
        Directory containing sequence fasta files
  -o <value> | --out <value>
        Output file prefix
  -c <charset> | --charset <charset>
        dna for DNA or aa for amino acid
  -k <value> | --kmer-size <value>
        k-mer size
  -C <value> | --output-cv <value>
        [optional] Output CV File
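When CloudPhylo is launched from a script, the options above can be assembled programmatically. The sketch below builds a spark-submit argument list using only the flags documented in this manual. The function name build_command and its parameters are hypothetical. Treating --out as optional is an assumption based on the example above, which omits it.

```python
def build_command(jar, in_dir, charset, k, out=None, cv=None,
                  master="local[4]", spark_submit="spark-submit"):
    """Assemble a spark-submit argument list for CloudPhylo (hypothetical helper)."""
    if charset not in ("aa", "dna"):
        raise ValueError("charset must be 'aa' or 'dna'")
    cmd = [
        spark_submit,
        "--master", master,
        "--class", "cbb.cloudphylo.SparkRunner",
        jar,
        "--in", str(in_dir),
        "--charset", charset,
        "--kmer-size", str(k),
    ]
    # Assumption: --out may be omitted, as in the manual's example.
    if out is not None:
        cmd += ["--out", str(out)]
    # --output-cv is marked [optional] in the option list.
    if cv is not None:
        cmd += ["--output-cv", str(cv)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Because the arguments stay in a list, no shell quoting is needed for values such as local[4].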

Copyright

Organization: BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences