CloudPhylo: Spark-based Phylogeny Reconstruction Tool Scalable for Big Data Analysis

Phylogeny reconstruction is a routine analysis in most evolutionary studies; it determines and depicts the evolutionary relationships among genes or species. However, most existing tools for phylogeny reconstruction are built on a single-process model or on traditional parallel paradigms such as Pthreads and OpenMP, and therefore cannot scale well with the dramatically increasing size of input datasets. To tackle this challenge, BIGD (Big Data Center) presents a Spark-based tool, CloudPhylo, that handles large datasets for fast and scalable phylogeny reconstruction. Spark is a recently proposed cloud computing framework that incorporates the MapReduce paradigm and efficiently caches intermediate calculation results, which significantly boosts the performance of CloudPhylo and enables it to be used for large-scale phylogenetic tree inference.
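To give a feel for why a map-style computation with cached intermediate results suits this problem, here is a minimal sketch in plain Python. It is not CloudPhylo's code and uses no Spark API: the toy sequences, the k-mer length K, the squared Euclidean distance, and the function names are all assumptions made for illustration, and functools.lru_cache merely stands in for Spark's in-memory caching.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations

# Hypothetical toy input: sequence name -> DNA string (not from the article).
SEQUENCES = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "TTTTGGGG",
}
K = 3  # assumed k-mer length, chosen only for illustration


@lru_cache(maxsize=None)
def kmer_profile(name):
    """Count k-mers once per sequence; the cache plays the role of cached intermediate results."""
    seq = SEQUENCES[name]
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))


def distance(pair):
    """Map step: squared Euclidean distance between two cached k-mer profiles."""
    x, y = pair
    px, py = kmer_profile(x), kmer_profile(y)
    d = sum((px[k] - py[k]) ** 2 for k in set(px) | set(py))
    return (x, y), d


def distance_matrix(names):
    """Map every unordered pair of sequences to a distance and collect the results."""
    return dict(map(distance, combinations(sorted(names), 2)))


matrix = distance_matrix(SEQUENCES)
```

Each sequence's profile is computed once and then reused for every pair it takes part in. Because the pairwise distances are independent of one another, the map step is the part a framework such as Spark could spread across many workers; a distance-based tree-building method could then consume the resulting matrix.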

CloudPhylo is not only the world’s first phylogeny reconstruction tool available for large-scale datasets, but also the first Spark-based bioinformatics software in China. According to the comparison results, CloudPhylo achieves high efficiency and good scalability and is well suited to large-scale phylogenetic tree inference (Figure 1). To make CloudPhylo more accessible and usable for researchers, we deployed it on the Qomo Cloud Platform (http://bigd.big.ac.cn/cloud/tools/QT000010) of BIGD. This work has been published in the journal Bioinformatics (https://www.ncbi.nlm.nih.gov/pubmed/27742698).

This work is supported by the National Programs for High Technology Research and Development [2014AA021503 and 2015AA020108] and the International Partnership Program of the Chinese Academy of Sciences [153F11KYSB2016008].

Paper link: https://www.ncbi.nlm.nih.gov/pubmed/27742698