Documentation & Usage

ParaAT (Parallel Alignment and back-Translation) can process a large number of homologous groups. It constructs multiple protein-coding DNA alignments in parallel and is thus well suited to large-scale data analysis in the high-throughput era.

ParaAT involves two parallel regions, in each of which the master thread creates a team of slave threads that run concurrently. In the first parallel region, ParaAT aligns the protein sequences of multiple homologous groups by assigning each group to one of the slave threads. In the second region, ParaAT back-translates the protein sequence alignments into the corresponding DNA alignments, again in parallel.
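
The two regions can be illustrated with a minimal Python sketch. ParaAT itself is a Perl script, so this only mirrors the structure described above and is not its implementation. The names back_translate, toy_align and run are hypothetical, and toy_align is a placeholder that pads sequences with gaps where a real run would call an external aligner.

```python
from concurrent.futures import ThreadPoolExecutor


def back_translate(aligned_protein, cds):
    """Thread the gaps of an aligned protein sequence into its coding DNA."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out, pos = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")  # a protein gap becomes a three-base gap
        else:
            out.append(codons[pos])
            pos += 1
    return "".join(out)


def toy_align(proteins):
    """Placeholder aligner (assumption): right-pad with gaps to equal length."""
    width = max(len(p) for p in proteins)
    return [p.ljust(width, "-") for p in proteins]


def run(groups, workers=2):
    """groups: list of (proteins, cds_list) pairs, one per homologous group."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Region 1: one protein alignment task per homologous group.
        protein_alns = list(pool.map(lambda g: toy_align(g[0]), groups))

        # Region 2: one back-translation task per homologous group.
        def back(item):
            aln, (_, cds_list) = item
            return [back_translate(a, c) for a, c in zip(aln, cds_list)]

        return list(pool.map(back, zip(protein_alns, groups)))


if __name__ == "__main__":
    print(run([(["MK", "M"], ["ATGAAA", "ATG"])]))
```

Each region finishes all of its tasks before the next one starts, which matches the description of two separate parallel regions.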

ParaAT provides good scalability and exhibits high parallel efficiency for computationally demanding tasks.

Usage Notes

Parameter settings

-h, -homolog Homolog group file [string, required]
-a, -aminoacid File containing multiple amino acid sequences [string, required]
-n, -nuc File containing multiple nucleotide sequences [string, required]
-p, -processor File containing the number of processors to use, which can be changed while ParaAT is running [string, required]
-o, -output Output folder [string, required]
-m, -msa Multiple sequence aligner, given by name or by full path (clustalw2 | t_coffee | mafft | muscle) [string, optional]
-f, -format Output file format (axt | fasta | paml | codon | clustal) [string, optional], default = fasta
-c, -code Genetic Code used [integer, optional], default = 1-The Standard Code
  • 1-The Standard Code
  • 2-The Vertebrate Mitochondrial Code
  • 3-The Yeast Mitochondrial Code
  • 4-The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
  • 5-The Invertebrate Mitochondrial Code
  • 6-The Ciliate, Dasycladacean and Hexamita Nuclear Code
  • 9-The Echinoderm and Flatworm Mitochondrial Code
  • 10-The Euplotid Nuclear Code
  • 11-The Bacterial, Archaeal and Plant Plastid Code
  • 12-The Alternative Yeast Nuclear Code
  • 13-The Ascidian Mitochondrial Code
  • 14-The Alternative Flatworm Mitochondrial Code
  • 15-Blepharisma Nuclear Code
  • 16-Chlorophycean Mitochondrial Code
  • 21-Trematode Mitochondrial Code
  • 22-Scenedesmus Obliquus Mitochondrial Code
  • 23-Thraustochytrium Mitochondrial Code
  See all documented codes at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi.
-g, -nogap Remove aligned codons with gaps
-t, -nomismatch Remove mismatched codons
-k, -kaks Use KaKs_Calculator to estimate Ka and Ks (requires axt format)
-v, -verbose Verbose output
-?, -help Display help information
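
As an illustration of what -nogap and -nomismatch mean for a codon alignment, here is a minimal Python sketch. It assumes that a codon column is dropped from every sequence when any sequence has a gap in it (-nogap), or when any codon fails to translate to its aligned amino acid (-nomismatch). ParaAT's exact behavior may differ. The function name filter_codon_columns is hypothetical, and only genetic code 1 (The Standard Code) is built here.

```python
from itertools import product

# NCBI genetic code 1 (The Standard Code), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa
                 for c, aa in zip(product(BASES, repeat=3), AMINO)}


def filter_codon_columns(dna_alignment, protein_alignment,
                         nogap=False, nomismatch=False, code=STANDARD_CODE):
    """Return the DNA alignment with the selected codon columns removed."""
    n_codons = len(dna_alignment[0]) // 3
    keep = []
    for col in range(n_codons):
        codons = [seq[3 * col:3 * col + 3] for seq in dna_alignment]
        # -nogap: drop the column if any sequence has a gap in this codon.
        if nogap and any("-" in c for c in codons):
            continue
        # -nomismatch: drop the column if a codon does not translate to
        # the amino acid aligned at the same position.
        if nomismatch:
            residues = [p[col] for p in protein_alignment]
            if any(c != "---" and code.get(c.upper()) != r
                   for c, r in zip(codons, residues)):
                continue
        keep.append(col)
    return ["".join(seq[3 * c:3 * c + 3] for c in keep)
            for seq in dna_alignment]


if __name__ == "__main__":
    dna = ["ATGAAATTT", "ATG---TTC"]
    pep = ["MKF", "M-F"]
    print(filter_codon_columns(dna, pep, nogap=True))
```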

Example

Before running, make sure that ParaAT.pl, Epal2nal.pl and a multiple sequence aligner (clustalw2 | t_coffee | mafft | muscle) are placed in a directory on your PATH so that they can be called from any working directory.

Suppose the following input files are available:

File Name     Description
hmc.homolog   9,712 human-mouse-chimp orthologous groups
hmc.pep       A file containing all amino acid sequences for human, mouse and chimp
hmc.cds       A file containing all nucleotide sequences for human, mouse and chimp
hmc.proc      A file containing a single number, viz., the number of processors to be used
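
Because the processor file can be changed while ParaAT is running, a scheduler has to re-read it from time to time. The Python sketch below shows one way to read such a file defensively. It assumes the file holds a single positive integer, as hmc.proc is described above. The function name read_processor_count and the fallback value of 1 are hypothetical.

```python
from pathlib import Path


def read_processor_count(path, default=1):
    """Return the positive integer stored in `path`, or `default` if unusable."""
    try:
        value = int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        # Missing file, non-numeric content, or empty file.
        return default
    return value if value > 0 else default


if __name__ == "__main__":
    print(read_processor_count("hmc.proc"))
```

Calling this function before each batch of homologous groups is dispatched would pick up an edited value without restarting the run.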

The command to run ParaAT is shown below; the results are written to a folder named "hmc-output":

ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output hmc-output

To output the axt-formatted sequences, the command is:

ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output hmc-output -format axt

To output the axt-formatted sequences and perform Ka and Ks calculations, the command is:

ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output hmc-output -format axt -kaks
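
When ParaAT is driven from a script, the three commands above can be assembled programmatically. The Python sketch below uses only the options listed under Parameter settings and enforces the documented rule that -kaks requires axt format. The helper name build_paraat_command is hypothetical.

```python
def build_paraat_command(homolog, aminoacid, nuc, processor, output,
                         fmt=None, kaks=False):
    """Build the ParaAT.pl argument list for the documented options."""
    if kaks and fmt != "axt":
        # Ka and Ks estimation is documented as requiring axt output.
        raise ValueError("-kaks requires -format axt")
    cmd = ["ParaAT.pl",
           "-homolog", homolog,
           "-aminoacid", aminoacid,
           "-nuc", nuc,
           "-processor", processor,
           "-output", output]
    if fmt is not None:
        cmd += ["-format", fmt]
    if kaks:
        cmd.append("-kaks")
    return cmd


if __name__ == "__main__":
    cmd = build_paraat_command("hmc.homolog", "hmc.pep", "hmc.cds",
                               "hmc.proc", "hmc-output",
                               fmt="axt", kaks=True)
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), provided ParaAT.pl is installed and callable from the working directory.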

The output sequences in axt format are shown below.

NP_000006-NP_032699
ATGGACATTGAAGCATATTTTGAAAGAATTGGCTATAAGAACTCTAGGAACAAATTGGACTTGGAAACATTAACTGACATTCTTGAGCACCAGATCCGGGCTGTTCCCTTTGAGAACCTTAACATGCATTGTGGGCAAGCCATGGAGTTGGGCTTAGAGGCTATTTTTGATCACATTGTAAGAAGAAACCGGGGTGGGTGGTGTCTCCAGGTCAATCAACTTCTGTACTGGGCTCTGACCACAATCGGTTTTCAGACCACAATGTTAGGAGGGTATTTTTACATCCCTCCAGTTAACAAATACAGCACTGGCATGGTTCACCTTCTCCTGCAGGTGACCATTGACGGCAGGAATTACATTGTCGATGCTGGGTCTGGAAGCTCCTCCCAGATGTGGCAGCCTCTAGAATTAATTTCTGGGAAGGATCAGCCTCAGGTGCCTTGCATTTTCTGCTTGACAGAAGAGAGAGGAATCTGGTACCTGGACCAAATCAGGAGAGAGCAGTATATTACAAACAAAGAATTTCTTAATTCTCATCTCCTGCCAAAGAAGAAACACCAAAAAATATACTTATTTACGCTTGAACCTCGAACAATTGAAGATTTTGAGTCTATGAATACATACCTGCAGACGTCTCCAACATCTTCATTTATAACCACATCATTTTGTTCCTTGCAGACCCCAGAAGGGGTTTACTGTTTGGTGGGCTTCATCCTCACCTATAGAAAATTCAATTATAAAGACAATACAGATCTGGTCGAGTTTAAAACTCTCACTGAGGAAGAGGTTGAAGAAGTGCTGAGAAATATATTTAAGATTTCCTTGGGGAGAAATCTCGTGCCCAAACCTGGTGATGGATCCCTTACTATT
ATGGACATCGAAGCATACTTTGAAAGGATTGGTTACAAGAACTCAGTGAATAAATTGGACTTAGCCACATTAACTGAAGTTCTTCAGCACCAGATGCGAGCAGTTCCTTTTGAGAATCTTAACATGCATTGTGGAGAAGCCATGCATCTGGATTTACAGGACATTTTTGACCACATAGTAAGGAAGAAGAGAGGTGGATGGTGTCTCCAGGTTAATCATCTGCTGTACTGGGCTCTGACCAAAATGGGCTTTGAAACCACAATGTTGGGAGGATATGTTTACATAACTCCAGTCAGCAAATATAGCAGTGAAATGGTCCACCTTCTAGTACAGGTGACCATCAGTGACAGGAAGTACATTGTGGATTCCGCCTATGGAGGCTCCTACCAGATGTGGGAGCCTCTGGAATTAACATCTGGGAAGGATCAGCCTCAGGTGCCTGCCATCTTCCTTTTGACAGAGGAGAATGGAACCTGGTACTTGGACCAAATCAGAAGAGAGCAGTATGTTCCAAATGAAGAATTTGTTAACTCAGACCTCCTTGAAAAGAACAAATATCGAAAAATCTACTCCTTTACTCTTGAGCCCCGAGTTATCGAGGATTTTGAATATGTGAATAGCTATCTTCAGACATCGCCAGCATCTGTGTTTGTAAGCACATCGTTCTGTTCCTTGCAGACCTCGGAAGGGGTTCACTGTTTAGTGGGCTCCACCTTTACAAGTAGGAGATTCAGCTATAAGGACGATGTAGATCTGGTTGAGTTTAAATATGTGAATGAGGAAGAAATAGAAGATGTACTGAAAACCGCATTTGGCATTTCTTTGGAGAGAAAGTTTGTGCCCAAACATGGTGAACTAGTTTTTACTATT

NP_000008-NP_031409
ATGGCCGCCGCGCTGCTCGCCCGGGCCTCGGGCCCTGCCCGCAGAGCTCTCTGTCCTAGGGCCTGGCGGCAGTTACACACCATCTACCAGTCTGTGGAACTGCCCGAGACACACCAGATGTTGCTCCAGACATGCCGGGACTTTGCCGAGAAGGAGTTGTTTCCCATTGCAGCCCAGGTGGATAAGGAACATCTCTTCCCAGCGGCTCAGGTGAAGAAGATGGGCGGGCTTGGGCTTCTGGCCATGGACGTGCCCGAGGAGCTTGGCGGTGCTGGCCTCGATTACCTGGCCTACGCCATCGCCATGGAGGAGATCAGCCGTGGCTGCGCCTCCACCGGAGTCATCATGAGTGTCAACAACTCTCTCTACCTGGGGCCCATCTTGAAGTTTGGCTCCAAGGAGCAGAAGCAGGCGTGGGTCACGCCTTTCACCAGTGGTGACAAAATTGGCTGCTTTGCCCTCAGCGAACCAGGGAACGGCAGTGATGCAGGAGCTGCGTCCACCACCGCCCGGGCCGAGGGCGACTCATGGGTTCTGAATGGAACCAAAGCCTGGATCACCAATGCCTGGGAGGCTTCGGCTGCCGTGGTCTTTGCCAGCACGGACAGAGCCCTGCAAAACAAGGGCATCAGTGCCTTCCTGGTCCCCATGCCAACGCCTGGGCTCACGTTGGGGAAGAAAGAAGACAAGCTGGGCATCCGGGGCTCATCCACGGCCAACCTCATCTTTGAGGACTGTCGCATCCCCAAGGACAGCATCCTGGGGGAGCCAGGGATGGGCTTCAAGATAGCCATGCAAACCCTGGACATGGGCCGCATCGGCATCGCCTCCCAGGCCCTGGGCATTGCCCAGACCGCCCTCGATTGTGCTGTGAACTACGCTGAGAATCGCATGGCCTTCGGGGCGCCCCTCACCAAGCTCCAGGTCATCCAGTTCAAGTTGGCAGACATGGCCCTGGCCCTGGAGAGTGCCCGGCTGCTGACCTGGCGCGCTGCCATGCTGAAGGATAACAAGAAGCCTTTCATCAAGGAGGCAGCCATGGCCAAGCTGGCCGCCTCGGAGGCCGCGACCGCCATCAGCCACCAGGCCATCCAGATCCTGGGCGGCATGGGCTACGTGACAGAGATGCCGGCAGAGCGGCACTACCGCGACGCCCGCATCACTGAGATCTACGAGGGCACCAGCGAAATCCAGCGGCTGGTGATCGCCGGGCATCTGCTCAGGAGCTACCGGAGC 
ATGGCTGCCGCCTTGCTCGCCCGGGCCCGTGGCCCTCTCCGTAGAGCTCTCGGTGTTCGGGACTGGCGACGGTTACACACTGTTTACCAGTCTGTGGAGCTGCCTGAGACACACCAGATGTTGCGTCAGACATGCCGTGACTTTGCCGAGAAGGAGTTGGTCCCCATTGCGGCCCAGCTGGACAGGGAGCATCTCTTCCCCACAGCTCAGGTTAAGAAGATGGGTGAGCTCGGGCTGCTGGCCATGGATGTGCCAGAGGAGCTGAGTGGTGCAGGCTTGGATTACCTGGCCTACTCCATCGCCCTGGAGGAGATCAGCCGTGCCTGCGCCTCCACGGGAGTTATCATGAGCGTCAACAATTCTCTCTACTTGGGACCCATTCTGAAGTTTGGATCCGCACAGCAGAAGCAACAGTGGATCACCCCTTTCACCAATGGTGACAAAATCGGCTGTTTTGCCCTCAGTGAGCCAGGCAATGGCAGTGATGCTGGAGCCGCTTCCACCACTGCCCGGGAAGAGGGTGACTCATGGGTCCTCAACGGCACCAAAGCTTGGATCACCAACTCCTGGGAGGCTTCCGCCACGGTGGTATTTGCCAGCACAGACAGGTCCCGGCAGAACAAGGGTATCAGTGCCTTCCTGGTTCCCATGCCAACTCCTGGGCTCACGCTGGGCAAGAAGGAAGACAAGCTGGGCATCCGGGCCTCCTCCACAGCTAACCTCATCTTTGAGGACTGCCGGATCCCCAAGGAGAACCTGCTTGGGGAGCCTGGGATGGGCTTCAAAATAGCCATGCAAACCCTGGACATGGGTCGCATTGGCATCGCCTCCCAGGCCCTGGGCATCGCCCAGGCCTCCCTGGATTGTGCTGTGAAGTATGCCGAGAACCGCAATGCCTTTGGGGCACCGCTCACCAAGCTCCAAAATATCCAGTTCAAGCTGGCAGACATGGCCCTGGCCCTGGAGAGTGCCCGCCTGCTGACCTGGCGTGCTGCCATGTTGAAAGACAACAAGAAACCTTTCACCAAGGAGTCCGCCATGGCCAAACTGGCTGCATCGGAGGCTGCAACCGCCATTAGCCACCAGGCCATCCAGATCCTGGGCGGCATGGGGTATGTGACAGAGATGCCGGCTGAGCGGTACTACCGAGATGCCCGCATCACTGAGATCTACGAAGGGACCAGCGAAATCCAGAGACTGGTGATCGCTGGGCATCTGCTCCGGAGCTACCGGAGC
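
For downstream processing, axt output like the sample above can be read back with a few lines of Python. This sketch assumes each record consists of a pair name followed by two aligned sequences, all separated by whitespace, and that names contain no spaces. The function names parse_axt and identity are hypothetical.

```python
def parse_axt(text):
    """Parse axt-style text into a list of (name, seq1, seq2) tuples."""
    tokens = text.split()
    if len(tokens) % 3:
        raise ValueError("expected name, sequence, sequence triples")
    records = []
    for i in range(0, len(tokens), 3):
        name, seq1, seq2 = tokens[i:i + 3]
        if len(seq1) != len(seq2):
            raise ValueError(f"{name}: aligned sequences differ in length")
        records.append((name, seq1, seq2))
    return records


def identity(seq1, seq2):
    """Fraction of ungapped aligned sites at which the two sequences agree."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    sample = "geneA-geneB\nATGAAA\nATGAAG\n\ngeneC-geneD\nATG---\nATGTTT\n"
    for name, s1, s2 in parse_axt(sample):
        print(name, len(s1), round(identity(s1, s2), 3))
```

Splitting on any whitespace keeps the parser tolerant of records whose line breaks were lost in transit.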