KaKs Documents - BIG Data Center - National Genomics Data Center (CNCB

Introduction

KaKs_Calculator is a program that calculates nonsynonymous (Ka) and synonymous (Ks) substitution rates through model selection and model averaging. In addition, several currently acknowledged methods for estimating Ka and Ks are also incorporated into it.

The KaKs_Calculator package, including source code, compiled executables and documentation, is freely available for academic use only.

Installation

For high efficiency and cross-platform compatibility, the kernel of KaKs_Calculator is written in standard C++. The GUI (Graphical User Interface) of the Windows version is built with Visual C++ 6.0, and that of the Mac version is written in Objective-C.

Linux/Unix

KaKs_Calculator has been tested on AIX, IRIX and Solaris.

Unpack the KaKs_CalculatorXXX.tar.gz package with the following commands.
gzip -d KaKs_CalculatorXXX.tar.gz
tar -xf KaKs_CalculatorXXX.tar
On other Linux/Unix systems, you have to compile the program yourself in the source folder using the g++/gcc compiler.
cd KaKs_CalculatorXXX/src
make

Mac

KaKs_Calculator has been tested on Mac OS X version 10.6.6.

Open the disk image file KaKs_Calculator_XXX.dmg.
Follow the installation instructions and drag the KaKs_Calculator folder into the Applications folder.
The KaKs_Calculator folder can then be found in Applications.

Methods for Calculating Ka and Ks

Calculating Ka and Ks normally involves three steps. Assume that the compared DNA sequences have length n and differ by m substitutions. First, count the numbers of synonymous (S) and nonsynonymous (N) sites, where S + N = n. Second, count the numbers of synonymous (Sd) and nonsynonymous (Nd) substitutions, where Sd + Nd = m. Third, correct for multiple substitutions: because the observed number of substitutions underestimates the true number as sequences diverge over time, Nd/N and Sd/S represent Ka and Ks only after this correction.
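The third step can be illustrated with a minimal Python sketch. It assumes the Jukes-Cantor correction, which is the one used by the NG method; other methods use different corrections. The function names and the counts in the example are hypothetical and are not taken from the KaKs_Calculator source.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differences p for multiple substitutions."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(S, N, Sd, Nd):
    """Steps 1 and 2 (the counts) are taken as given; step 3 applies the correction."""
    ks = jukes_cantor(Sd / S)  # synonymous substitutions per synonymous site
    ka = jukes_cantor(Nd / N)  # nonsynonymous substitutions per nonsynonymous site
    return ka, ks


# Hypothetical counts: S + N = n = 300 sites, Sd + Nd = m = 20 substitutions.
ka, ks = ka_ks(S=70.0, N=230.0, Sd=10.0, Nd=10.0)
print(f"Ka = {ka:.4f}, Ks = {ks:.4f}, Ka/Ks = {ka / ks:.4f}")
```

The corrected values are always at least as large as the raw proportions Nd/N and Sd/S, which reflects the underestimation described above.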

Methods for calculating Ka and Ks adopt different substitution models with subtle yet significant differences. They can be classified as approximate methods and maximum-likelihood methods. Unlike approximate methods, maximum-likelihood methods use probability theory to carry out all three steps mentioned above at once.

Approximate Methods

There are several approximate methods incorporated into KaKs_Calculator, and we list their abbreviations in the program and their corresponding reference(s) as follows.

NG: Nei, M. and Gojobori, T. (1986)
LWL: Li, W.H., et al. (1985)
LPB: Li, W.H. (1993) and Pamilo, P. and Bianchi, N.O. (1993)
MLWL (Modified LWL), MLPB (Modified LPB): Tzeng, Y.H., et al. (2004)
YN: Yang, Z. and Nielsen, R. (2000)
MYN (Modified YN): Zhang, Z., et al. (2006)

Maximum-Likelihood Methods

The GY method takes into account features of sequence evolution, such as the transition/transversion rate ratio and nucleotide frequencies (reflected in the HKY model), and incorporates these features into a codon-based model. We extend this method to a set of candidate models in a maximum-likelihood framework and use the AICc for model selection and model averaging.

GY: Goldman, N. and Yang, Z. (1994)
MS (Model Selection), MA (Model Averaging): based on a set of candidate models defined by Posada, D. (2003) as follows.

Model (equal nucleotide frequencies)	Model (unequal nucleotide frequencies)	Substitution Rates
JC	F81	r_TC = r_AG = r_TA = r_CG = r_TG = r_CA
K2P	HKY	r_TC = r_AG ≠ r_TA = r_CG = r_TG = r_CA
TrNEF	TrN	r_TC ≠ r_AG ≠ r_TA = r_CG = r_TG = r_CA
K3P	K3PUF	r_TC = r_AG ≠ r_TA = r_CG ≠ r_TG = r_CA
TIMEF	TIM	r_TC ≠ r_AG ≠ r_TA = r_CG ≠ r_TG = r_CA
TVMEF	TVM	r_TC = r_AG ≠ r_TA ≠ r_CG ≠ r_TG ≠ r_CA
SYM	GTR	r_TC ≠ r_AG ≠ r_TA ≠ r_CG ≠ r_TG ≠ r_CA

r_ij: substitution rate between i and j, where i ≠ j and i, j∈{A, C, G, T}
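Model selection and model averaging with the AICc can be sketched in Python as follows. This is an illustration of the standard formulas (AICc, Akaike weights, weighted average), not the program's own code. The log-likelihoods, parameter counts, Ka estimates and the sample size n are hypothetical, and treating n as the sequence length is an assumption.

```python
import math


def aicc(log_likelihood, k, n):
    """AICc = -2 lnL + 2k + 2k(k + 1) / (n - k - 1), for k parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return -2.0 * log_likelihood + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def akaike_weights(scores):
    """Turn AICc scores into weights that sum to 1 (a lower score gives a higher weight)."""
    best = min(scores)
    relative = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(relative)
    return [r / total for r in relative]


def model_average(estimates, weights):
    """Weighted average of the per-model estimates."""
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical candidates: (model name, lnL, number of parameters, Ka estimate).
candidates = [("JC", -1520.4, 1, 0.041), ("HKY", -1512.9, 5, 0.044), ("GTR", -1511.8, 9, 0.045)]
n = 300  # assumed sample size

scores = [aicc(lnl, k, n) for _, lnl, k, _ in candidates]
weights = akaike_weights(scores)
best_model = candidates[scores.index(min(scores))][0]          # model selection (MS)
ka_averaged = model_average([c[3] for c in candidates], weights)  # model averaging (MA)
print(best_model, round(ka_averaged, 4))
```

Model selection keeps only the estimate from the lowest-scoring model, whereas model averaging lets every candidate contribute in proportion to its Akaike weight.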

Format of Sequence

KaKs_Calculator accepts quasi-AXT sequence format as follows. Before calculation, gaps and stop codons between compared sequences will be removed. You can also see “example.axt” in the folder of “KaKs_CalculatorXXX/examples/”.

For example:

NP_000026
ATGCTCCTGTG-CCACTGGCC
ATCCCC-TGCGCTCACTGGAC

NP_000053
ACAGaTtCTACCc-GCCcACTA--GgtGtt
---ggTTCTCCtACCcA-G-CACTACTggg

Each pair of sequences in an AXT file occupies three lines: one sequence name line followed by two sequence lines. Consecutive pairs are separated by one blank line.

Sequence name line
NP_000026
Pairwise sequence lines
ATGCTCCTGTG-CCACTGGCC
ATCCCC-TGCGCTCACTGGAC
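The record layout described above can be read with a short Python sketch. The function names are hypothetical. The codon-wise removal of gaps and stop codons, and the use of the standard genetic code's stop codons, are assumptions made for illustration; this document does not specify exactly how the program performs the removal.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def parse_axt(text):
    """Return a list of (name, seq1, seq2) tuples from blank-line-separated records."""
    records, block = [], []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last record
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            if len(block) != 3:
                raise ValueError(f"expected 3 lines per record, got {len(block)}")
            name, seq1, seq2 = block
            if len(seq1) != len(seq2):
                raise ValueError(f"{name}: aligned sequences differ in length")
            records.append((name, seq1.upper(), seq2.upper()))
            block = []
    return records


def drop_gap_and_stop_codons(seq1, seq2):
    """Keep only codon columns that are free of gaps and stop codons in both sequences."""
    kept1, kept2 = [], []
    for i in range(0, len(seq1) - len(seq1) % 3, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if "-" in c1 or "-" in c2:
            continue
        if c1 in STOP_CODONS or c2 in STOP_CODONS:
            continue
        kept1.append(c1)
        kept2.append(c2)
    return "".join(kept1), "".join(kept2)


example = """NP_000026
ATGCTCCTGTG-CCACTGGCC
ATCCCC-TGCGCTCACTGGAC
"""
records = parse_axt(example)
name, seq1, seq2 = records[0]
clean1, clean2 = drop_gap_and_stop_codons(seq1, seq2)
print(name, clean1, clean2)
```

Reading one record at a time in this way keeps memory use tied to the longest pair rather than to the size of the whole file.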

Parameter Settings

Linux/Unix

On Linux/Unix, KaKs_Calculator is well suited to calculating Ka and Ks for large datasets. It reads one pair of sequences at a time and computes the corresponding estimates before moving on to the next pair, so its memory requirement is proportional to the length of the longest sequence pair. In addition, KaKs_Calculator allows users to choose more than one method in a single run. The parameters of the Linux version are as follows.

-i AXT sequence file name for calculating Ka and Ks
-o File name for outputting results
-c Genetic code (Default = 1-Standard Code). For more information about the Genetic Codes, please see the link http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
-m Methods for calculating Ka and Ks (Default = MA): NG, LWL, LPB, MLWL, MLPB, YN, MYN, GY, MS, MA, ALL (including all above methods)
-d File name for details about each candidate model only when using the method of MS or MA
-h Show help information

For example

use MA method and standard code
KaKs_Calculator -i test.axt -o test.axt.kaks
use MA method and vertebrate mitochondrial code
KaKs_Calculator -i test.axt -o test.axt.kaks -c 2
use MA method and standard code and output details of model selection on each candidate model
KaKs_Calculator -i test.axt -o test.axt.kaks -d test.axt.details
use LWL, YN and MYN and standard Code
KaKs_Calculator -i test.axt -o test.axt.kaks -m LWL -m YN -m MYN
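For batch work, the command lines above can be assembled from a script. The following Python sketch uses only the options listed in this section (-i, -o, -c, -m, -d). The helper name build_kaks_command is hypothetical, and the sketch assumes that the executable is named KaKs_Calculator and is on the PATH.

```python
import subprocess

# Method names as listed for the -m option.
KNOWN_METHODS = {"NG", "LWL", "LPB", "MLWL", "MLPB", "YN", "MYN", "GY", "MS", "MA", "ALL"}


def build_kaks_command(axt_path, out_path, methods=None, genetic_code=None,
                       details_path=None, executable="KaKs_Calculator"):
    """Build the argument list; options left as None fall back to the program defaults."""
    cmd = [executable, "-i", axt_path, "-o", out_path]
    if genetic_code is not None:
        cmd += ["-c", str(genetic_code)]
    for method in methods or []:
        if method not in KNOWN_METHODS:
            raise ValueError(f"unknown method: {method}")
        cmd += ["-m", method]  # one -m per method, as in the last example above
    if details_path is not None:
        cmd += ["-d", details_path]
    return cmd


cmd = build_kaks_command("test.axt", "test.axt.kaks", methods=["LWL", "YN", "MYN"])
print(" ".join(cmd))
# Uncomment on a machine where KaKs_Calculator is installed:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.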

Windows

The Windows version provides a friendly interface for selecting the input sequence file, the genetic code and the method(s) for estimating Ka and Ks. During calculation, you can minimize the application window and send it to the tray. After the calculation finishes, KaKs_Calculator allows users to export the results to a file or to the clipboard.

Output Format

KaKs_Calculator reports comprehensive information estimated from the compared sequences: the numbers of synonymous and nonsynonymous sites, the numbers of synonymous and nonsynonymous substitutions, GC contents, the maximum-likelihood score and the AICc, in addition to the synonymous and nonsynonymous substitution rates and their ratio. Fisher’s exact test for small samples is also applied to assess the validity of the Ka and Ks values calculated by these methods.

Sequence: Name of the sequence pair
Method: Name of method for calculation of Ka and Ks
Ka: Nonsynonymous substitution rate
Ks: Synonymous substitution rate
Ka/Ks: Selective strength
P-Value(Fisher): The P-value computed by Fisher’s exact test
Length: Sequence length (after removing gaps and stop codon(s))
S-Sites: Synonymous sites
N-Sites: Nonsynonymous sites
Fold-Sites(0:2:4): 0,2,4-fold degenerate sites
Substitutions: Substitutions between sequences
S-Substitutions: Synonymous substitutions
N-Substitutions: Nonsynonymous substitutions
Fold-S-Substitutions(0:2:4): Synonymous substitutions at 0,2,4-fold
Fold-N-Substitutions(0:2:4): Nonsynonymous substitutions at 0,2,4-fold
Divergence-Time: Divergence time
Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA): Ratios of six substitution rates to the substitution rate between C and A
GC(1:2:3): GC content of entire sequences and of three codon positions
ML-Score: Maximum likelihood score
AICc: Value of AICc
Akaike-Weight: Value of Akaike weight for model selection
Model: Selected model for the method of MS
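The P-Value(Fisher) field can be illustrated with a pure-Python two-sided Fisher's exact test on a 2x2 table. The table layout used here (synonymous versus nonsynonymous rows, changed versus unchanged columns) and the rounding of site counts to integers are assumptions made for illustration; this document does not state how the program builds its contingency table. The counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P-value for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                if p <= p_observed * (1.0 + 1e-9))
    return min(1.0, total)


# Hypothetical counts, rounded to integers: S = 70, N = 230, Sd = 10, Nd = 10.
S, N, Sd, Nd = 70, 230, 10, 10
p_value = fisher_exact_two_sided(Sd, S - Sd, Nd, N - Nd)
print(f"P-Value(Fisher) = {p_value:.4g}")
```

A small P-value here indicates that the proportions of changed synonymous and nonsynonymous sites differ by more than would be expected by chance.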

Acknowledgements

We thank Professor Wen-Hsiung Li for providing us with his computer program and Professor Ziheng Yang for his invaluable source code in PAML. We are grateful to Heng Li for his advice and Yafeng Hu for his help in software design. We also thank all anonymous users for reporting bugs and sending suggestions.

References

Agresti, A. 1992. A Survey of Exact Inference for Contingency Tables. Statistical Science. 7, 131-177.
Akaike, H. 1973. Information theory as an extension of the maximum likelihood principle. In Petrov, B.N. and Csaki, F. (eds), Second International Symposium on Information Theory. Akademiai Kiado, Budapest, 267-281.
Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19, 716-723.
Bierne, N. and Eyre-Walker, A. 2003. The Problem of Counting Sites in the Estimation of the Synonymous and Nonsynonymous Substitution Rates: Implications for the Correlation Between the Synonymous Substitution Rate and Codon Usage Bias. Genetics. 165, 1587-1597.
Burnham, K.P. and Anderson, D.R. 2002. Model Selection and Multimodel Inference: A Practical Information Theoretic Approach. Springer-Verlag, New York, 488.
Burnham, K.P. and Anderson, D.R. 2004. Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research. 33, 261-304.
Comeron, J.M. 1999. K-Estimator: calculation of the number of nucleotide substitutions per site and the confidence intervals. Bioinformatics. 15, 763-764.
Gillespie, J.H. 1991. The causes of molecular evolution. Oxford University Press, Oxford, England.
Goldman, N. and Yang, Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11, 725-736.
Hasegawa, M., Kishino, H. and Yano, T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160-174.
Hurst, L.D. 2002. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends in Genetics. 18, 486-487.
Jukes, T.H. and Cantor, C.R. 1969. Evolution of protein molecules. In Munro, H.N. (ed), Mammalian Protein Metabolism. Academic Press, New York, 21-123.
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, England.
Li, W.H. 1993. Unbiased estimation of the Rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36, 96-99.
Li, W.H. 1997. Molecular evolution. Sinauer Associates. Sunderland, Mass.
Li, W.H., Wu, C.I. and Luo, C.C. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2, 150-174.
Muse, S.V. 1996. Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13, 105-114.
Nei, M. and Gojobori, T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 3, 418-426.
Pamilo, P. and Bianchi, N.O. 1993. Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol. Biol. Evol. 10, 271-281.
Posada, D. 2003. Using Modeltest and PAUP* to select a model of nucleotide substitution. In Baxevanis, A.D. (ed), Current Protocols in Bioinformatics. John Wiley & Sons, New York, 6.5.1-6.5.14.
Posada, D. and Buckley, T.R. 2004. Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches over Likelihood Ratio Tests. Syst. Biol. 53, 793-808.
Sullivan, J. and Joyce, P. 2005. Model Selection in Phylogenetics. Annual Review of Ecology, Evolution, and Systematics. 36, 445-466.
Torrents, D., Suyama, M., Zdobnov, E. and Bork, P. 2003. A Genome-Wide Survey of Human Pseudogenes. Genome Res. 13, 2559-2567.
Tzeng, Y.-H., Pan, R. and Li, W.-H. 2004. Comparison of Three Methods for Estimating Rates of Synonymous and Nonsynonymous Nucleotide Substitutions. Mol. Biol. Evol. 21, 2290-2298.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS. 13, 555-556.
Yang, Z. and Nielsen, R. 2000. Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Mol Biol Evol. 17, 32-43.
Zhang, Z., Li, J. and Yu, J. 2006. Computing Ka and Ks with a consideration of unequal transitional substitutions. BMC Evolutionary Biology. 6, 44.