Installation

GAAP is in Python scripts (Python version of 2.7 or above is required to run the program), so it is unnecessary to compile. However, extra programs Bowtie2 and BLAT are required to run GAAP. Bowtie2 is available from http://bowtie-bio.sourceforge.net/bowtie2/index.shtml. BLAT is available from ftp://ftp.ncbi.nih.gov/blast/executables/release/.

Before start, put Bowtie2 and BLAT to your $PATH: 'export PATH=path-to-bowtie2:$PATH' and 'export PATH=path-to-blat:$PATH'.

Besides, PGAP is recommended to produce gene cluster file, which is needed to run cGOF identification.

Manual

Inputs

  • Paired-end reads in FASTA or FASTQ format.
  • Contigs or scaffolds assembled by de novo DNA-Seq assembler (SOAPdenovo, Velvet, ABySS, ALLPATHS-LG, etc.).
  • Gene cluster files generated from genomes of a closely related species (no less than ten genomes are recommended), including ortholog cluster file and corresponding ptt files and nucleotide sequences from NCBI ftp. See "examples" for detail.

Using GAAP

  • cGOF identification

    python cgof_identification.py [-h] [-s SEGMENT_LENGTH] [-o OUTPUT_DIR] NAME CLUSTER_FILE PTT_DIR REFERENCE_FILE

    Inputs:

    • NAME is the prefix of output files.
    • CLUSTER_FILE is the identified ortholog clusters among multiple genomes, '1.Orthologs_Cluster.txt' generated by using PGAP is recommended.
    • PTT_DIR is the directory of ptt files. The file names must be unique and consistent with those in the first row of CLUSTER_FILE.
    • REFERENCE_FILE is nucleotide sequences of orthologs. The gene ID could be named as 15925706, EC002, orf_077, etc, but must by uniqe and contains those in the second column of CLUSTER_FILE.

    Outputs:

    • NAME.output.fasta contains the cGOF genes of segments.
    • NAME.output.freq contains the permutation, i.e. order and orientation of cGOF segments in reference genomes.
    • -o is the existing directory for output files, by default in current directory.

    Options:

    • -s is the minimum count of cGOF genes allowed in one segment (default -s 2).
    • -h is usage of cgof_identification.py (optional).
  • Scaffolding

    python scaffolding.py [-h] [-r R] [-m MAX_INSERT] [-n MIN_INSERT] [-s SCAFFOLD_SIZE] [-o BLAT_OUTPUT] [-c] READS_1 READS_2 SCAFFILDS_FASTA GOF_FASTA SEGMENT_FREQUENCY OUTPUT_DIR

    Inputs:

    • READS_1 is the the first read of PE data in fasta/fastq.
    • READS_2 is the the second read of PE data in fasta/fastq.
    • SCAFFILDS_FASTA is the target contigs/scaffolds.Space and "-" are NOT allowed in the ">" lines.
    • GOF_FASTA is the cGOF genes of segments, i.e. NAME.output.fasta from cgof_identification.py.
    • SEGMENT_FREQUENCY is order and orientation of segments in references, i.e. NAME.output.freq from cgof_identification.py.

    Outputs:

    • -o is the prefix of outputs (default -o BLAT_OUTPUT).
    • BLAT_OUTPUT.genome.fasta contains a pseudo genome sequence.
    • BLAT_OUTPUT.segmented.fasta contains ordered and oriented scaffolds and contigs.
    • BLAT_OUTPUT.unused.fasta contains scaffolds and contigs that were unused in scaffolding.
    • BLAT_OUTPUT.err lists names of conflicting contigs/scaffolds which were removed from scaffolding.

    Options:

    • -s is the minimum length of cGOF genes for alignment (default -s 300).
    • -m is the maximum insert length (default -m 600).
    • -n is the minimum insert length (default -n 400).
    • -r indicates whether to remove reads duplication (-r R) or not (skip the option). "R" is an integer, and one PE is remained if the first R bases of more than one PE reads are identical. (defat skip the option).
    • -h is the usage of scaffolding.py (optional).

Example

Here we provide two examples, S.aureus and S.suis.

  • For S.aureus, we run command lines as follows.

    python cgof_identification.py -s 10 -o output/ sau sau_gene.cluster ptt_dir/ NC_002745.nuc

    In this step, GAAP takes files as follows as input, and outputs sau.output.fasta and sau.output.freq, and the minimum segment length is set to 10 cGOF genes.

    CLUSTER_FILE (sau_gene.cluster)

    PTT_DIR (ptt_dir/)

    REFERENCE_FILE (NC_002745.nuc)

    NAME.output.fasta (sau.output.fasta)."S"indicates segment.

    NAME.output.freq (sau.output.freq)

    python scaffolding.py -m 550 -n 450 -s 300 -o sau -c sau_1.fa sau_2.fa sau_scafseq.fa output/sau.output.fasta output/sau.output.freq output/

    In this step, GAAP takes paired-end reads (sau_1.fa and sau_2.fa, and the insert length is 450~550bp), target scaffolds (sau.scafseq, and the minimum length is set to 300bp), sau.output.fasta and sau.output.freq as input, and outputs sau.genome.fasta, sau.segmented.fasta and sau.unsed.fasta in output/ directory.

  • For S.ssuis, we run command lines as follows.

    python cgof_identification.py -o output/ ssuis ssuis_gene.cluster ptt_dir/ NC_009442.nuc

    In this step, GAAP takes CLUSTER_FILE (ssuis_gene.cluster), PTT_DIR (ptt_dir/) and REFERENCE_FILE (NC_009442.nuc) as input, and outputs ssuis.output.fasta and ssuis.output.freq.

    NAME.output.freq (ssuis.output.freq)."S"indicates segment,and "rc" indicates reverse complement.

    python scaffolding.py -r 50 -m 550 -n 450 -s 300 -o ssuis -c ssuis_1.fa ssuis_2.fa ssuis_scafseq.fa output/ssuis.output.fasta output/ssuis.output.freq output/

    In this step, GAAP takes paired-end reads (ssuis_1.fa and ssuis_2.fa), target scaffolds (ssuis.scafseq), ssuis.output.fasta and ssuis.output.freq as input, and outputs ssuis.genome.fasta, ssuis.segmented.fasta and ssuis.unused.fasta in output/ directory. The insert length of reads is 450~550bp (-m 550 -n 450), the minimum length of scaffolds is set to 300bp (-s 300), and one read pairs are remained in the input reads data if the first 50bp are identical in more than one pairs (-r 50).