GEN Toolkit - Gene Expression Nebulas

A data portal of transcriptomic profiles analyzed by a unified pipeline across multiple species

GENtoolkit provides pipelines that handle both bulk and single-cell (10X Genomics, Smart-seq2, Drop-seq and inDrop) RNA-seq data. All gene/transcript expression profiles deposited in Gene Expression Nebulas are processed with GENtoolkit. GENtoolkit consists of two main parts: an upstream and a downstream analysis pipeline. The upstream analysis module includes four steps: 'index building', 'quality control', 'read alignment' and 'gene expression quantification'. The downstream analysis module includes two main steps: 'analysis of gene expression profiles' and 'visualization of analysis results'. Raw data in either 'sra' or 'fastq' format (single-end or paired-end) are supported for gene/transcript expression profiling. Users can run the analysis on all samples in a dataset or on a selected subset.

Prerequisite software and packages

1.1 Bulk RNA-seq or Single-Cell RNA-seq (Smart-seq2)

1.1.1 Index building

HISAT2 v2.0.5 (http://daehwankimlab.github.io/hisat2/)
RSEM v1.3.1 (https://github.com/deweylab/RSEM/releases/tag/v1.3.1)
STAR v2.7.1a (https://github.com/alexdobin/STAR/releases/tag/2.7.1a)

1.1.2 Quality control

Fastp v0.20.0 (https://github.com/OpenGene/fastp/releases/tag/v0.20.0)

1.1.3 Alignment and quantification

fasterq-dump (https://github.com/ncbi/sra-tools/releases/tag/2.10.9)
Fastp v0.20.0 (https://github.com/OpenGene/fastp/releases/tag/v0.20.0)
HISAT2 v2.0.5 (http://daehwankimlab.github.io/hisat2/)
SAMtools v1.9 (http://github.com/samtools/)
RSeQC v2.6.4 (http://rseqc.sourceforge.net/)
RSEM v1.3.1 (https://github.com/deweylab/RSEM/releases/tag/v1.3.1)
STAR v2.7.1a (https://github.com/alexdobin/STAR/releases/tag/2.7.1a)

1.1.4 R packages for gene expression visualization (R > 4.0.5)

GetoptLong, tidyverse, data.table, tidyr, ggpubr, DESeq2, BiocParallel, pheatmap, clusterProfiler, enrichplot, org.Hs.eg.db, topGO, KEGG.db, DO.db, DOSE, Rgraphviz, GENIE3, WGCNA, dplyr, future

1.2 Single Cell RNA-seq (10X Genomics)

1.2.1 Index building

CellRanger v3.1.0 (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/3.1/what-is-cell-ranger)

1.2.2 Alignment and quantification

fastq-dump (https://github.com/ncbi/sra-tools/releases/tag/2.10.9)
CellRanger v3.1.0 (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/3.1/what-is-cell-ranger)

1.2.3 Visualization

GetoptLong, tidyverse, data.table, tidyr, Seurat

1.3 Single Cell RNA-seq (Drop-seq or inDrop)

1.3.1 Index building

RSEM v1.3.1 (https://github.com/deweylab/RSEM/releases/tag/v1.3.1)
STAR v2.7.1a (https://github.com/alexdobin/STAR/releases/tag/2.7.1a)

1.3.2 Alignment and quantification

fasterq-dump (https://github.com/ncbi/sra-tools/releases/tag/2.10.9)
STAR v2.7.1a (https://github.com/alexdobin/STAR/releases/tag/2.7.1a)
dropEst v0.8.6 (https://github.com/kharchenkolab/dropEst/releases/tag/v0.8.6)
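Before running a pipeline, it can help to confirm that the command-line tools listed in sections 1.1 to 1.3 are available on PATH. The following is a minimal sketch and not part of GENtoolkit. The executable names (for example `rsem-calculate-expression`, `cellranger`, `dropest`) are the names these tools usually install under and are an assumption about your environment, and `missing_tools` is a hypothetical helper. RSeQC and the R packages are not covered.

```python
import shutil

# Executables assumed for each library type. The names are assumptions about
# a typical installation; adjust them to match your environment.
BULK_TOOLS = ["fasterq-dump", "fastp", "hisat2", "samtools",
              "rsem-calculate-expression", "STAR"]
DROPLET_TOOLS = ["fasterq-dump", "STAR", "dropest"]

REQUIRED_TOOLS = {
    "Bulk": BULK_TOOLS,
    "Smart-seq2": BULK_TOOLS,
    "10X": ["fastq-dump", "cellranger"],
    "Drop-seq": DROPLET_TOOLS,
    "inDrop": DROPLET_TOOLS,
}


def missing_tools(library_type, which=shutil.which):
    """Return the tools for `library_type` that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[library_type] if which(tool) is None]


if __name__ == "__main__":
    for library_type in REQUIRED_TOOLS:
        missing = missing_tools(library_type)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{library_type}: {status}")
```

The `which` argument exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.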

Download and install

2.1 Download

GENtoolkit.tar.gz

2.2 Install

tar -zvxf GENtoolkit.tar.gz

2.3 Recommended file structure

Note: The main input and output files and paths of the whole project are shown above. This directory structure is recommended.

Usage and option summary

3.1 One-Stop analysis

python GENtoolkit.py [options] ...

3.2 Upstream analysis

3.2.1 Bulk RNA-seq data

python GENtoolkit.py [options] -blt Bulk -ipp ../Oryza_sativa -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -st pair -hi ../Oryza_sativa_hisat2/genome -ri ../Oryza_sativa_RSEM/Oryza_sativa -bf reference.bed

Note: Test data comes from GEN (GEND000254 and GEND000257)
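Long command lines like the bulk example are easy to mistype, so one option is to assemble them from a mapping of flags to values. The sketch below is an illustration only: `build_command` is a hypothetical helper, the flags and paths are copied from the bulk example, and the script is assumed to be invoked as `python GENtoolkit.py` from the current directory.

```python
import shlex


def build_command(options, script="GENtoolkit.py"):
    """Turn a mapping of flags to values into an argument list."""
    args = ["python", script]
    for flag, value in options.items():
        args += [flag, str(value)]
    return args


# Flags and values taken from the bulk RNA-seq example.
bulk_options = {
    "-blt": "Bulk",
    "-ipp": "../Oryza_sativa",
    "-rgf": "reference.fasta",
    "-rgg": "reference.gtf",
    "-sp": "../STAR/bin/Linux_x86_64",
    "-rd": "../01.raw",
    "-st": "pair",
    "-hi": "../Oryza_sativa_hisat2/genome",
    "-ri": "../Oryza_sativa_RSEM/Oryza_sativa",
    "-bf": "reference.bed",
}

command = build_command(bulk_options)
# Print the command for inspection before launching anything.
print(shlex.join(command))
# To start the run: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` rather than a single shell string avoids quoting problems when paths contain spaces.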

3.2.2 scRNA-seq data (Smart-seq2)

python GENtoolkit.py [options] -blt Smart-seq2 -ipp ../Oryza_sativa -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -st pair -hi ../Oryza_sativa_hisat2/genome -ri ../Oryza_sativa_RSEM/Oryza_sativa -bf reference.bed

Note: Test data comes from GEN (GEND000305)

3.2.3 scRNA-seq data (10X Genomics)

python GENtoolkit.py [options] -blt 10X -ipp ../Saccharomyces_cerevisiae -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -ci ../Saccharomyces_cerevisiae/Saccharomyces_cerevisiae.genome

Note: Test data comes from GEN (GEND000326). If some samples fail because of limited cores, memory or other reasons, they can be re-run with "-da Designated_samples -sl sample1-sample2-sample3".
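The re-run options in the note can be generated from a list of failed sample names. This is a minimal sketch with a hypothetical helper, `rerun_options`. Rejecting names that contain a hyphen is an assumption made here because the hyphen is the list separator; the source does not state how such names are handled.

```python
def rerun_options(failed_samples):
    """Build the '-da' and '-sl' options for re-running selected samples."""
    if not failed_samples:
        raise ValueError("at least one sample name is required")
    for name in failed_samples:
        # Assumption: '-' separates samples, so a name containing '-'
        # would be ambiguous in the '-sl' value.
        if "-" in name:
            raise ValueError(f"sample name contains '-': {name!r}")
    return ["-da", "Designated_samples", "-sl", "-".join(failed_samples)]


if __name__ == "__main__":
    print(" ".join(rerun_options(["sample1", "sample2", "sample3"])))
```

The returned list can be appended to the argument list of the original command.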

3.2.4 scRNA-seq data (Drop-seq or inDrop)

python GENtoolkit.py -blt Drop-seq -ipp ../Homo_sapiens -rgf reference.fasta -rgg reference.gtf -si ../STAR/bin/Linux_x86_64 -dc reference.xml -dm ../mit_genes_human.ensembl.rds -rd ../01.raw

Note: Test data comes from GEN (GEND000158). If "-da Designated_samples" is selected, the parameter "-sl" is required.

3.3 Downstream analysis

3.3.1 Bulk RNA-seq data

python GENtoolkit.py [options] --stream down --BuildLibraryType Bulk --workpath ./workpath/ --exprData exprData.txt --metaData metaData.txt

3.3.2 scRNA-seq data (Smart-seq2)

python GENtoolkit.py [options] --stream down --BuildLibraryType Smart-seq2 --workpath ./workpath/ --exprData exprData.txt --metaData metaData.txt --refpath ./ref/

3.3.3 scRNA-seq data (10X Genomics)

python GENtoolkit.py [options] --stream down --BuildLibraryType 10X --workpath ./workpath/ --exprData ./data/ --refpath ./ref/

Note: ./data/ is a directory that includes barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz.
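For 10X data, the --exprData directory is expected to include barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz. A quick check before starting the downstream run could look like the sketch below. `missing_10x_files` is a hypothetical helper, and the default `./data/` path is taken from the example command.

```python
import sys
from pathlib import Path

# File names listed in the 10X downstream example.
EXPECTED_10X_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_10x_files(expr_dir):
    """Return the expected 10X matrix files that are absent from expr_dir."""
    directory = Path(expr_dir)
    return [name for name in EXPECTED_10X_FILES
            if not (directory / name).is_file()]


if __name__ == "__main__":
    expr_dir = sys.argv[1] if len(sys.argv) > 1 else "./data/"
    missing = missing_10x_files(expr_dir)
    if missing:
        print(f"{expr_dir} is missing: {', '.join(missing)}")
    else:
        print(f"{expr_dir} contains all expected 10X files")
```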

Options (GENtoolkit.py)

Upstream parameter

Long parameter	Short parameter	Description
--stream	-stream	Pipeline type, [ up \| down \| all ]
--IndexBuild	-ib	Whether to build the index. Once a reference genome index has been built, it can be reused for any gene expression analysis of the same species. [ index_build \| index_exist ] Default: index_build
--BuildLibraryType	-blt	Library building type, [ Bulk \| 10X \| Smart-seq2 \| inDrop \| Drop-seq ]
--IndexProjectPath	-ipp	The absolute path of the index-building project (Species name).
--Hisat2ThreadNum	-htn	The thread number for hisat2 index-building. Default: 30
--RSEMThreadNum	-rtn	The thread number for RSEM index-building. Default: 30
--ReferenceGenomeFasta	-rgf	The fasta file of reference genome.
--ReferenceGenomeGtf	-rgg	The gtf file of reference genome.
--StarPath	-sp	The absolute path of the software STAR.
--DesignatedAll	-da	Whether to run only designated samples or all samples. [ Designated_samples \| All_samples ] Default: All_samples
--SampleList	-sl	The list of designated samples. [ Sample1-Sample2-Sample3 ] Note: Sample names must be separated by hyphens.
--SequencingType	-st	[ pair \| single ]
--ReadType	-rt	[ sra \| fastq ] Default: fastq
--RawData	-rd	The absolute path of raw data
--Hisat2Index	-hi	The Hisat2 reference genome index, for example, ../Homo_Sapiens_hisat2/genome
--RSEMIndex	-ri	The RSEM reference genome index, for example, ../Homo_Sapiens_RSEM/Homo_Sapiens
--BedFile	-bf	The Browser Extensible Data file (.bed file).
--FasterqDumpThread	-fdt	The thread number for fasterq-dump. Default: 12
--Fastp_q	-fq	The parameter "-q" in fastp. Default: 20
--Fastp_u	-fu	The parameter "-u" in fastp. Default: 20
--Fastp_l	-fl	The parameter "-l" in fastp. Default: 50
--Fastp_W	-fW	The parameter "-W" in fastp. Default: 4
--Fastp_M	-fM	The parameter "-M" in fastp. Default: 12
--Fastp_w	-fw	The parameter "-w" in fastp. Default: 12
--Hisat2_p	-hp	The thread number for hisat2. Default: 12
--SamtoolsThread	-std	The thread number for samtools. Default: 12
--RSEMThread	-rtd	The thread number for RSEM. Default: 12
--CellrangerIndex	-ci	The absolute path of cellranger index, for example, ../Homo_Sapiens/Homo_Sapiens.genome
--CellrangerLocalCores	-clc	Caution! The default localcores may not always be appropriate; adjust it according to https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count. Default: 12
--CellrangerLocalMem	-clm	Caution! The default localmem may not always be appropriate; adjust it according to https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count. Default: 64
--DropTag_p	-dp	The thread number of dropTag process. Default: 12
--StarRunThreadN	-srtn	The thread number of STAR alignment. Default: 12
--DropEst_c	-dc	The configuration file (.xml file) for the dropEst process.
--DropReport_m	-dm	The rds file of reference organelle genes.
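To make the table concrete, the sketch below declares a handful of the upstream options with Python's standard `argparse` module, using the choices and defaults listed above, and enforces the rule that "-sl" must accompany "-da Designated_samples". It is an illustration only: how GENtoolkit.py actually parses its arguments is not shown in this document, and `build_parser`, `parse_upstream` and the `samples` attribute are hypothetical names.

```python
import argparse


def build_parser():
    """Declare a few upstream options with the documented choices and defaults."""
    parser = argparse.ArgumentParser(prog="GENtoolkit.py")
    parser.add_argument("--BuildLibraryType", "-blt",
                        choices=["Bulk", "10X", "Smart-seq2", "inDrop", "Drop-seq"])
    parser.add_argument("--DesignatedAll", "-da",
                        choices=["Designated_samples", "All_samples"],
                        default="All_samples")
    parser.add_argument("--SampleList", "-sl", default=None)
    parser.add_argument("--SequencingType", "-st", choices=["pair", "single"])
    parser.add_argument("--ReadType", "-rt", choices=["sra", "fastq"],
                        default="fastq")
    return parser


def parse_upstream(argv):
    """Parse argv and apply the '-da' / '-sl' consistency rule."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.DesignatedAll == "Designated_samples" and not args.SampleList:
        # parser.error prints a usage message and exits with status 2.
        parser.error("-sl is required when -da Designated_samples is selected")
    # Hypothetical convenience attribute: the hyphen-separated list, split.
    args.samples = args.SampleList.split("-") if args.SampleList else []
    return args


if __name__ == "__main__":
    parsed = parse_upstream(["-blt", "10X", "-da", "Designated_samples",
                             "-sl", "sample1-sample2"])
    print(parsed.BuildLibraryType, parsed.ReadType, parsed.samples)
```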

Downstream parameter

Long parameter	Short parameter	Description
--workpath	-wp	The absolute path of the working directory.
--exprData	-exp	Expression matrix file. If stream is "all", the upstream results are used.
--metaData	-md	Meta information file.
--refpath	-rf	Reference file for single cell data.
--geneList	-gl	Gene list file.
--thread	-td	Number of threads for downstream analysis. Default: 10
--picType	-pic	Picture file format, [ svg \| pdf \| png ]. Default: svg
--rowSums_count_cutoff	-rc	Low-count filter. Genes whose summed count across samples is below the cutoff are removed. Default: 2
--pValue_cutoff	-pc	The p value cutoff in differential analysis. Default: 0.05
--padj_cutoff	-qc	The q value cutoff in differential analysis. Default: 0.1
--log2FoldChange_Up	-F	Log2(FoldChange) cutoff for up-regulation. Default: 1
--log2FoldChange_Down	-f	Log2(FoldChange) cutoff for down-regulation. Default: -1
--topnum	-tn	Number of top genes to report. Default: 20
--module	-module	Select module to export from colorlevels in WGCNA. Default: turquoise
--outputType	-ot	Select export format, [ VisANT \| cytoscape ]. Default: cytoscape
--nTop	-ntop	Export network cutoff: the number of the top gene to be filtered and exported in selected module. Default: 10000
--NetworkThreshold	-nt	Export Network cutoff: weight > NetworkThreshold. Default: 0.02
--GeneSign	-gs	Hub gene cutoff: significance of single gene-trait correlation > GeneSign. Default: 0.1
--absdatKME	-KME	Hub gene cutoff: eigengene connectivity (kME) value > absdatKME. Default: 0.1
--qWeight	-qw	Weight cutoff. Default: 0.9
--GOAnalysis	-GO	GO enrichment analysis, [ yes \| no ]. Default: yes
--KEGGAnalysis	-KEGG	KEGG enrichment analysis, [ yes \| no ]. Default: yes
--DOAnalysis	-DO	DO enrichment analysis, [ yes \| no ]. Default: yes
--ontology	-ontology	GO ontology for GO enrichment analysis, [ ALL \| BP \| MF \| CC ]. Default: ALL
--pValue_cutoff_enrich	-pen	The p value cutoff for enrichment analysis. Default: 0.05
--padj_cutoff_enrich	-qen	The q value cutoff for enrichment analysis. Default: 0.1
--Biological_Condition	-bc	Choose one case or control, consistent with the meta table. Default: Case1
--TreeMethod	-tm	Tree method, [ RF (Random Forests) \| ET (Extra-Trees) ]. Default: RF
--nTrees	-ntrees	Number of trees in an ensemble for each target gene. Default: 50
--min_cells	-	The minimum number of cells. Default: 5
--max_cells	-	The maximum number of cells. Default: 10000
--min_genes	-	The minimum number of genes. Default: 200
--max_genes	-	The maximum number of genes. Default: 4000
--max_mito	-	The maximum mitochondrial proportion. Default: 0.05
--PCnumber	-	The number of top PCs used to find neighbors. Default: 15
--min_Resolution	-	The minimum resolution for the preliminary cluster search. Default: 0.6
--max_Resolution	-	The maximum resolution for the preliminary cluster search. Default: 1.2
--intervals_Resolution	-	The resolution interval for the preliminary cluster search. Default: 0.2
--Resolution	-	The resolution for finding clusters. Default: 1
--method	-	[ tsne \| umap ]. Default: tsne
--whichCluster	-	Choose one cluster for marker finding and display on the t-SNE plot; cluster labels start at 0. Default: 1
--ref	-	Choose a reference, [ 1-6 ]: 1 HumanPrimaryCellAtlasData, 2 BlueprintEncodeData, 3 MonacoImmuneData, 4 NovershternHematopoieticData, 5 DatabaseImmuneCellExpressionData, 6 HumanPrimaryCellAtlasData and BlueprintEncodeData. Default: 1
--annolevel	-	[ main \| fine ]. Default: main
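The cutoff options above interact in a simple way, which the following sketch illustrates with the documented defaults for --rowSums_count_cutoff, --padj_cutoff, --log2FoldChange_Up and --log2FoldChange_Down. It is not the GENtoolkit implementation: the function names are hypothetical, the toy counts are invented for illustration, and whether the real pipeline uses strict or inclusive comparisons at the fold-change boundaries is an assumption.

```python
def filter_low_counts(counts, cutoff=2):
    """Drop genes whose summed count across samples is below the cutoff."""
    return {gene: row for gene, row in counts.items() if sum(row) >= cutoff}


def classify_gene(log2fc, padj, up=1.0, down=-1.0, padj_cutoff=0.1):
    """Label a gene as 'up', 'down' or 'not_significant'.

    Assumption: inclusive fold-change bounds and a strict padj bound.
    """
    if padj is None or padj >= padj_cutoff:
        return "not_significant"
    if log2fc >= up:
        return "up"
    if log2fc <= down:
        return "down"
    return "not_significant"


if __name__ == "__main__":
    # Toy counts (gene -> counts per sample), invented for illustration.
    toy_counts = {"geneA": [0, 1, 0], "geneB": [3, 0, 2], "geneC": [1, 1, 0]}
    print(sorted(filter_low_counts(toy_counts)))
    print(classify_gene(1.5, 0.01), classify_gene(-2.0, 0.05),
          classify_gene(2.0, 0.5))
```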
