Gene Expression Nebulas
A data portal of transcriptomic profiles analyzed by a unified pipeline across multiple species

Gene Expression Nebulas

A data portal of transcriptome profiles across multiple species

Gene Expression Nebulas Toolkit (GENtoolkit)

GENtoolkit provides powerful pipelines which can handle both bulk and single-cell (10X Genomics, Smart-seq2, Drop-seq and inDrop) RNA-seq data. All gene/transcript expression profiles deposited in Gene Expression Nebulas are processed based on GENtoolkit. GENtoolkit is composed of two main parts which correspond to upstream and downstream analysis pipelines respectively. Specifically, upstream analysis module includes 4 steps, 'index building', 'quality control', 'read alignment', 'gene expression quantification', while downstream analysis module includes 2 main steps, 'analysis of gene expression profiles' and 'visualization of analysis results'. Raw data in the format of 'sra' or 'fastq' (single-end or paired-end) are both supported for further gene/transcript expression profiling. According to the needs of users, it is accessible to perform gene expression analysis in all or part samples from a dataset.

Prerequisite software and packages

Download and install

2.1 Download

2.2 Install

tar -zvxf GENtoolkit.tar.gz

2.3 Recommended file structure

Note: The main input and output files and paths of the whole project are shown above. And this directory structure is recommended.

Usage and option summary

3.1 One-Stop analysis

python GENtoolkit.py [options] ...

3.2 Upstream analysis

3.2.1 Bulk RNA-seq data
python GENtoolkit.py [options] -blt Bulk -ipp ../Oryza_sativa -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -st pair -hi ../Oryza_sativa_hisat2/genome -ri ../Oryza_sativa_RSEM/Oryza_sativa -bf reference.bed

Note: Test data comes from GEN (GEND000254 and GEND000257)

3.2.2 scRNA-seq data (Smart-seq2)
python GENtoolkit.py [options] -blt Smart-seq2 -ipp ../Oryza_sativa -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -st pair -hi ../Oryza_sativa_hisat2/genome -ri ../Oryza_sativa_RSEM/Oryza_sativa -bf reference.bed

Note: Test data comes from GEN (GEND000305)

3.2.3 scRNA-seq data(10X Genomics)
python GENtoolkit.py [options] -blt 10X -ipp ../Saccharomyces_cerevisiae -rgf reference.fasta -rgg reference.gtf -sp ../STAR/bin/Linux_x86_64 -rd ../01.raw -ci ../Saccharomyces_cerevisiae/Saccharomyces_cerevisiae.genome

Note: Test data comes from GEN (GEND000326). If some samples run failed because of the limitation of cores, or mem, or something else, "-da Designated_samples -sl sample1-sample2-sample3" can help re-run.

scRNA-seq data (Drop-seq or inDrop)
python GENtoolkit.py -blt Drop-seq -ipp ../Homo_sapiens -rgf reference.fasta -rgg reference.gtf -si ../STAR/bin/Linux_x86_64 -dc reference.xml -dm ../mit_genes_human.ensembl.rds -rd ../01.raw

Note: Test data comes from GEN (GEND000158). If "-da Designated_samples" selected, the parameter "-sl []" is necessary!

3.3 Downstream analysis

3.3.1 Bulk RNA-seq data
python GENtoolkit.py [options] --stream down --BuildLibraryType Bulk --workpath ./workpath/ --exprData exprData.txt --metaData metaData.txt
3.3.2 scRNA-seq data (Smart-seq2)
python GENtoolkit.py [options] --stream down --BuildLibraryType Smart-seq2 --workpath ./workpath/ --exprData exprData.txt --metaData metaData.txt -- refpath ./ref/
3.3.3 scRNA-seq data(10X Genomics)
python GENtoolkit.py [options] --stream down --BuildLibraryType 10X --workpath ./workpath/ --exprData ./data/ (including barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz) -- refpath ./ref/

Options (GENtoolkit.py)

Upstream parameter

Long parameter Short parameter Description
--stream -stream Pipeline type, [ up | down | all ]
--IndexBuild -ib Once a reference genome is built, it can be used many times in a species gene expression analysis, which depends on whether building index or not. [ index_build | index_exist ] Default: index_build
--BuildLibraryType -blt Library building type, [ Bulk | 10X | Smart-seq2 | inDrop | Drop-seq ]
--IndexProjectPath -ipp The absolute path of the index-building project (Species name).
--Hisat2ThreadNum -htn The thread number for hisat2 index-building. Default: 30
--RSEMThreadNum -rtn The thread number for RSEM index-building. Default: 30
--ReferenceGenomeFasta -rgf The fasta file of reference genome.
--ReferenceGenomeGtf -rgg The gtf file of reference genome.
--StarPath -sp The absolute path of the software STAR.
--DesignatedAll -da The sample list you want to run at once. [ Designated_samples | All_samples ] Default: All_samples
--SampleList -sl The list samples you designated. [ Sample1-Sample2-Sample3 ] Note: The short dashes are needed.
--SequencingType -st [ pair | single ]
--ReadType -rt [ sra | fastq ] Default: fastq
--RawData -rd The absolute path of raw data
--Hisat2Index -hi The Hisat2 reference genome index, for example, ../Homo_Sapiens_hisat2/genome
--RSEMIndex -ri The RSEM reference genome index, for example, ../Homo_Sapiens_RSEM/Homo_Spaiens
--BedFile -bf The browse extensive data file (.bed file).
--FasterqDumpThread -fdt The thread number for fasterq-dump. Default: 12
--Fastp_q -fq The parameter "-q" in fastp. Default: 20
--Fastp_u -fu The parameter "-u" in fastp. Default: 20
--Fastp_l -fl The parameter "-l" in fastp. Default: 50
--Fastp_W -fW The parameter "-W" in fastp. Default: 4
--Fastp_M -fM The parameter "-M" in fastp. Default: 12
--Fastp_w -fw The parameter "-w" in fastp. Default: 12
--Hisat2_p -hp The thread number for hisat2. Default: 12
--SamtoolsThread -std The thread number for samtools. Default: 12
--RSEMThread -rtd The thread number for RSEM. Default: 12
--CellrangerIndex -ci The absolute path of cellranger index, for example, ../Homo_Sapiens/Homo_Sapiens.genome
--CellrangerLocalCores -clc Caution! The default localcores may not be appropriate all the time, you can adjust the localcores according to https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count. Default: 12
--CellrangerLocalMem -clm Caution! The default localmem may not be appropriate all the time, you can adjust the localmem according to https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count. Default: 64
--DropTag_p -dp The thread number of dropTag process. Default: 12
--StarRunThreadN -srtn The thread number of STAR alignment. Default: 12
--DropEst_c -dc The configure file (.xml file) of DropEst process.
--DropReport_m -dm The reference organelle gene's rds file.

Downstream parameter

Long parameter Short parameter Description
--workpath -wp The absolute path of work path.
--exprData -exp Expression matrix file, if stream is "all", it would use upstream results.
--metaData -md Meta information file.
--refpath -rf Reference file for single cell data.
--geneList -gl Gene list file.
--thread -td Number of thread for downstream. Default: 10
--picType -pic Picture file format, [ svg | pdf | png ]. Default: svg
--rowSums_count_cutoff -rc Low count filter. The gene under rowSums count cutoff in samples would be deleted. Default: 2
--pValue_cutoff -pc The p value cutoff in diferent analysis. Default: 0.05
--padj_cutoff -qc The q value cutoff in diferent analysis. Default: 0.1
--log2FoldChange_Up -F Log2(FoldChange) in up regulation cutoff. Default: 1
--log2FoldChange_Down -f Log2(FoldChange) in down regulation cutoff. Default: -1
--topnum -tn Top number gene cutoff. Using how many top gene as result. Default: 20
--module -module Select module to export from colorlevels in WGCNA. Default: turquoise
--outputType -ot Select export format, [ VisANT | cytoscape ]. Default: cytoscape
--nTop -ntop Export network cutoff: the number of the top gene to be filtered and exported in selected module. Default: 10000
--NetworkThreshold -nt Export Network cutoff: weight > NetworkThreshold. Default: 0.02
--GeneSign -gs Hub gene cutoff: significance of single gene-trait correlation > GeneSign. Default:0.1
--absdatKME -KME Hub gene cutoff: eigengene connectivity (kME) value > absdatKME Default:0.1
--qWeight -qw Weight cutoff. Default:0.9
--GOAnalysis -GO GO enrichment analysis, [ yes | no ]. Default: yes
--KEGGAnalysis -KEGG KEGG enrichment analysis, [ yes | no ]. Default: yes
--DOAnalysis -DO DO enrichment analysis, [ yes | no ]. Default: yes
--ontology -ontology GO Ontology of GO enrichment analysis, [ ALL | BP | MF | CC ]. Default: ALL
--pValue_cutoff_enrich -pen Enrichment analysis cutoff to p value. Default:0.05
--padj_cutoff_enrich -qen Enrichment analysis cutoff to q value. Default:0.1
--Biological_Condition -bc Choose one case or control, consistent with meta table. Default:Case1
--TreeMethod -tm TreeMethod, default, [RF (Random Forests) | ET (Extra-Trees)]. Default: RF
--nTrees -ntrees Number of trees in an ensemble for each target gene. Default:50
--min_cells - The minimum number of cells. Default:5
--max_cells - The maximum number of cells. Default:10000
--min_genes - The minimum number of genes. Default:200
--max_genes - The maximum number of genes. Default:4000
--max_mito - The maximum number of mito. Default:0.05
--PCnumber - The top PC number for find Neighbor. Default:15
--min_Resolution - The minimum Resolution for pre-find cluster. Default:0.6
--max_Resolution - The maximum Resolution for pre-find cluster. Default:1.2
--intervals_Resolution - The Resolution for pre-find cluster. Default:0.2
--Resolution - The Resolution for find cluster. Default:1
--method - [ tsne | umap ]. Default:tsne
--whichCluster - Choose one cluster for find marker and show on TSNE plot, the first cluster label as 0 . Default:1
--ref - Choose a reference, default [ 1-6 ], 1 HumanPrimaryCellAtlasData, 2 BlueprintEncodeData, 3 MonacoImmuneData, 4 NovershternHematopoieticData, 5 DatabaseImmuneCellExpressionData, 6 HumanPrimaryCellAtlasData and BlueprintEncodeData. Default:1
--annolevel - [ main | fine ]. Default:main