Gene Expression Nebulas (GEN) is a data portal of gene expression profiles under various conditions,
derived entirely from RNA-Seq data analysis in multiple species. It aims to serve the broad
research community exploring functional genomics. As an initial step, GEN-1.0 provides a
comprehensive transcriptomic and post-transcriptomic landscape of human disease through
ontology-based systematic integration of RNA sequencing data acquired from GEO, SRA, EBI and GSA.
GEN-1.0 offers user-friendly interfaces to access, visualize and further explore the curated gene
expression data through the Browse, Search, Download and JBrowse functions and a suite of optional
tools.
RNA-seq data processing pipelines include raw data preprocessing (quality control), read alignment,
gene/transcript expression quantification (for both bulk and single cell RNA-seq data), and cell
clustering and cell annotation (specific to single cell RNA-seq data).
Bulk RNA-seq data analysis – preprocessing and gene/transcript expression quantification
First, low-quality RNA-seq reads are filtered out using fastp v0.20.0 (Chen et al, 2018), and the
strandedness of each RNA-seq library is inferred by RSeQC v2.6.4 (Wang et al, 2012). The
high-quality reads are then mapped to the Ensembl GRCh38 reference genome by STAR v2.7.1a
(Dobin et al, 2013). After read alignment, gene/isoform assembly and quantification are performed
using RSEM v1.3.1 (Li & Dewey, 2011) with default parameters. For basic expression profiling,
RawCount, RPKM and TPM are all calculated.
Citations:
Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018,34(17):i884-i890. PMID: 30423086
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012,28(16):2184-2185. PMID: 22743226
Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013,29(1):15-21. PMID: 23104886
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011,12:323. PMID: 21816040
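The RPKM and TPM units reported alongside raw counts can be illustrated with a minimal Python sketch. It applies the textbook definitions to plain read counts and transcript lengths. RSEM itself works with expected counts and effective lengths estimated by its own statistical model, so the function names and toy numbers below are illustrative assumptions and not the GEN pipeline.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    Textbook definition (assumption: plain counts, full transcript lengths).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no mapped reads")
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    if scale == 0:
        raise ValueError("no mapped reads")
    return [r * 1e6 / scale for r in rates]


# Toy example (hypothetical numbers): three genes with equal per-base coverage.
example_counts = [100, 200, 300]
example_lengths = [1000, 2000, 3000]
example_tpm = tpm(example_counts, example_lengths)
example_rpkm = rpkm(example_counts, example_lengths)
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than RPKM, whose sum depends on the length distribution of the expressed transcripts.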
Bulk RNA-seq data analysis – identification of RNA editing sites and quantification of RNA editing level
Identification of RNA editing sites and quantification of RNA editing levels are mainly performed
with REDItoolDenovo.py from REDItools 2.0 (https://github.com/BioinfoUNIBA/REDItools2). First, all
candidate RNA editing sites are identified and quantified based on the read coverage and variation
frequency at each site, using the parallel strategy of REDItools 2.0. Second, candidate sites
located in Alu and non-Alu regions are filtered with different parameter configurations and then
annotated using a variety of annotation files, such as gene annotation, RepeatMasker, SNP and known
RNA editing information. Third, additional filtering criteria are applied to obtain more accurate
novel editing sites in non-Alu regions, because non-Alu regions usually have a narrow range of
editing sites. To do this, Pblat (Wang and Kong 2019) is used to detect mismatched and
multi-mapping reads, while Samtools (Li et al. 2009) is used to remove duplicate reads. Finally,
RNA editing sites are tagged as novel or known. The current version includes both A-to-I and
C-to-U editing types.
Source of annotation files: (1) Gene annotation file: GENCODE V33; (2) RepeatMasker annotation
file: UCSC; (3) SNP annotation files: UCSC; (4) Known RNA editing sites: REDIportal database.
Genomic coordinates of the RepeatMasker file and the known RNA editing sites file are converted from
hg19 to hg38 using UCSC liftOver.
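The first step above, quantifying a site from its read coverage and variation frequency, can be sketched in Python as follows. This is not the REDItools 2.0 implementation: the function names, the input format (a dictionary of base counts per site, oriented to the transcribed strand) and the `min_coverage` and `min_level` cutoffs are hypothetical and chosen only for illustration.

```python
# In sequencing reads, inosine is read as G and uridine as T (cDNA),
# so A-to-I appears as A>G and C-to-U appears as C>T.
EDIT_TYPES = {"A": "G", "C": "T"}


def editing_level(ref_base, base_counts):
    """Fraction of reads at a site that carry the edited base.

    Returns None when the reference base cannot be edited (A-to-I / C-to-U
    only) or when no read covers the site.
    """
    edited_base = EDIT_TYPES.get(ref_base)
    coverage = sum(base_counts.values())
    if edited_base is None or coverage == 0:
        return None
    return base_counts.get(edited_base, 0) / coverage


def is_candidate(ref_base, base_counts, min_coverage=10, min_level=0.1):
    """Keep a site only if coverage and editing level pass the cutoffs.

    Both cutoffs are hypothetical defaults, not GEN's parameters.
    """
    level = editing_level(ref_base, base_counts)
    coverage = sum(base_counts.values())
    return level is not None and coverage >= min_coverage and level >= min_level


# Toy site: reference A, 30 reads support A and 10 support G.
site_level = editing_level("A", {"A": 30, "G": 10})
```

In this toy case a quarter of the reads support the edited base, so the site passes the default cutoffs. A real pipeline would additionally apply the annotation-based filters described above.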
Single cell RNA-seq data analysis – read alignment and generation of count matrix
The software used for read alignment depends on the type of data being analyzed. For droplet-based single cell RNA-seq data, there is no quality control step in the upstream analysis. For protocols such as Smart-Seq, Smart-Seq2 and Fluidigm C1, the processing pipeline is the same as for bulk RNA-seq analysis, including fastp, RSeQC and RSEM, except that the special parameter ‘--single-cell-prior’ is added in the RSEM step so that Dirichlet(0.1) is used as the prior to calculate posterior mean estimates and credibility intervals. ‘CellRanger’ is used for data generated on the 10X Genomics platform. The matrix counting step is slightly more complicated for droplet-based single-cell data because the tools need to keep track of where each read came from (which cell and which transcript, if UMIs were used). Generating the matrix of read counts per gene is part of the alignment step; rows usually correspond to genes and columns to cells.
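The bookkeeping described above (which cell and which transcript each read came from) can be illustrated with a minimal Python sketch that builds a genes-by-cells matrix from aligned reads. It is a simplification and not how CellRanger works internally: the `build_count_matrix` function and the `(cell_barcode, umi, gene)` record format are assumptions, and barcode or UMI error correction is left out.

```python
def build_count_matrix(records):
    """Build a genes x cells UMI count matrix.

    records: iterable of (cell_barcode, umi, gene) tuples, one per aligned read.
    Reads sharing the same barcode, UMI and gene are counted once, since they
    are assumed to be PCR copies of a single molecule.
    """
    unique_molecules = set(records)
    genes = sorted({gene for _, _, gene in unique_molecules})
    cells = sorted({cell for cell, _, _ in unique_molecules})
    gene_index = {gene: i for i, gene in enumerate(genes)}
    cell_index = {cell: j for j, cell in enumerate(cells)}

    # Rows correspond to genes, columns to cells.
    matrix = [[0] * len(cells) for _ in genes]
    for cell, _umi, gene in unique_molecules:
        matrix[gene_index[gene]][cell_index[cell]] += 1
    return genes, cells, matrix


# Toy reads (hypothetical barcodes and UMIs); the first two are duplicates.
reads = [
    ("AAAC", "U1", "GAPDH"),
    ("AAAC", "U1", "GAPDH"),
    ("AAAC", "U2", "GAPDH"),
    ("TTTG", "U1", "ACTB"),
]
genes, cells, matrix = build_count_matrix(reads)
```

For this toy input, GAPDH is counted twice in cell AAAC (two distinct UMIs) and ACTB once in cell TTTG.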
Metadata of all included RNA-seq datasets are manually curated at two levels (‘Project’ and ‘Sample’) based on the structured curation models listed below.
Curation model on the ‘Project’ level
Items | Curation Model |
Data Resource | GEO, SRA, GSA |
Project ID | Associated BioProject ID |
Project Name | Title in GEO |
Summary | Brief description of the project scheme |
Overall Design | Experiment design, mainly including sample grouping |
PMID | PubMed ID |
Release Date | Release date of project data in data resource |
Submission Date | Submission date of project data in original data resource |
Update Date | Update date of project data in data resource |
Species | Homo sapiens |
Tissue | Brain, Liver, Skin, Kidney, etc. |
Sample Number | Number of samples included in the project |
Reference Genome | Reference genome version, e.g. GRCh38 v99 (including ERCC if needed) |
Genome Annotation | Genome annotation version, e.g. GRCh38 v99 (including ERCC if needed) |
Data Processing | fastp v0.20.0; fastp -q 20 -u 40 -l 50 -g -x -r -W 4 -M 20
RSeQC v2.6.4; python infer_experiment.py -r $index -I $Projectid/$SRXid/_Alignment-unsorted.bam > $Projectid/$SRXid/_RseQC.txt
STAR v2.7.1a, RSEM v1.3.1; rsem-calculate-expression --single-cell-prior --keep-intermediate-files --temporary-folder $Projectid/$SRXid/_STAR_rsem --paired-end --forward-prob=$value -p $thread --time --star --star-path /software/biosoft/software/STAR-2.7/STAR/bin/Linux_x86_64/ --append-names --output-genome-bam --sort-bam-by-coordinate --star-gzipped-read-file $Projectid/$SRXid/_clean_1.fastq.gz $Projectid/$SRXid/_clean_2.fastq.gz $index $Projectid/$SRXid/_star_RSEM |
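The `--forward-prob=$value` argument in the RSEM command above depends on the strandedness inferred by RSeQC. One way that mapping could look is sketched below in Python. RSEM documents 1, 0 and 0.5 as the values for forward-stranded, reverse-stranded and non-strand-specific libraries. The function name, its two input fractions (the share of reads explained by each orientation, as reported by `infer_experiment.py`) and the 0.8 cutoff are assumptions made for illustration and are not the values GEN uses.

```python
def forward_prob(frac_forward, frac_reverse, threshold=0.8):
    """Choose an RSEM --forward-prob value from strandedness fractions.

    frac_forward: fraction of reads whose upstream mate matches the gene strand
    frac_reverse: fraction of reads whose upstream mate is on the opposite strand
    threshold:    hypothetical cutoff for calling a library strand-specific
    """
    if frac_forward >= threshold:
        return 1.0   # forward-stranded library
    if frac_reverse >= threshold:
        return 0.0   # reverse-stranded library
    return 0.5       # treated as non-strand-specific


# Toy fractions (hypothetical) for a reverse-stranded library.
value = forward_prob(0.02, 0.96)
```

With the toy fractions shown, the library would be treated as reverse-stranded, which corresponds to 'Reverse' in the 'Library Strand' field of the sample-level model.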
Curation model on the ‘Sample’ level
Items | Curation Model |
Basic Information |
Data Resource | SRA, GSA |
Project ID | SRA or GSA accession number |
Sample ID | CRS… or GSM… |
Sample Name | Brief description of the project scheme |
BioProject ID | PRJNA… or PRJCA… |
BioSample ID | SAMC… or SAMN… |
Sample Accession | SRS… |
Experiment Accession | CRX… or SRX… |
Release Date | Release date of sample data in data resource |
Submission Date | Submission date of sample data in data resource |
Update Date | Update date of sample data in data resource |
Sequencing Strategy | Bulk RNA-seq, Single cell RNA-seq |
Sequencing Platform | Illumina |
Sample Characteristic |
Species | Homo sapiens |
Race | Race refers to a person’s physical characteristics, such as bone structure and skin, hair, or eye color. For example, American Indian, Asian, Black, Hispanic, White, etc. |
Ethnicity | Ethnicity refers to cultural factors, including nationality, regional culture, ancestry, and language. An example of ethnicity is German or Spanish ancestry or Han Chinese. |
Age | The age of samples (patients, healthy donors, etc.) |
Age unit | Year |
Gender | Male, female, etc. |
Source Name | Name of each sample group |
Tissue | Brain, Liver, Skin, Kidney, etc. |
Cell type | T cell, B cell … |
Cell Subtype | Cell subtype or cell population |
Cell Line | CB660, H358, 501 mel, etc. |
Biological Condition |
Disease | Cancer, autoimmune disease … |
Disease State | Number of samples of case condition |
Development Stage | Number of samples of control condition |
Mutation | Related gene mutation |
Phenotype | Symptoms of diseases |
Experimental Variables |
Case/Control | Case/Control grouping |
Case detail | Case details to distinguish between case and control |
Control detail | Control details to distinguish between case and control |
Protocol |
Growth Protocol | Culture protocols of cells from samples or cell lines |
Treatment Protocol | Protocols of sample treatment |
Extract Protocol | The extract protocols of RNA |
Library Construction Protocol | The protocols of RNA sequencing library construction |
Molecule Type | Total RNA, polyA RNA, RNA RiboZero |
Library Layout | PAIRED, SINGLE |
Strand-Specific | Specific, Unspecific |
Library Strand | Reverse means First strand, Forward means Second strand, and dash (-) means strand-unspecific |
Spike-in | ERCC or - |
Sequencing Technology |
Strategy | Bulk RNA-seq, scRNA 10X, scRNA SMARTer, etc. |
Platform | Illumina, BGISEQ, etc. |
Instrument Model | Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc. |
Assessing Quality |
#Cell | The estimated number of cells |
#Reads | The number of reads in fastq file |
Gbases | Total bases after filtering |
AvgSpotLen1 | Average spot1 length (after filtering if filtered) |
AvgSpotLen2 | Average spot2 length (after filtering if filtered) |
Unique-Mapping Rate | Percent of uniquely mapped reads |
Multi-Mapping Rate | Percent of multi-mapped reads |
Coverage Rate | Total number of mapped reads * average read length / total bases of reference genome |
Analysis |
Reference Genome | Reference genome version, e.g. GRCh38 v99 (including ERCC if needed) |
Genome Annotation | Genome annotation version, e.g. GRCh38 v99 (including ERCC if needed) |
Additional information |
Additional information | Supplementary information |
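The 'Coverage Rate' definition in the table above can be written out as a short Python sketch. The function name and the example numbers (30 million mapped reads of 100 bases against a reference of 3.1 billion bases) are illustrative assumptions.

```python
def coverage_rate(mapped_reads, avg_read_length, genome_bases):
    """Total mapped reads * average read length / total bases of the reference."""
    if genome_bases <= 0:
        raise ValueError("genome_bases must be positive")
    return mapped_reads * avg_read_length / genome_bases


# Hypothetical sample: roughly one-fold coverage of the reference.
example_coverage = coverage_rate(30_000_000, 100, 3_100_000_000)
```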
Extended Gene Summary
To enrich and expand gene function annotation, a range of information is provided for each gene (where available) for users’ reference. Currently, the gene summary items are: Entrez ID, HGNC ID, RefSeq ID, Symbol, GC Content, Housekeeping or Tissue-Specific gene status, associated phenotype, related PMID accession number, external link to GeneCards (https://www.genecards.org/), internal links to EDK (https://bigd.big.ac.cn/edk) and ICG (http://icg.big.ac.cn/index.php/Homo_sapiens), Gene Ontology, Disease Ontology, and gene structure visualization on the Genome Browser.
Definition of Tissue-specific (TS) and Housekeeping (HK) gene
Housekeeping genes and tissue-specific genes are defined based on expression profiles derived from the GTEx portal (Genotype-Tissue Expression; 53 normal human tissues are covered). Genes whose highest expression value across all tissues is lower than 0.5 TPM/FPKM are filtered out. Then, the tissue specificity index τ and the coefficient of variation (CV) are used to determine housekeeping genes (HK, τ-value <= 0.5 and CV <= 0.5) and tissue-specific genes (TS, τ-value >= 0.95).
The index τ is defined as: $$\tau = \frac{\sum_{i=1}^{N} (1 - x_i)}{N - 1}$$ where $N$ is the number of tissues and $x_i$ is the expression profile component normalized by the maximal component value. CV stands for coefficient of variation and measures the fluctuation of gene expression levels across tissues. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$: $$CV = \frac{\sigma}{\mu}$$
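The HK/TS rule can be sketched in Python as below, using the τ and CV definitions and the thresholds stated above. The function names are hypothetical, and the use of the population standard deviation for CV is an assumption, since the text does not say whether the population or sample form is used.

```python
from statistics import mean, pstdev


def tau(expr):
    """Tissue specificity index: sum(1 - x_i) / (N - 1), x_i = value / max."""
    n = len(expr)
    top = max(expr)
    if n < 2 or top <= 0:
        raise ValueError("need at least two tissues and a positive maximum")
    return sum(1 - value / top for value in expr) / (n - 1)


def cv(expr):
    """Coefficient of variation: standard deviation / mean.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return pstdev(expr) / mean(expr)


def classify(expr, min_max_expr=0.5):
    """Return 'HK', 'TS', 'other', or None if the gene is filtered out."""
    if max(expr) < min_max_expr:
        return None  # highest expression below 0.5 TPM/FPKM
    t, c = tau(expr), cv(expr)
    if t <= 0.5 and c <= 0.5:
        return "HK"
    if t >= 0.95:
        return "TS"
    return "other"


# Toy profiles (hypothetical values across four tissues).
broad = classify([10, 11, 9, 10])   # evenly expressed
specific = classify([0, 0, 0, 8])   # expressed in one tissue only
```

A gene expressed at the same level everywhere has τ = 0 and CV = 0, while a gene detected in a single tissue has τ = 1, which is why the two toy profiles fall into the HK and TS classes respectively.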
Reference of gene expression and RNA editing profiles derived from GTEx and REDIportal
Reference gene expression levels and RNA editing levels across normal human tissues or body sites are obtained from GTEx (https://www.gtexportal.org) and REDIportal (http://srv00.recas.ba.infn.it/atlas/), respectively. To provide an overview of the general expression and RNA editing patterns, the ‘Average’, ‘Median’, ‘Maximum’ and ‘Minimum’ expression and RNA editing levels across 53 tissues, together with the ‘CV’ value, ‘τ value’ and ‘Expression Breadth’ of each gene, are shown in the ‘Gene Summary’ section and can also be used as options for filtering genes.