Gene Expression Nebulas

Documentation

Introduction
Gene Expression Nebulas (GEN) is a data portal of gene expression profiles under various conditions, derived entirely from RNA-Seq data analysis in multiple species. It aims to serve the broad research community dedicated to exploring functional genomics. As the initial step, GEN-1.0 provides a comprehensive transcriptomic and post-transcriptomic landscape of human disease through ontology-based systematic integration of RNA sequencing data acquired from GEO, SRA, EBI and GSA. GEN-1.0 offers user-friendly interfaces to access, visualize and further mine the curated gene expression data through Browse, Search, Download, JBrowse and a suite of optional tools.
Data Collection
Data Processing

RNA-seq data processing pipelines include raw data preprocessing (quality control), read alignment, gene/transcript expression quantification (for both bulk and single-cell RNA-seq data), and cell clustering and cell annotation (for single-cell RNA-seq data only).

Bulk RNA-seq data analysis – preprocessing and gene/transcript expression quantification
First, low-quality RNA-seq reads are filtered out using fastp v0.20.0 (Chen et al, 2018), and the strandedness of the RNA-seq library is inferred by RSeQC v2.6.4 (Wang et al, 2012). The high-quality reads are then mapped to the Ensembl GRCh38 reference genome by STAR v2.7.1a (Dobin et al, 2013). After read alignment, gene/isoform assembly and quantification are performed using RSEM v1.3.1 (Li & Dewey, 2011) with default parameters. For basic expression profiling, RawCount, RPKM and TPM are all calculated.
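The relationship between the three reported values can be illustrated with a minimal Python sketch. It assumes plain gene lengths in base pairs; RSEM itself works with effective lengths and expected counts, so the numbers it reports will differ. The function names and the toy data are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads, from raw counts."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: counts[gene] * 1e9 / (lengths_bp[gene] * total)
            for gene in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    denom = sum(rate.values())
    if denom == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: r * 1e6 / denom for gene, r in rate.items()}


# Toy input (hypothetical genes, counts and lengths)
raw_counts = {"geneA": 100, "geneB": 300}
gene_lengths = {"geneA": 1000, "geneB": 3000}
rpkm_values = rpkm(raw_counts, gene_lengths)
tpm_values = tpm(raw_counts, gene_lengths)
```

Unlike RPKM, TPM values always sum to one million within a sample, which makes them easier to compare across samples.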
Bulk RNA-seq data analysis – identification of RNA editing sites and quantification of RNA editing level
Identification of RNA editing sites and quantification of RNA editing levels are mainly performed with REDItoolDenovo.py from REDItools 2.0 (https://github.com/BioinfoUNIBA/REDItools2). First, all candidate RNA editing sites are identified and quantified based on read coverage and variation frequency at each site, using the parallel strategy of REDItools 2.0. Second, candidate RNA editing sites located in Alu and non-Alu regions are filtered with different parameter configurations and then annotated using a variety of annotation files, such as gene annotation, RepeatMasker, SNP and known RNA editing information. Third, additional filtering criteria are applied to obtain more accurate novel editing sites in non-Alu regions, because non-Alu regions usually have a narrow range of editing sites. To do this, Pblat (Wang and Kong 2019) is used to detect mismatched and multi-mapping reads, while Samtools (Li et al. 2009) is used to remove duplicated reads. Finally, RNA editing sites are tagged as novel or known. In the current version, both A-to-I and C-to-U RNA editing types are included.
Sources of annotation files: (1) Gene annotation file: GENCODE V33, (2) RepeatMasker annotation file: UCSC, (3) SNP annotation files: UCSC, (4) Known RNA editing sites: REDIportal database (http://srv00.recas.ba.infn.it/atlas/). Genomic coordinates of the RepeatMasker file and the known RNA editing sites file are converted from hg19 to hg38 using UCSC liftOver.
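As a rough illustration of quantifying an editing level from read coverage and variation frequency, the sketch below treats the editing level at a site as the fraction of reads carrying the edited base among reads carrying either the reference or the edited base. This definition, the function names and the thresholds are assumptions made for illustration only; they are not the REDItools implementation or the parameter values used by GEN.

```python
def editing_level(ref_count, alt_count):
    """Fraction of edited reads among reads supporting ref or edited base."""
    total = ref_count + alt_count
    if total == 0:
        return None  # site is not covered
    return alt_count / total


def passes_filter(ref_count, alt_count, min_coverage=10, min_level=0.1):
    """Keep a candidate site only if coverage and level reach the cutoffs.

    min_coverage and min_level are hypothetical defaults.
    """
    level = editing_level(ref_count, alt_count)
    if level is None:
        return False
    return (ref_count + alt_count) >= min_coverage and level >= min_level


# A-to-I example: 90 reads show A (reference), 10 reads show G (edited)
level = editing_level(90, 10)
keep = passes_filter(90, 10)
```

Using separate cutoffs per region type, for example stricter ones for non-Alu sites, would correspond to the different parameter configurations described above.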
Single cell RNA-seq data analysis – read alignment and generation of count matrix
The software used for read alignment depends on the type of data being analyzed. For droplet-based single-cell RNA-seq data, there is no quality control step in the upstream analysis. For protocols such as Smart-Seq, Smart-Seq2 and Fluidigm C1, the processing pipeline is the same as for bulk RNA-seq analysis (fastp, RSeQC and RSEM), except that the RSEM step uses the special parameter ‘--single-cell-prior’, which applies a Dirichlet(0.1) prior when calculating posterior mean estimates and credibility intervals. ‘CellRanger’ is used for data generated on the 10X Genomics platform. The matrix counting step is slightly more complicated for droplet-based single-cell data because the tools need to keep track of where each read came from (which cell and, if UMIs were used, which transcript). Generating the matrix of read counts per gene is part of the alignment step; rows usually correspond to genes and columns to cells.
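The bookkeeping behind a UMI-based count matrix can be sketched as follows. This is a simplified illustration, not what CellRanger does internally: it assumes each aligned read has already been reduced to a (cell barcode, UMI, gene) triple and collapses only exact duplicates, without barcode or UMI error correction. All names and records are hypothetical.

```python
from collections import defaultdict


def build_count_matrix(records):
    """Build a genes x cells matrix of unique UMI counts.

    records: iterable of (cell_barcode, umi, gene) tuples, one per read.
    """
    unique_molecules = set(records)  # reads sharing cell, UMI and gene collapse
    counts = defaultdict(int)
    for cell, _umi, gene in unique_molecules:
        counts[(gene, cell)] += 1
    genes = sorted({gene for gene, _ in counts})
    cells = sorted({cell for _, cell in counts})
    matrix = [[counts.get((gene, cell), 0) for cell in cells] for gene in genes]
    return genes, cells, matrix


reads = [
    ("AAAC", "U1", "GeneA"),
    ("AAAC", "U1", "GeneA"),  # PCR duplicate of the read above
    ("AAAC", "U2", "GeneA"),
    ("TTTG", "U1", "GeneB"),
]
genes, cells, matrix = build_count_matrix(reads)
```

Rows follow the sorted gene list and columns the sorted cell barcodes, matching the genes-by-cells orientation described above.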
Metadata Curation
Manual curation of the metadata of all included RNA-seq datasets is done at two levels (‘Project’ and ‘Sample’), based on the structured curation models listed below.
Curation models on the ‘Project’ level
Items Curation Model
Data Resource GEO, SRA, GSA
Project ID Associated BioProject ID
Project Name Title in GEO
Summary Brief description of the project scheme
Overall Design Experiment design, mainly including samples grouping
PMID PubMed ID
Release Date Release date of project data in data resource
Submission Date Submission date of project data in original data resource
Update Date Update date of project data in data resource
Species Homo sapiens
Tissue Brain, Liver, Skin, Kidney, etc
Sample Number Number of samples included in the project
Reference Genome Reference genome version, e.g. GRCh38 v99 (including ERCC if needed)
Genome Annotation Genome annotation version, e.g. GRCh38 v99 (including ERCC if needed)
Data Processing
Fastp v0.20.0; fastp -q 20 -u 40 -l 50 -g -x -r -W 4 -M 20
RSeQC v2.6.4; python infer_experiment.py -r $index -I $Projectid/$SRXid/_Alignment-unsorted.bam > $Projectid/$SRXid/_RseQC.txt
STAR v2.7.1a, RSEM v1.3.1; rsem-calculate-expression --single-cell-prior --keep-intermediate-files --temporary-folder $Projectid/$SRXid/_STAR_rsem --paired-end --forward-prob=$value -p $thread --time --star --star-path /software/biosoft/software/STAR-2.7/STAR/bin/Linux_x86_64/ --append-names --output-genome-bam --sort-bam-by-coordinate --star-gzipped-read-file $Projectid/$SRXid/_clean_1.fastq.gz $Projectid/$SRXid/_clean_2.fastq.gz $index $Projectid/$SRXid/_star_RSEM
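The commands above pass a value to RSEM's --forward-prob option that depends on the strandedness reported by RSeQC. How that value is chosen is not stated here, so the following Python sketch is only one plausible mapping: it parses the text report written by infer_experiment.py and returns 1, 0 or 0.5. The function name and the 0.8 cutoff are assumptions.

```python
import re


def forward_prob_from_rseqc(report_text, threshold=0.8):
    """Map an infer_experiment.py report to an RSEM --forward-prob value."""
    fractions = {}
    for match in re.finditer(r'explained by "([^"]+)":\s*([0-9.]+)', report_text):
        fractions[match.group(1)] = float(match.group(2))
    # Paired-end keys first, single-end keys as fallback
    forward = fractions.get("1++,1--,2+-,2-+", fractions.get("++,--", 0.0))
    reverse = fractions.get("1+-,1-+,2++,2--", fractions.get("+-,-+", 0.0))
    if forward >= threshold:
        return 1.0   # upstream reads come from the forward strand
    if reverse >= threshold:
        return 0.0   # upstream reads come from the reverse strand
    return 0.5       # treated as strand-unspecific


example_report = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0100
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9728
'''
value = forward_prob_from_rseqc(example_report)
```

In this example most reads fall in the second category, so the library is treated as reverse-stranded.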
Curation model on the ‘Sample’ level
Items Curation Model
Basic Information
Data Resource SRA,GSA
Project ID SRA or GSA accession number
Sample ID CRS… or GSM…
Sample Name Brief description of the project scheme
BioProject ID PRJNA… or PRJCA…
BioSample ID SAMC… or SAMN…
Sample Accession SRS…
Experiment Accession CRX… or SRX…
Release Date Release date of sample data in data resource
Submission Date Submission date of sample data in data resource
Update Date Update date of sample data in data resource
Sequencing Strategy Bulk RNA-seq, Single cell RNA-seq
Sequencing Platform Illumina
Sample Characteristic
Species Homo sapiens
Race Race refers to a person’s physical characteristics, such as bone structure and skin, hair, or eye color. For example, American Indian, Asian, Black, Hispanic, White, etc.
Ethnicity Ethnicity refers to cultural factors, including nationality, regional culture, ancestry, and language. An example of ethnicity is German or Spanish ancestry or Han Chinese.
Age The age of samples (patients, healthy donors, etc)
Age unit Year
Gender Male, female, etc.
Source Name Name of each sample group
Tissue Brain, Liver, Skin, Kidney, etc
Cell type T cell, B cell …
Cell Subtype Cell subtype or cell population
Cell Line CB660, H358, 501 mel, etc
Biological Condition
Disease Cancer, autoimmune disease …
Disease State Number of samples of case condition
Development Stage Number of samples of control condition
Mutation Related gene mutation
Phenotype Symptoms of diseases
Experimental Variables
Case/Control Case/Control grouping
Case detail Case details to distinguish between case and control
Control detail Control details to distinguish between case and control
Protocol
Growth Protocol Culture protocols of cells from samples or cell lines
Treatment Protocol Protocols of sample treatment
Extract Protocol The extract protocols of RNA
Library Construction Protocol The protocols of RNA sequencing library construction
Molecule Type Total RNA, polyA RNA, RNA RiboZero
Library Layout PAIRED, SINGLE
Strand-Specific Specific, Unspecific
Library Strand Reverse means First strand, Forward means Second strand, and dash (-) means strand-unspecific
Spike-in ERCC or -
Sequencing Technology
Strategy Bulk RNA-seq, scRNA 10X, scRNA SMARTer, etc
Platform Illumina, BGISEQ, etc
Instrument Model Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc
Assessing Quality
#Cell The estimated number of cells
#Reads The number of reads in fastq file
Gbases Total bases after filtering
AvgSpotLen1 Average spot1 length (after filtering if filtered)
AvgSpotLen2 Average spot2 length (after filtering if filtered)
Unique-Mapping Rate Percent of uniquely mapped reads
Multi-Mapping Rate Percent of multi-mapped reads
Coverage Rate (Total number of mapped reads × average read length) / total bases of reference genome
Analysis
Reference Genome Reference genome version, e.g. GRCh38 v99 (including ERCC if needed)
Genome Annotation Genome annotation version, e.g. GRCh38 v99 (including ERCC if needed)
Additional information
Additional information Supplementary information
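The Coverage Rate item under Assessing Quality in the table above is plain arithmetic, shown here as a small Python sketch. The function name and the input numbers are hypothetical; the genome size in particular is a round placeholder, not the exact length of GRCh38.

```python
def coverage_rate(mapped_reads, avg_read_length, genome_bases):
    """Total mapped reads x average read length / total reference bases."""
    if genome_bases <= 0:
        raise ValueError("genome_bases must be positive")
    return mapped_reads * avg_read_length / genome_bases


# 30 million mapped reads of 100 bp against a 3 Gb reference (placeholder)
rate = coverage_rate(30_000_000, 100, 3_000_000_000)
```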
Feature-rich Gene Annotation
Extended Gene Summary
To enrich and expand gene function annotation, a range of information is provided for each gene (where available) for users’ reference. Currently, the gene summary items are: Entrez ID, HGNC ID, RefSeq ID, Symbol, GC Content, Housekeeping or Tissue-Specific gene, associated phenotype, related PMID accession number, external link to GeneCards (https://www.genecards.org/), internal links to EDK (https://bigd.big.ac.cn/edk) and ICG (http://icg.big.ac.cn/index.php/Homo_sapiens), Gene Ontology, Disease Ontology, and gene structure visualization in the Genome Browser.
Definition of Tissue-specific (TS) and Housekeeping (HK) gene
Housekeeping genes and tissue-specific genes are defined based on expression profiles from the GTEx portal (Genotype-Tissue Expression; 53 normal human tissues are covered). Genes whose highest expression value across all tissues is lower than 0.5 TPM/FPKM are filtered out. Then, the tissue specificity index τ and the CV (coefficient of variation) are used to determine housekeeping genes (HK, τ-value <= 0.5 and CV <= 0.5) and tissue-specific genes (TS, τ-value >= 0.95).
The index τ value is defined as:
$$\tau = \frac{\sum_{i=1}^{N} (1 - x_i)}{N - 1}$$
where N is the number of tissues and $x_i$ is the expression profile component normalized by the maximal component value. CV is the coefficient of variation, which measures the fluctuation of gene expression levels across tissues. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$:
$$CV = \frac{\sigma}{\mu}$$
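The HK/TS rules can be sketched in Python as follows: τ is the sum of (1 - x_i) over tissues divided by (N - 1), with x_i the expression in tissue i divided by the gene's maximum expression, and CV is the standard deviation divided by the mean. Two details are assumptions here because the text does not specify them: the population standard deviation is used for CV, and genes matching neither rule are labelled "other". Function names are hypothetical.

```python
from statistics import mean, pstdev


def tau_index(expr):
    """Tissue specificity index: 0 = uniform expression, 1 = one tissue only."""
    n, max_expr = len(expr), max(expr)
    if n < 2 or max_expr <= 0:
        return None
    return sum(1 - x / max_expr for x in expr) / (n - 1)


def coefficient_of_variation(expr):
    """Standard deviation divided by mean (population form, an assumption)."""
    mu = mean(expr)
    if mu == 0:
        return None
    return pstdev(expr) / mu


def classify_gene(expr, min_max_expr=0.5):
    """Apply the expression filter, then the TS and HK thresholds."""
    if max(expr) < min_max_expr:
        return "filtered"
    tau = tau_index(expr)
    cv = coefficient_of_variation(expr)
    if tau is None or cv is None:
        return "other"
    if tau >= 0.95:
        return "TS"
    if tau <= 0.5 and cv <= 0.5:
        return "HK"
    return "other"


uniform_gene = [10.0, 10.0, 10.0, 10.0]   # same level in every tissue
specific_gene = [0.0, 0.0, 0.0, 8.0]      # expressed in a single tissue
labels = (classify_gene(uniform_gene), classify_gene(specific_gene))
```

The toy profiles use four tissues for brevity; the same functions apply unchanged to a 53-tissue profile.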
Reference of gene expression and RNA editing profiles derived from GTEx and REDIportal
Reference gene expression levels and RNA editing levels across normal human tissues or body sites are obtained from GTEx (https://www.gtexportal.org) and REDIportal (http://srv00.recas.ba.infn.it/atlas/), respectively. To provide an overview of the general expression and RNA editing patterns, the ‘Average’, ‘Median’, ‘Maximum’ and ‘Minimum’ expression and RNA editing levels across 53 tissues, as well as the ‘CV’ value, ‘τ value’ and ‘Expression Breadth’ of each gene, are shown in the ‘Gene Summary’ section and can also be used as options for filtering genes.
Gene expression profile across 53 normal human tissues is downloaded from GTEx at: https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz
Main annotation table of all known and predicted editing sites is downloaded from REDIportal at: http://srv00.recas.ba.infn.it/webshare/rediportalDownload/table1_full.txt.gz
RNA editing profile across normal human tissues is downloaded from REDIportal at: http://srv00.recas.ba.infn.it/webshare/rediportalDownload/table2_full.txt.gz
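The GTEx file above is distributed in GCT format: a version line, a dimensions line, a tab-separated header beginning with Name and Description, then one row per gene. A minimal reader using only the Python standard library might look like the sketch below. The function name is hypothetical, and the example parses a tiny in-memory table rather than the real download.

```python
import csv
import io


def read_gct(handle):
    """Parse GCT text into (sample_names, {gene_id: [values...]})."""
    version = handle.readline().strip()
    if not version.startswith("#1."):
        raise ValueError("not a GCT file: missing version line")
    handle.readline()  # dimensions line (rows, columns), not needed here
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # first two columns are Name and Description
    table = {}
    for row in reader:
        if row:
            table[row[0]] = [float(value) for value in row[2:]]
    return samples, table


toy_gct = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tS1\tS2\n"
    "ENSG1\tGeneA\t5\t0\n"
    "ENSG2\tGeneB\t1\t2\n"
)
samples, table = read_gct(io.StringIO(toy_gct))
```

For the compressed download, the same function can be given a text-mode handle such as `gzip.open(path, "rt")`.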