LncExpDB

Expression Database of Human Long non-coding RNAs

Introduction

1 What is LncExpDB?

LncExpDB is a comprehensive database for lncRNA expression. It covers expression profiles of lncRNA genes across various biological contexts, predicts potential functional lncRNAs and their interacting partners, and thus provides essential guidance on experimental design.

Based on comprehensive integration, stringent curation and systematic analysis, LncExpDB currently presents a collection of 101,293 high-quality human lncRNA genes and houses abundant expression profiles derived from 1,977 samples of 337 biological conditions across nine biological contexts. Consequently, LncExpDB estimates lncRNA genes’ expression reliability and capacities, identifies 25,191 featured genes, and further obtains 28,443,865 lncRNA-mRNA interactions.

Moreover, LncExpDB is equipped with user-friendly web interfaces, providing functionalities for data query, browsing, visualization as well as easy access.

Data and Methods

2.1 Data collection

LncExpDB collects a total of 24 RNA-seq datasets across 1,977 samples from GEO, SRA and ArrayExpress, covering 337 biological conditions of nine important biological contexts, including normal tissues/cell lines, organ development, preimplantation embryos, cell differentiation, subcellular localizations, exosomes, cancer cell lines, virus infection and circadian rhythm.

| Biological Context | Project ID | Dataset | Source | Sample Number | PMID |
|---|---|---|---|---|---|
| Normal Tissue/Cell | E-MTAB-2836 | The Human Protein Atlas | EBI ArrayExpress | 121 | 28940711 |
| Normal Tissue/Cell | SRP013565 | ENCODE Primary Cell Lines | NCBI SRA | 111 | 29126249 |
| Organ Development | E-MTAB-6814 | Development of Brain | EBI ArrayExpress | 55 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Cerebellum | EBI ArrayExpress | 59 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Heart | EBI ArrayExpress | 50 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Kidney | EBI ArrayExpress | 40 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Liver | EBI ArrayExpress | 50 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Ovary | EBI ArrayExpress | 18 | 31243368 |
| Organ Development | E-MTAB-6814 | Development of Testis | EBI ArrayExpress | 41 | 31243368 |
| Preimplantation Embryo | GSE71318 | Oocyte to Late Blastocyst (7 Stages) | NCBI GEO | 35 | 27315811 |
| Preimplantation Embryo | GSE36552 | Oocyte to Late Blastocyst (9 Stages) | NCBI GEO | 90 | 23934149 |
| Cell Differentiation | GSE122380 | Cell Differentiation | NCBI GEO | 297 | 31249060 |
| Subcellular Localization | GSE116008 | Subcellular Localization | NCBI GEO | 36 | 31230715 |
| Exosome | GSE104926 | Blood Exosomes from Early-Stage Esophageal Squamous Cell Carcinoma Patients vs. Normal Control | NCBI GEO | 12 | 32043367 |
| Exosome | GSE100063, GSE100206 | Blood Exosomes from Colorectal Cancer Patients vs. Normal Control | NCBI GEO | 44 | 30053265 |
| Exosome | GSE100063, GSE100206 | Blood Exosomes from Coronary Heart Disease vs. Normal Control | NCBI GEO | 38 | 30053265 |
| Exosome | GSE100063, GSE100206 | Blood Exosomes from Hepatocellular Carcinoma vs. Normal Control | NCBI GEO | 53 | 30053265 |
| Exosome | GSE100063, GSE100206 | Blood Exosomes from Pancreatic Adenocarcinoma Patients vs. Normal Control | NCBI GEO | 46 | 30053265 |
| Cancer Cell Line | PRJNA523380 | Cancer Cell Line | NCBI SRA | 658 | 31068700 |
| Virus Infection | GSE125686 | HIV Infection vs. Normal Control | NCBI GEO | 22 | 30185599 |
| Virus Infection | GSE125686 | HBV Infection vs. Normal Control | NCBI GEO | 48 | 30185599 |
| Virus Infection | GSE125686 | HCV Infection vs. Normal Control | NCBI GEO | 24 | 30185599 |
| Virus Infection | GSE147507 | COVID Patients vs. Normal Control | NCBI GEO | 4 | 32416070 |
| Circadian Rhythm | GSE113883 | Circadian Rhythm | NCBI GEO | 153 | 30201705 |

2.2 Data structure

2.3 LncRNA integration and curation

LncExpDB integrates human lncRNA transcripts from LncBook v1.2, RefLnc, GENCODE v33, CHESS v2.2, FANTOM-CAT (lv4_stringent) and BIGTranscriptome. To obtain a high-confidence lncRNA dataset, a set of strict criteria is applied that considers redundancy, mapping errors, possible pre-mRNA fragments, polymerase run-on, incomplete transcripts, length, boundaries, strand and coding potential.

2.4 Overlap among different resources at transcript levels

Overlap is defined as an exact match of all exon junctions and of the 5'-start and 3'-end boundaries.
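The exact-match rule can be sketched as follows. This is a minimal illustration, not LncExpDB's own code: the `Transcript` representation, the function names, and the requirement that chromosome and strand also match are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    """Hypothetical transcript record used only for this sketch."""
    chrom: str
    strand: str
    exons: List[Tuple[int, int]]  # (start, end) genomic coordinates


def structure_key(tx: Transcript):
    """Key made of both transcript ends plus every exon junction."""
    exons = sorted(tx.exons)
    # A junction is the end of one exon paired with the start of the next.
    junctions = tuple(
        (exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)
    )
    return (tx.chrom, tx.strand, exons[0][0], exons[-1][1], junctions)


def is_exact_overlap(a: Transcript, b: Transcript) -> bool:
    """True when both ends and all junctions are identical."""
    return structure_key(a) == structure_key(b)
```

Because the key is hashable, transcripts from several resources could also be grouped by `structure_key` in a dictionary instead of being compared pairwise.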

2.5 Assignment of lncRNA transcripts into gene loci

LncRNA transcripts are assigned to the same gene if they share exonic sequences on the same strand. Six cases of lncRNA genes are listed.
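One way to sketch this grouping is with a union-find structure, so that transcripts linked through a chain of shared exons end up in one locus. The transitive grouping, the inclusive coordinate convention and all names below are assumptions for illustration; the source does not describe the actual implementation.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed inclusive coordinates


def exons_overlap(a: List[Exon], b: List[Exon]) -> bool:
    """True if any exon of a shares at least one base with an exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def assign_gene_loci(
    transcripts: Dict[str, Tuple[str, str, List[Exon]]]
) -> List[List[str]]:
    """Group transcript IDs into loci.

    transcripts maps an ID to (chromosome, strand, exons).
    """
    ids = list(transcripts)
    parent = {t: t for t in ids}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Quadratic pairwise scan: fine for a sketch, too slow for real data.
    for i, a in enumerate(ids):
        chrom_a, strand_a, exons_a = transcripts[a]
        for b in ids[i + 1:]:
            chrom_b, strand_b, exons_b = transcripts[b]
            same_strand = chrom_a == chrom_b and strand_a == strand_b
            if same_strand and exons_overlap(exons_a, exons_b):
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for t in ids:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())
```

For genome-scale input, sorting exons by coordinate and sweeping once per chromosome and strand would replace the pairwise loop.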

2.6 LncRNA classification

Based on their genomic locations with respect to protein-coding genes, lncRNAs are classified into seven groups: Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense, and Antisense. "S" in parentheses indicates that the lncRNA is on the same strand as the protein-coding gene, and "AS" indicates that it is on the antisense strand.

  • Intergenic: lncRNAs are transcribed from intergenic regions;
  • Intronic (S): lncRNAs are transcribed entirely from introns of protein-coding genes;
  • Intronic (AS): lncRNAs are transcribed from the antisense strand of protein-coding genes and their entire sequences are covered by introns of protein-coding genes;
  • Overlapping (S): lncRNAs contain protein-coding genes within an intron on the sense strand;
  • Overlapping (AS): lncRNAs contain protein-coding genes within an intron on the antisense strand;
  • Sense: lncRNAs are transcribed from the sense strand of protein-coding genes, and either the entire lncRNA sequence is covered by a protein-coding gene (Intronic lncRNAs are not included), or the entire protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene partially intersect;
  • Antisense: lncRNAs are transcribed from the antisense strand of protein-coding genes, and either the entire lncRNA sequence is covered by a protein-coding gene (Intronic lncRNAs are not included), or the entire protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene partially intersect.
    See more details in LncRNAWiki.

    2.7 Read mapping, quantification and normalization

    All samples are processed by a standardized RNA-seq pipeline (Trimmomatic, FastQC, STAR, RSeQC, Kallisto and featureCounts) to obtain the abundance matrices (read counts, CPM, FPKM and TPM) of lncRNAs. The raw abundance matrices are normalized using the TMM method.
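For readers unfamiliar with the abundance units, the sketch below shows the standard definitions of CPM, FPKM and TPM for a single sample, computed from raw counts and gene lengths. It is illustrative only: the function names are hypothetical, the TMM normalization step is not reproduced, and the pipeline's actual tools may use effective rather than annotated gene lengths.

```python
from typing import Dict


def cpm(counts: Dict[str, float]) -> Dict[str, float]:
    """Counts per million mapped reads."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def fpkm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    denom = sum(rate.values())
    return {g: r / denom * 1e6 for g, r in rate.items()}
```

TPM values always sum to one million within a sample, which is why TPM thresholds such as those used in the following sections are comparable across samples of different sequencing depth.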

    2.8 Estimation of transcription reliability

    LncExpDB considers an lncRNA gene whose maximum expression value in a given biological condition is less than 1.0 TPM as not expressed (NE) in that condition. LncRNA genes tagged NE in all available biological conditions are most likely unreliable. This assessment may change as new biological conditions are covered.
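The NE rule can be expressed in a few lines. The sketch assumes a hypothetical input layout, one gene's TPM values grouped by condition with at least one sample per condition; the names are not taken from LncExpDB.

```python
from typing import Dict, List, Set


def not_expressed_conditions(
    tpm_by_condition: Dict[str, List[float]], threshold: float = 1.0
) -> Set[str]:
    """Conditions in which the gene's maximum TPM is below the threshold."""
    return {
        cond
        for cond, values in tpm_by_condition.items()
        if max(values) < threshold
    }


def is_unreliable(
    tpm_by_condition: Dict[str, List[float]], threshold: float = 1.0
) -> bool:
    """True when the gene is tagged NE in every available condition."""
    if not tpm_by_condition:
        return False  # no data: make no call rather than flag the gene
    ne = not_expressed_conditions(tpm_by_condition, threshold)
    return len(ne) == len(tpm_by_condition)
```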

    2.9 Estimation of lncRNA expression capacity

    All expressed genes are ranked within each condition (time point/stage/tissue/cell/component/processing). Genes with expression values greater than the upper quantile are classified as “H” (high expression level), those less than the lower quantile as “L” (low expression level), and the remainder as “M” (medium expression level). High-capacity lncRNAs (HCL) are genes classified as “H” in at least one condition, low-capacity lncRNAs (LCL) are those classified as “L” in all conditions, and the rest are medium-capacity lncRNAs (MCL). As more biological conditions are covered, an LCL may be reclassified as MCL or HCL, and an MCL as HCL.
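The two-step classification can be sketched as below. The text does not state which quantiles are used, so the example assumes the upper and lower quartiles (75th and 25th percentiles) computed with Python's `statistics.quantiles`; the function names and the input layout are likewise illustrative.

```python
import statistics
from typing import Dict, List


def classify_condition(expr: Dict[str, float]) -> Dict[str, str]:
    """Label each expressed gene in one condition as H, M or L.

    Assumes quartile cut-offs; the cut-offs used by LncExpDB may differ.
    """
    q1, _, q3 = statistics.quantiles(expr.values(), n=4, method="inclusive")
    labels = {}
    for gene, value in expr.items():
        if value > q3:
            labels[gene] = "H"
        elif value < q1:
            labels[gene] = "L"
        else:
            labels[gene] = "M"
    return labels


def capacity(labels: List[str]) -> str:
    """Combine one gene's per-condition labels into HCL, MCL or LCL."""
    if "H" in labels:
        return "HCL"
    if labels and all(label == "L" for label in labels):
        return "LCL"
    return "MCL"
```

Adding a condition can only keep or raise a gene's capacity class under this rule, which matches the note that LCL or MCL genes may move upward as coverage grows.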

    2.10 Identification of featured lncRNA genes

    LncExpDB identifies and characterizes featured lncRNA genes that are specifically expressed in a certain cell line/tissue, differentially expressed in the context of cancer or virus infection, enriched in a subcellular compartment, dynamically expressed during cell differentiation or embryo/organ development, or periodically expressed with circadian rhythm.

    The featured genes are identified using specialized methods with strict criteria:

  • Time-course expression patterns: R-squared >= 0.7 and adjusted p-value < 0.05, maSigPro;
  • Stage-specific genes and tissue/cell-specific genes: τ (tissue-specificity index) >= 0.9, maximum TPM >= 10;
  • Consistently expressed genes: τ <= 0.35, maximum TPM >= 10;
  • Differentially expressed genes in virus infection and exosomes: |log2 fold change| >= 1 and adjusted p-value <= 0.05, DESeq2;
  • Organelle-enriched genes: log2 fold change >= 1 and adjusted p-value <= 0.05, DESeq2;
  • Circadian genes: meta2d BH-adjusted value < 0.05, MetaCycle.
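As an illustration of the τ-based criteria in the list above, the sketch below uses the commonly cited definition of the tissue-specificity index, τ = Σ(1 − x_i / x_max) / (n − 1), where x_i is a gene's expression in condition i. The source does not say whether expression values are log-transformed or averaged per condition before τ is computed, so the example applies the formula to one TPM value per condition as an assumption; the function names are hypothetical.

```python
from typing import List, Optional


def tau_index(expr: List[float]) -> Optional[float]:
    """Tissue-specificity index: 0 = uniform, 1 = single-condition."""
    n = len(expr)
    if n < 2:
        raise ValueError("tau needs at least two conditions")
    x_max = max(expr)
    if x_max <= 0:
        return None  # gene not expressed anywhere: tau is undefined
    return sum(1 - x / x_max for x in expr) / (n - 1)


def classify_gene(
    expr: List[float],
    specific_tau: float = 0.9,
    consistent_tau: float = 0.35,
    min_max_tpm: float = 10.0,
) -> str:
    """Apply the tau and maximum-TPM thresholds from the list above."""
    tau = tau_index(expr)
    if tau is None or max(expr) < min_max_tpm:
        return "unclassified"
    if tau >= specific_tau:
        return "specific"
    if tau <= consistent_tau:
        return "consistent"
    return "unclassified"
```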

    2.11 LncRNA-mRNA interaction prediction

    LncExpDB predicts lncRNA-mRNA interactions based on co-expression networks. Co-expression relationships between lncRNAs and mRNAs are identified using the Pearson correlation coefficient (adjusted p-value < 0.01 and |r| >= 0.5). Because of its extremely small sample size (n = 4), the “COVID patients vs. normal control” dataset is not analyzed in this section.
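The correlation part of this criterion can be sketched with the standard library alone. The example computes the Pearson coefficient for one lncRNA-mRNA pair and applies only the |r| >= 0.5 cut-off; the adjusted p-value filter is deliberately left out because it needs a t-distribution and a multiple-testing correction across all pairs. Function names are illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, treat as none
    return sxy / math.sqrt(sxx * syy)


def passes_r_cutoff(
    lnc: Sequence[float], mrna: Sequence[float], min_abs_r: float = 0.5
) -> bool:
    """True when |r| meets the cut-off (p-value filter not included)."""
    return abs(pearson_r(lnc, mrna)) >= min_abs_r
```

With only four samples, almost any pair of profiles can reach a high |r| by chance, which is consistent with excluding the n = 4 dataset from this analysis.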

    Database Usage

    3.1 Quick start in LncExpDB

    Enter a gene symbol or gene ID (LncExpDB ID) in the search box on the homepage to explore an lncRNA of interest. In the “Resources” part or the “Context” section of the navigation bar, click a context to explore the expression profiles of featured lncRNAs and the lncRNA-mRNA interactions across the biological conditions of that context. There you can view the predefined featured genes or explore a group of lncRNA genes of interest with customized filters.

    For an overview of expression capacities, featured genes or interactions across different contexts, click “Expression Capacity”, “Featured Genes” or “Interactions” in the navigation bar.

    3.2 Browse lncRNA genes in LncExpDB

    You can browse all lncRNAs on the "Genes" page, which lists basic information for each gene: gene ID/symbol, classification, chromosome, strand, location, gene length and transcript number. You can search for lncRNAs of interest by gene ID/transcript ID (derived from LncBook v1.2, RefLnc, NONCODE v5, GENCODE v33, CHESS v2.2, FANTOM-CAT (lv4_stringent) and BIGTranscriptome), by gene symbol (derived from HGNC), or by chromosome or classification type. Each gene ID links to a detailed information page showing expression profiles in different contexts. On the detailed gene page, all corresponding gene and transcript IDs link to their original pages. In addition, users can view our reference gene track in the UCSC Genome Browser.

    3.3 Browse featured lncRNAs in LncExpDB

    You can explore featured lncRNAs on the "Featured Genes" page, which covers tens of thousands of featured genes with specific expression patterns in at least one biological context. You can filter and/or re-order the table content using the categories and search boxes in the header line. Each gene ID links to a detailed information page showing expression profiles in different contexts.

    3.4 View featured lncRNAs and interactions among biological contexts in LncExpDB

    You can view all types of biological samples in the "Contexts" page including normal tissues and cell lines, organ development, preimplantation embryos, cell differentiation, subcellular localization, exosome, cancer cell line, virus infection and circadian rhythm. Each context page contains the tabs of “Featured Genes” and “Interaction”.

    By clicking the tab of “Featured Genes”, you can select specific datasets of interest and browse all defined featured genes, e.g., specifically or consistently expressed genes in a certain context. In addition, you can select a specific group of genes with custom thresholds. You can filter and/or re-order the expression profile table using the categories and search boxes in the header line.

    By clicking the “Interactions” tab, you can select specific datasets of interest and browse the cis or trans interactions between lncRNAs and mRNAs. Moreover, you can select a specific group of interactions using custom thresholds, or search for related interactions by lncRNA/protein-coding gene ID or symbol.

    3.5 Browse expression capacity in LncExpDB

    On the "Expression Capacity" page, you can browse each lncRNA’s expression capacity in various biological contexts. You can filter for high-capacity lncRNAs in one or multiple contexts using the categories and search boxes in the header line of the expression capacity table. Furthermore, the “Chart” enables visualization of the expression level distribution across all biological conditions. Each gene ID links to a detailed information page showing expression profiles in different contexts.

    3.6 Visualize lncRNA-mRNA interactions in LncExpDB

    You can visualize all lncRNA-mRNA interactions on the “Interactions” page, which includes detailed information on each lncRNA-mRNA pair: the Pearson correlation coefficient, p-value and distance. The “search by” tab allows you to narrow down the results to a gene of interest. Each gene ID links to a detailed information page showing expression profiles in different contexts.

    3.7 Download data in LncExpDB

    The “Downloads” page contains all the files that you can download such as: i) reference gene model for RNA-seq analysis, ii) expression profiles, iii) expression levels, iv) featured genes and v) co-expression matrix in various biological contexts.

    3.8 Statistical results in LncExpDB

    On the “Statistics” page, you can find and download all statistical results for i) gene annotation, such as lncRNA integration, exon and transcript number distributions and lncRNA classification, ii) expression, including expression profiles and the distribution of featured lncRNAs in different biological contexts, and iii) the distribution of lncRNA-mRNA interactions.

    Contact Us

    Email:
    lncexpdb(AT)big.ac.cn
    Postal Address
    The LncExpDB Team
    National Genomics Data Center
    China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences
    No.1 Beichen West Road
    Chaoyang District, Beijing 100101
    China
    Project Leader:
    Lina Ma: malina(AT)big.ac.cn
    LncRNA Integration:
    Zhao Li: lizhao2018m(AT)big.ac.cn
    Web Development:
    Zhao Li: lizhao2018m(AT)big.ac.cn
    Shuai Jiang: jiangs(AT)big.ac.cn
    Qiang Du: duqiang2019m(AT)big.ac.cn
    Dong Zou: zoud(AT)big.ac.cn
    Data collection:
    Lin Liu: liulin(AT)big.ac.cn
    Zhao Li: lizhao2018m(AT)big.ac.cn
    Expression Analysis:
    Lin Liu: liulin(AT)big.ac.cn
    Zhao Li: lizhao2018m(AT)big.ac.cn
    Qianpeng Li: liqianpeng2019d(AT)big.ac.cn
    Changrui Feng: fengchangrui2019m(AT)big.ac.cn
    Principal Investigator:
    Zhang Zhang: zhangzhang(AT)big.ac.cn