LncBook is a curated knowledgebase of human lncRNAs that features a comprehensive collection of human lncRNAs and systematic curation of lncRNAs by multi-omics data integration, functional annotation and disease association.
The current implementation of LncBook houses 270,044 lncRNAs, including 1,867 featured lncRNAs with 3,762 lncRNA-function associations. It also integrates multi-omics data on expression, methylation, genome variation and lncRNA-miRNA interaction. In addition, LncBook includes 3,772 experimentally validated lncRNA-disease associations and identifies a total of 97,998 lncRNAs that are putatively disease-associated. To facilitate online analysis, LncBook provides tools such as coding potential prediction and sequence search.
LncRNAWiki is a community-curation-based human lncRNA knowledgebase, while LncBook is an expert-curation-based database.
We built LncRNAWiki in 2015 to harness collective intelligence for collecting and annotating human lncRNAs. However, LncRNAWiki is built on MediaWiki, which has significant limitations in managing structured data and supporting customized functionalities. It is desirable to organize large-scale annotations in a structured manner and to provide customized web functionalities with friendlier interfaces. More importantly, integrating multi-omics data can significantly enrich and improve lncRNA annotations. For these reasons, we constructed LncBook.
LncBook includes community-contributed annotations from LncRNAWiki, and LncRNAWiki can be considered as an important component of LncBook.
The rapid advancement in DNA sequencing technologies has led to an exponential increase in the number of human lncRNAs. To provide up-to-date, comprehensive, and high-quality annotation of human lncRNAs, we employ both community curation and expert curation.
LncRNAWiki, which uses a wiki-based community curation model, is publicly editable and its content is up-to-date, showing great promise in dealing with the flood of biological knowledge. However, it is difficult to guarantee the quality of community curation, and MediaWiki has limitations in managing structured data and supporting fundamental biological functionalities. We therefore built the expert-curation-based database LncBook, which is capable of:
We collected existing human lncRNAs from GENCODE v27, NONCODE v5.0, LNCipedia v4.1, and MiTranscriptome beta. To obtain high-confidence lncRNAs, we applied a set of strict criteria covering redundancy, background noise, mapping error, incomplete transcripts, length, and coding potential.
On the other hand, we integrated the experimentally validated lncRNAs sourced from LncRNAWiki. Their RefSeq and Ensembl references were obtained from HGNC to enable genomic location comparison. However, half of these lncRNAs cannot presently be traced back to their genomic locations and thus are not included.
Finally, we obtained a large collection of 247,246 existing lncRNAs.
We identified novel lncRNAs from 122 RNA-Seq datasets from the Human Protein Atlas. The transcript assemblies were compared against the existing lncRNAs integrated from multiple databases, and novel lncRNA transcripts were then curated with the same strict criteria used for existing lncRNAs. Finally, 21,815 novel lncRNAs were identified.
We filtered lncRNA transcripts by considering redundancy, background noise, mapping error, incomplete transcripts, length, and coding potential. Cuffcompare was used to compare datasets during data integration. Transcripts tagged with the class code ‘=’ were considered redundant. Transcripts tagged with the class codes ‘e’, ‘s’ and ‘p’ were considered background noise or mapping errors and were removed. We also removed incomplete transcripts, defined as transcripts whose first or last exon is shorter than 15 nt, and single-exon transcripts located within an exon of a multi-exon transcript on the same strand. Transcripts shorter than 200 nt were filtered out as well. Finally, we predicted coding potential using LGC, CPAT and PLEK, and retained only transcripts identified as lncRNAs by all three algorithms.
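The structural part of these criteria can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline actually used: the `Transcript` record and `rejection_reason` function are hypothetical names, transcript length is assumed to be the sum of exon lengths, and the rule for single-exon transcripts nested inside a multi-exon transcript is omitted because it needs genomic coordinates. The coding potential step is not covered here.

```python
from dataclasses import dataclass
from typing import List

# Cuffcompare class codes named in the text above.
REDUNDANT_CODES = {"="}
NOISE_CODES = {"e", "s", "p"}

MIN_TERMINAL_EXON_NT = 15  # first/last exon shorter than this => incomplete
MIN_TRANSCRIPT_NT = 200    # transcripts shorter than this are dropped


@dataclass
class Transcript:
    """Hypothetical record; assumes at least one exon."""
    transcript_id: str
    class_code: str          # Cuffcompare class code
    exon_lengths: List[int]  # exon lengths in nt, 5' to 3'


def rejection_reason(t: Transcript) -> str:
    """Return why a transcript is filtered out, or "" if it passes."""
    if t.class_code in REDUNDANT_CODES:
        return "redundant"
    if t.class_code in NOISE_CODES:
        return "noise_or_mapping_error"
    first, last = t.exon_lengths[0], t.exon_lengths[-1]
    if first < MIN_TERMINAL_EXON_NT or last < MIN_TERMINAL_EXON_NT:
        return "incomplete"
    # Assumption: transcript length is the summed exon length.
    if sum(t.exon_lengths) < MIN_TRANSCRIPT_NT:
        return "too_short"
    return ""
```

Returning a reason string rather than a boolean makes it easy to count how many transcripts each criterion removes.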
LncRNA transcripts and genes are assigned LncBook accession numbers. Each lncRNA transcript is assigned a unique accession number prefixed with HSALNT. Likewise, each lncRNA gene has an accession number prefixed with HSALNG.
To facilitate bioinformatics analysis and functional study of lncRNAs, we performed large-scale multi-omics analysis and provided tissue expression, methylation, variation, and lncRNA-miRNA information in LncBook.
LncBook incorporates 3,772 experimentally validated lncRNA-disease associations and further identifies a total of 97,998 lncRNAs that are putatively disease-associated.
The associations between lncRNAs and diseases were derived from LncRNADisease and LncRNAWiki, which have been curated from published literature with experimental evidence. Each lncRNA-disease association was annotated with disease name, MeSH Ontology (Medical Subject Headings 2018 name), dysfunction type, detailed description, and publication. Among the 1,867 featured lncRNAs (experimentally validated lncRNAs), 1,502 lncRNAs are associated with 462 diseases.
On the other hand, we predicted disease-associated lncRNAs according to evidence from methylation, genome variation, and lncRNA-miRNA interaction: (1) Methylation: in each cancer, an lncRNA whose promoter-region methylation level is increased (decreased) in 80% of cancer samples relative to normal samples is considered hypermethylated (hypomethylated). An lncRNA is considered cancer-associated if it is consistently hypermethylated or hypomethylated in at least eight cancers; (2) Genome Variation: any lncRNA whose genomic location overlaps disease-associated SNPs from COSMIC (OCCURRENCE >= 3) or ClinVar is considered disease-associated; (3) Interaction: any lncRNA interacting with at least 11 disease-associated miRNAs (each associated with at least 5 diseases according to the Human microRNA Disease Database, HMDD) is considered disease-associated.
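The three rules can be written as simple threshold checks. The sketch below is illustrative only: all function and parameter names are hypothetical, the inputs are assumed to be already summarized per lncRNA, and "consistently" is interpreted as the same direction of methylation change in at least eight cancers.

```python
from typing import Dict, List, Set

MIN_CONSISTENT_CANCERS = 8   # methylation rule
MIN_COSMIC_OCCURRENCE = 3    # variation rule (COSMIC OCCURRENCE >= 3)
MIN_DISEASE_MIRNAS = 11      # interaction rule
MIN_DISEASES_PER_MIRNA = 5   # a miRNA counts if linked to >= 5 diseases


def methylation_evidence(calls: Dict[str, str]) -> bool:
    """calls maps cancer name -> "hyper", "hypo", or "none"."""
    hyper = sum(1 for c in calls.values() if c == "hyper")
    hypo = sum(1 for c in calls.values() if c == "hypo")
    # Assumption: the eight cancers must agree in direction.
    return max(hyper, hypo) >= MIN_CONSISTENT_CANCERS


def variation_evidence(cosmic_occurrences: List[int], has_clinvar_snp: bool) -> bool:
    """cosmic_occurrences holds the OCCURRENCE value of each overlapping COSMIC SNP."""
    return has_clinvar_snp or any(
        n >= MIN_COSMIC_OCCURRENCE for n in cosmic_occurrences
    )


def interaction_evidence(partner_mirnas: Set[str],
                         mirna_disease_counts: Dict[str, int]) -> bool:
    """mirna_disease_counts maps miRNA -> number of associated diseases."""
    qualifying = [
        m for m in partner_mirnas
        if mirna_disease_counts.get(m, 0) >= MIN_DISEASES_PER_MIRNA
    ]
    return len(qualifying) >= MIN_DISEASE_MIRNAS
```

Keeping the three checks separate lets a caller report which kinds of evidence support a given lncRNA.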
Note, however, that this does not mean these disease-associated lncRNAs play causative roles in diseases. We plan further curation to identify lncRNAs that play functional causative roles, which would be more significant and useful.
We systematically curated 1,653 lncRNAs (that were sourced from LncRNAWiki) with functional annotations reported in 2,501 publications and described their functional mechanisms and biological processes with controlled vocabularies.
To profile expression levels for lncRNAs, two RNA-Seq datasets were used: HPA (the Human Protein Atlas, covering 32 normal human tissues) and GTEx (the Genotype-Tissue Expression project, covering 53 normal human tissues). We filtered out lncRNAs whose highest expression values are lower than 0.5 TPM/FPKM. Then, the τ-value and the cv (coefficient of variation) were used to determine HK (housekeeping) lncRNAs (τ-value <= 0.5 and cv <= 0.5) and TS (tissue-specific) lncRNAs (τ-value >= 0.95).
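The thresholds above can be applied as in the following sketch. The text does not give the formulas, so two assumptions are made here: τ is computed with the commonly used tissue-specificity index, τ = Σ(1 − x_i / x_max) / (n − 1), on untransformed expression values, and cv is the sample standard deviation divided by the mean. The function names and the "filtered"/"other" labels are hypothetical.

```python
from statistics import mean, stdev
from typing import List

MIN_EXPRESSION = 0.5  # TPM/FPKM; lncRNAs peaking below this are dropped
HK_MAX_TAU = 0.5
HK_MAX_CV = 0.5
TS_MIN_TAU = 0.95


def tau(values: List[float]) -> float:
    """Tissue-specificity index (assumed formula): 0 = uniform, 1 = one tissue."""
    x_max = max(values)
    n = len(values)
    if n < 2 or x_max <= 0:
        raise ValueError("need >= 2 tissues and a positive maximum")
    return sum(1 - v / x_max for v in values) / (n - 1)


def cv(values: List[float]) -> float:
    """Coefficient of variation, assumed here as sample stdev / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean expression is zero")
    return stdev(values) / m


def classify(values: List[float]) -> str:
    """Return "filtered", "HK", "TS", or "other" for one lncRNA."""
    if max(values) < MIN_EXPRESSION:
        return "filtered"
    t = tau(values)
    if t <= HK_MAX_TAU and cv(values) <= HK_MAX_CV:
        return "HK"
    if t >= TS_MIN_TAU:
        return "TS"
    return "other"
```

For example, `classify([10.0] * 32)` falls in the housekeeping class because both τ and cv are zero, while an lncRNA expressed in a single tissue has τ = 1 and is called tissue-specific.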
The predicted lncRNAs (those integrated from other databases and the novel lncRNAs predicted from RNA-Seq data analysis) were filtered according to the coding potential predictions of CPAT, PLEK, and LGC. Only those identified as lncRNAs by all three tools were retained. However, all experimentally validated lncRNAs were integrated regardless of their coding potential.
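The consensus rule, with its exception for experimentally validated lncRNAs, amounts to a unanimous vote. The sketch below assumes the per-tool results have already been parsed into booleans; it does not run CPAT, PLEK, or LGC, and `keep_as_lncrna` is a hypothetical name.

```python
from typing import Dict

TOOLS = ("CPAT", "PLEK", "LGC")


def keep_as_lncrna(noncoding_calls: Dict[str, bool],
                   experimentally_validated: bool = False) -> bool:
    """noncoding_calls maps tool name -> True if the tool called it non-coding."""
    if experimentally_validated:
        # Validated lncRNAs are kept regardless of coding potential.
        return True
    # A missing tool result is treated as "not non-coding" (an assumption).
    return all(noncoding_calls.get(tool, False) for tool in TOOLS)
```

Requiring all three tools to agree trades sensitivity for confidence: a single dissenting call is enough to drop a predicted transcript.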
Based on their genomic locations with respect to protein-coding genes, we classified lncRNAs into seven groups: Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense, and Antisense. "S" in the brackets indicates that the lncRNA is on the same strand as the protein-coding RNA, and "AS" indicates that it is on the antisense strand.
Users can search by different terms in the homepage:
Also, it is easy to search for a specific lncRNA or a specific group of lncRNAs in each resource, e.g., “Disease”, “Function”, “Expression”, “Methylation”.
LncBook provides a series of tools to facilitate online analysis, including BLAST, ID conversion, coding potential prediction, and genomic positional annotation. The coding potential prediction tool, LGC, was developed by our team, and the manuscript has been submitted. If you use this tool in your work, please cite the paper on bioRxiv.
Some featured lncRNAs cannot presently be traced back to their genomic locations and thus were not annotated with multi-omics data. We do not create specific transcript pages for them.
You can access all the experimentally validated lncRNAs directly by clicking “Featured lncRNAs” on the homepage. There are 1,867 featured lncRNAs, which have been curated for their functional mechanisms and biological processes and are linked to the corresponding publications. Clicking the Transcript ID will direct you to the transcript page, which contains all the annotated information of the lncRNA transcript, including genomic location, tissue expression profile, methylation level, variation, lncRNA-miRNA interaction, function, and disease. Clicking the symbol will direct you to the wiki annotation page, where community annotations are shown.
You can also access all the associated publications of these lncRNAs on the “Featured lncRNAs” page by clicking “Literature” in the top right corner. Title, publication date, journal, citation, gene name and PMID are shown in a table.
You can access the 3,772 experimentally validated lncRNA-disease associations and the 97,998 putatively disease-associated lncRNAs on the “Diseases” page.
The experimentally validated lncRNA-disease associations can be accessed by clicking “Validated”. On the “Validated” page, you can search by “Disease” or “Mesh Ontology”.
On the “Predicted” page, you can browse the disease-associated lncRNAs that were predicted from the pathogenic evidence of methylation, variation, and interaction. These lncRNAs are grouped into different grades, and the number of stars indicates the amount of evidence.
You can access all the lncRNA-function associations directly by clicking “Function” in the homepage. In the “Function” page, you can search by “Functional Mechanism” or “Biological Process”, or other items.
In the “Expression” page, you can specify the dataset, tissue breadth, cv (coefficient of variation) and τ-value of interest and then you will find all relevant information associated with these conditions.
You can access the methylation profile directly by clicking “Methylation” in the homepage. The mean methylation level (promoter region) of all cancer samples is shown in the table. To view promoter or body region methylation level distribution among both normal and cancer samples, you can click “Promoter Chart” or “Body Chart”.
You can access all the SNPs (derived from dbSNP) of lncRNAs directly by clicking “Variation” in the homepage. SNPs can be browsed by specifying genomic location or dbSNP ID.
You can access all the lncRNA-miRNA interactions directly by clicking “Interaction”. Experimental evidence was obtained from StarBase. “Interaction” can be browsed by miRNA or lncRNA ID.
Go to the “Download” page to download the data. We recommend Chrome, Firefox, or another browser other than Safari, because the FTP is incompatible with Safari in some cases and some download files may not be displayed on the download page.
LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res 2019, in press.
LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs. Nucleic Acids Res 2015. [PMID=25399417]
The BIG Data Center: from deposition to integration to translation. Nucleic Acids Res 2017. [PMID=27899658]
Database Resources of the BIG Data Center in 2018. Nucleic Acids Res 2018. [PMID=29036542]