LGC: A Feature Relationship Based Algorithm to Distinguish Protein Coding and Long Non-coding RNAs


Because only a limited number of organisms have high-quality genome sequences and gene annotations, computational identification of lncRNAs in species with little or no appropriate training data remains a major challenge. Here, we present an algorithm, LGC, which uses a geometric distribution to model the relationship between ORF (Open Reading Frame) length and GC content and is able to distinguish lncRNAs from protein-coding RNAs accurately without species-specific training. We demonstrate the accuracy and robustness of LGC in nine representative organisms. LGC performs accurately (average accuracy: 0.929) and robustly (variance of accuracies: 4.54 × 10⁻⁴) on these organisms: human (sensitivity: 0.964, specificity: 0.921), mouse (sensitivity: 0.96, specificity: 0.969), rat (sensitivity: 0.927, specificity: 0.858), zebrafish (sensitivity: 0.939, specificity: 0.882), fly (sensitivity: 0.908, specificity: 0.913), worm (sensitivity: 0.908, specificity: 0.996), rice (sensitivity: 0.909, specificity: 0.953), tomato (sensitivity: 0.882, specificity: 0.992) and potato (sensitivity: 0.844, specificity: 0.995). Furthermore, LGC runs at high speed and can easily be configured for multi-threaded parallel computing to process large-scale transcriptome data quickly. Overall, LGC is simple, reliable, fast, and easy to use, and we believe it will become a practical tool and play an important role in the exploration of large quantities of novel lncRNAs.
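The two features named in the description, ORF length and GC content, can be computed directly from FASTA input. The following is a minimal Python sketch of that feature extraction only. It is not LGC's implementation and does not reproduce its scoring model. The function names, the ORF definition (ATG to the first in-frame stop codon, forward strand, three frames) and the use of whole-transcript GC content are assumptions made for illustration.

```python
# Illustrative sketch only: extracts the two features mentioned in the
# description (ORF length, GC content). The ORF definition and all names
# here are assumptions, not LGC's actual code.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG..stop ORF
    found in the three forward-strand reading frames; 0 if none."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record id.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def orf_gc_features(fasta_text):
    """Map each record id to (longest ORF length, GC content)."""
    return {
        name: (longest_orf_length(seq), gc_content(seq))
        for name, seq in read_fasta(fasta_text).items()
    }
```

Calling `orf_gc_features` on the text of a FASTA file gives one `(orf_length, gc)` pair per transcript. A classifier in the spirit of the description would then score how these two values relate, which this sketch deliberately leaves out.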





Community Ratings

Rated by 1 user
dot***g@gmail.com (July 12, 2017)
Tool Type: Toolkit
Category: Protein-coding gene prediction
Technologies: Java, Python 2
User Interface: Terminal Command Line, Webpage
Input Data: FASTA
Latest Release: 1.0 (June 26, 2017)
Download Count: 11
Submitted By: wangfan@big.ac.cn