Because only a limited number of organisms have high-quality genome sequences and gene annotations, computational identification of lncRNAs in species with little or no appropriate training data remains a major challenge. Here, we present LGC, an algorithm that uses a geometric distribution to model the relationship between ORF (open reading frame) length and GC content and can distinguish lncRNAs from protein-coding RNAs accurately without specific training. Across nine representative organisms, LGC performs accurately (average accuracy: 0.929) and robustly (variance of accuracies: 4.54 × 10⁻⁴): human (sensitivity: 0.964, specificity: 0.921), mouse (sensitivity: 0.96, specificity: 0.969), rat (sensitivity: 0.927, specificity: 0.858), zebrafish (sensitivity: 0.939, specificity: 0.882), fly (sensitivity: 0.908, specificity: 0.913), worm (sensitivity: 0.908, specificity: 0.996), rice (sensitivity: 0.909, specificity: 0.953), tomato (sensitivity: 0.882, specificity: 0.992) and potato (sensitivity: 0.844, specificity: 0.995). Furthermore, LGC runs at high speed and can easily be configured for multi-threaded parallel computing to process large-scale transcriptome data quickly. Overall, LGC is simple, reliable, fast, and easy to use, and we believe it will become a practical tool that plays an important role in the exploration of large numbers of novel lncRNAs.
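The two features named in the abstract, ORF length and GC content, can be illustrated with a minimal Python sketch. This is not LGC's implementation and does not reproduce its scoring model. The function names (`gc_content`, `longest_orf`, `orf_features`) are hypothetical, and the choices to scan only the three forward frames, to require an ATG start, and to measure GC content on the ORF rather than on the whole transcript are assumptions made for illustration.

```python
# Illustrative feature extraction only; NOT the LGC scoring model.
# Assumptions: forward strand only, ORF = ATG ... first in-frame stop,
# GC content measured on the longest ORF.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included) found in
    the three forward reading frames, or "" if there is none."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq: str) -> tuple[int, float]:
    """Return (ORF length in nucleotides, GC content of that ORF)."""
    orf = longest_orf(seq)
    return len(orf), gc_content(orf)


if __name__ == "__main__":
    length, gc = orf_features("CCATGGCGTAAGG")
    print(length, round(gc, 3))
```

A classifier in the spirit of the abstract would then relate these two values to each other for every transcript; how LGC does that is described in the paper, not in this sketch.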
|firstname.lastname@example.org (July 12, 2017)|