Multivariate entropy distance method for prokaryotic gene identification

J Bioinform Comput Biol. 2004 Jun;2(2):353-73. doi: 10.1142/s0219720004000624.

Abstract

A simple new method is presented for the efficient and accurate identification of coding sequences in prokaryotic genomes. The method employs a Shannon description of an artificial language for DNA sequences: a DNA sequence is translated into a pseudo-amino acid sequence of 20 fundamental words according to the universal genetic code. Using an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes the position of that vector in the 20-dimensional phase space, which depends on whether the sequence is coding or non-coding. The ratio of the relative distances to an averaged coding EDP and an averaged non-coding EDP, taken over a small number (as few as one) of open reading frames (ORFs), is found to serve as a good coding potential. An iterative algorithm is designed to find a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it combines the coding potential with an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. When tested against the RefSeq database of NCBI, it detects 95-99% of genes while predicting 10-30% additional genes, and it detects 97.5-99.8% of confirmed genes with known functions. It is also shown to find a set of (functionally known) genes that are missed by other well-known gene-finding algorithms. All measurements show that the MED algorithm reaches a performance level similar to that of algorithms such as GeneMark and Glimmer for prokaryotic gene prediction.
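
The mapping from a DNA sequence to a 20-dimensional vector, and the distance-ratio coding potential, can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption: the EDP components are taken as s_i = -(p_i log p_i) / H, where p_i is the frequency of amino acid i and H is the Shannon entropy of the frequencies; the distance is Euclidean; and the potential is the distance to the coding center divided by the distance to the non-coding center. The function names (`translate`, `edp`, `coding_potential`) are hypothetical and are not taken from the MED software.

```python
import math
from itertools import product

# Standard ("universal") genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}
AMINO_ACIDS = sorted(set(AA) - {"*"})  # the 20 "fundamental words"


def translate(dna):
    """Translate DNA in frame 0, skipping stop codons and ambiguous codons."""
    dna = dna.upper()
    words = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3])
        if aa is not None and aa != "*":
            words.append(aa)
    return "".join(words)


def edp(dna):
    """Assumed EDP: s_i = -(p_i log p_i) / H, a 20-component vector summing to 1."""
    protein = translate(dna)
    n = len(protein)
    if n == 0:
        raise ValueError("sequence contains no translatable codons")
    freqs = [protein.count(a) / n for a in AMINO_ACIDS]
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)  # Shannon entropy H of the amino acid frequencies
    if entropy == 0:
        raise ValueError("only one amino acid type present; entropy is zero")
    return [t / entropy for t in terms]


def distance(u, v):
    """Euclidean distance (an assumption; the abstract names no metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def coding_potential(s, coding_center, noncoding_center):
    """Assumed ratio: values below 1 mean s lies closer to the coding center."""
    d_noncoding = distance(s, noncoding_center)
    if d_noncoding == 0:
        return math.inf
    return distance(s, coding_center) / d_noncoding
```

In this sketch, `coding_center` and `noncoding_center` would be EDP vectors averaged over a few known coding and non-coding ORFs. How those reference sets are chosen, and how the "root" sequences are refined iteratively, is not specified in the abstract and is left out here.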

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Amino Acids / genetics*
  • Base Sequence
  • DNA, Bacterial / genetics*
  • Gene Expression Profiling / methods*
  • Molecular Sequence Data
  • Multivariate Analysis
  • Open Reading Frames / genetics
  • Prokaryotic Cells
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA / methods*
  • Sequence Homology, Nucleic Acid

Substances

  • Amino Acids
  • DNA, Bacterial