Application of string kernels in protein sequence classification

Appl Bioinformatics. 2005;4(1):45-52. doi: 10.2165/00822942-200504010-00005.

Abstract

Introduction: Biological information is now produced much faster than it can be consumed. The key issue is how to organise and manage the huge amount of novel information so that it remains accessible. One core problem in classifying biological information is the annotation of new protein sequences with structural and functional features.

Method: This article introduces the application of string kernels to classifying protein sequences into homogeneous families. A string kernel approach used in conjunction with support vector machines has been shown to achieve good performance in text categorisation tasks. We evaluated and analysed the performance of this approach, and we present experimental results on three selected families from the SCOP (Structural Classification of Proteins) database. We then compared the overall performance of this method with existing protein classification methods on benchmark SCOP datasets.
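
As an illustration of the general idea, the following minimal sketch compares two protein sequences with a k-spectrum kernel, one simple member of the string-kernel family. The abstract does not state which string kernel was used, so the choice of kernel, the function names (kmer_counts, spectrum_kernel, normalized_kernel) and the default k=3 are assumptions made for illustration only, not the authors' method.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of s and t.

    Assumption: a k-spectrum kernel stands in for the unspecified
    string kernel of the abstract.
    """
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Iterate over the smaller table; a Counter returns 0 for absent k-mers.
    if len(ct) < len(cs):
        cs, ct = ct, cs
    return sum(count * ct[kmer] for kmer, count in cs.items())


def normalized_kernel(s, t, k=3):
    """Cosine-normalised kernel so that sequence length matters less."""
    denom = sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequences, not taken from SCOP.
    print(spectrum_kernel("ABCD", "BCDE", k=2))
    print(normalized_kernel("MKVLAAGIV", "MKVLSAGIV", k=3))
```

A matrix of such pairwise kernel values could then be supplied to a support vector machine that accepts a precomputed kernel.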

Results: According to the F1 performance measure and the rate of false positives (RFP) measure, the string kernel method performs well in classifying protein sequences. The method outperformed all the generative-based methods and is comparable with the SVM-Fisher method.
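
For readers unfamiliar with the measures named above, the sketch below computes F1 as the harmonic mean of precision and recall from confusion counts. The abstract does not define RFP, so false_positive_rate shows only one common reading, FP / (FP + TN), and may differ from the definition used in the paper. All function names and the example counts are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp, tn):
    """Assumed reading of RFP: the fraction of negatives predicted positive."""
    return fp / (fp + tn) if (fp + tn) else 0.0


if __name__ == "__main__":
    # Made-up counts for a single family, for illustration only.
    print(f1_score(tp=8, fp=2, fn=2))
    print(false_positive_rate(fp=5, tn=95))
```

A higher F1 and a lower false-positive rate both indicate better separation of family members from non-members.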

Discussion: Although the string kernel approach makes no use of prior biological knowledge, it still captures sufficient biological information to enable it to outperform some of the state-of-the-art methods.

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Motifs
  • Amino Acid Sequence
  • Artificial Intelligence*
  • Cluster Analysis
  • Conserved Sequence
  • Molecular Sequence Data
  • Pattern Recognition, Automated / methods*
  • Proteins / analysis
  • Proteins / chemistry*
  • Proteins / classification*
  • Sequence Alignment / methods*
  • Sequence Analysis, Protein / methods*
  • Sequence Homology, Amino Acid

Substances

  • Proteins