Sequence clustering strategies improve remote homology recognitions while reducing search times

Weizhong Li; Lukasz Jaroszewski; Adam Godzik

doi:10.1093/protein/15.8.643

Protein Eng. 2002 Aug;15(8):643-9. doi: 10.1093/protein/15.8.643.

Authors

Weizhong Li¹, Lukasz Jaroszewski, Adam Godzik

Affiliation

¹ The Burnham Institute, La Jolla, CA 92037, USA.

PMID: 12364578

Abstract

Sequence databases are growing rapidly, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems while also speeding up searches for homologous proteins and improving the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequences, with a two-step PSI-BLAST (PDB-BLAST), which separates the sequence space used for building the profile from the sequence space used for homology searching. The combination of these strategies improves distant homology recognition by more than 100%, while using only 10% of the CPU time of a standard PSI-BLAST search. Another method, intermediate profile searches, allows the exploration of additional search directions that would otherwise be overshadowed by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.
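
The database clustering step described above can be illustrated with a minimal sketch of greedy redundancy removal. This is not the authors' program, and the abstract gives neither the identity measure nor the threshold they used. The function names (`identity`, `greedy_cluster`), the ungapped identity measure, the 0.9 threshold, and the toy sequences are all assumptions chosen for illustration. Sequences are processed longest first, and each one either joins the first existing cluster whose representative it resembles closely enough or becomes a new representative.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence.

    Assumption: sequences are compared left-aligned without gaps. A real
    clustering tool would use a proper alignment or a word-based filter.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / n


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy incremental clustering; returns {representative: [members]}.

    Assumption: the 0.9 threshold is illustrative only.
    """
    clusters: dict[str, list[str]] = {}
    # Longest sequences first; ties are broken by name so the result is deterministic.
    for name in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no close representative: start a new cluster
    return clusters


# Hypothetical toy database: "b" and "c" are near-duplicates of "a"; "d" is unrelated.
toy_db = {
    "a": "MKTAYIAKQRQISFVKSHFS",
    "b": "MKTAYIAKQRQISFVKSHFA",
    "c": "MKTAYIAKQR",
    "d": "GSHMLEDPVAGGWTRKLLPQ",
}
clusters = greedy_cluster(toy_db, threshold=0.9)
representatives = sorted(clusters)  # the reduced, non-redundant database
```

Searching only the representatives shrinks the database that the profile search has to scan, which is the kind of saving the abstract attributes to clustering. The size of that saving depends on the real database and threshold, which this sketch does not model.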

Publication types

Comparative Study
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Amino Acid Motifs
Amino Acid Sequence
Cluster Analysis*
Databases, Protein*
Models, Molecular
Molecular Sequence Data
Protein Folding
Protein Structure, Tertiary
Proteins / chemistry*
Proteins / classification
Sequence Alignment
Sequence Homology, Amino Acid*
Software

Substances

Proteins

Grants and funding

GM60049/GM/NIGMS NIH HHS/United States