Machine learning on normalized protein sequences

Heider, Dominik; Verheyen, Jens; Hoffmann, Daniel

doi:10.1186/1756-0500-4-94

In: BMC Research Notes, vol. 4 (2011), p. 94

2011 · Journal article · OA Gold

Computer Science · Biology · Medicine · Faculty of Biology » Bioinformatics and Computational Biophysics · Research centers » Zentrum für Medizinische Biotechnologie (ZMB)

Related: 1 publication

Title in English:

Machine learning on normalized protein sequences

Author(s):

Heider, Dominik (UDE); Verheyen, Jens; Hoffmann, Daniel (UDE)

Year of publication:

2011

Open Access?:

OA Gold

DOI

10.1186/1756-0500-4-94

DuEPublico 1 ID

25639

URN

urn:nbn:de:hbz:464-20111104-092302-7

Note:

OA funding 2011

Language of text:

English

Published in

BMC Research Notes

Title in English (abbreviated):

BMC Res Notes

Place of publication:

Berlin

Publisher:

Springer Science and Business Media

in:

Vol. 4 (2011), p. 94

ISSN

1756-0500

ZDB ID

2413336-X

Abstract in English:

Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.

Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.

Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
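The normalization idea in the abstract can be illustrated with a minimal sketch. It assumes each residue is first mapped to a number and that the resulting profile is then resampled to a fixed length by linear interpolation. The abstract does not state which numerical encoding was used, so the Kyte-Doolittle hydropathy scale here is an assumption chosen for illustration, and the function names `encode` and `normalize_length` are hypothetical. The random forest classification step is not shown.

```python
# Kyte-Doolittle hydropathy values. Using this scale as the residue encoding
# is an assumption; the abstract does not specify the encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map a protein sequence (one-letter codes) to a list of numbers."""
    return [HYDROPATHY[residue] for residue in sequence.upper()]


def normalize_length(values, target_len):
    """Resample a numeric profile to target_len points by linear interpolation.

    The first and last values are preserved; intermediate points are
    weighted averages of their two nearest neighbours in the input.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalize an empty sequence")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if n == 1 or target_len == 1:
        # Degenerate cases: nothing to interpolate between.
        return [values[0]] * target_len

    result = []
    for j in range(target_len):
        # Position of output point j on the input index axis [0, n - 1].
        pos = j * (n - 1) / (target_len - 1)
        lower = int(pos)
        upper = min(lower + 1, n - 1)
        frac = pos - lower
        result.append(values[lower] * (1 - frac) + values[upper] * frac)
    return result


# Two sequences of different length become feature vectors of equal length.
short = normalize_length(encode("MKVLA"), 8)
longer = normalize_length(encode("MKVLAAGIVGLS"), 8)
```

After this step every sequence yields a vector of the same dimension, which is what a classifier that expects fixed-length input requires.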

University Bibliography

Publication index of the University of Duisburg-Essen
