Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets

Batista, Gustavo E. A. P. A. ; Monard, Maria C. ; Bazzan, Ana L. C.

Knowledge Exploration in Life Science Informatics, p.20-32 [Peer-reviewed journal]

Berlin, Heidelberg: Springer Berlin Heidelberg

  • Title:
    Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets
  • Authors: Batista, Gustavo E. A. P. A. ; Monard, Maria C. ; Bazzan, Ana L. C.
  • Subjects: Automate Annotation ; Class Distribution ; Class Imbalance ; Minority Class ; True Positive Rate
  • Is part of: Knowledge Exploration in Life Science Informatics, p.20-32
  • Description: There is an overwhelming increase in submissions to genomic databases, which poses a problem for database maintenance, especially regarding the annotation of fields left blank during submission. In order not to include all data as submitted, one possible alternative consists of performing the annotation manually. A less resource-demanding alternative is automatic annotation. The latter helps the curator, since predicting the properties of each protein sequence manually is becoming a bottleneck, at least for protein databases. Machine Learning (ML) techniques have been used to generate automatic annotation and to help curators. A challenging problem for automatic annotation is that traditional ML algorithms assume a balanced training set. However, real-world data sets are predominantly imbalanced (skewed), i.e., there is a large number of examples of one class compared with just a few examples of the other class. This is the case for protein databases, where a large number of proteins are not annotated for every feature. In this work we discuss some over- and under-sampling techniques that deal with class imbalance. A new method to deal with this problem, which combines two known over- and under-sampling methods, is also proposed. Experimental results show that the symbolic classifiers induced by C4.5 on data sets after applying known over- and under-sampling methods, as well as the new proposed method, are always more accurate than the ones induced from the original imbalanced data sets. Therefore, this is a step towards producing more accurate rules for automating annotation.
  • Publisher: Berlin, Heidelberg: Springer Berlin Heidelberg
  • Language: English
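
The description refers to over- and under-sampling as ways of balancing a skewed training set. As a rough illustration only, the sketch below shows the simplest random variants of each idea in plain Python. It is not the method proposed in the paper, whose specific sampling techniques are not named in this record; the function names, the toy data, and the class labels "annotated" and "unannotated" are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def random_undersample(examples, labels, seed=0):
    """Keep a random subset of each class so that every class has as many
    examples as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(examples, labels) if y == cls]
        kept = rng.sample(pool, target)
        out_x.extend(kept)
        out_y.extend([cls] * target)
    return out_x, out_y


# Hypothetical toy data: 2 minority-class and 8 majority-class examples.
X = [[i] for i in range(10)]
y = ["annotated"] * 2 + ["unannotated"] * 8

over_X, over_y = random_oversample(X, y)
under_X, under_y = random_undersample(X, y)
print(Counter(over_y))
print(Counter(under_y))
```

Random over-sampling keeps all original examples but repeats minority ones, while random under-sampling discards majority examples; either balanced set could then be passed to a rule or tree learner in place of the original skewed one.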
