Ensembles of classifiers for multi-class classification problems: one versus one, imbalanced data-sets and difficult classes
- Mikel Galar Idoate (Author)
- Edurne Barrenechea Tartas (Director)
- Francisco Herrera Triguero (Co-director)
- Alberto Fernández Hilario (Co-director)
Defence university: Universidad Pública de Navarra
Date of defence: 18 July 2012
- Pedro Melo Pinto (Chair)
- Miguel Pagola Barrio (Secretary)
- César García Osorio (Committee member)
Type: Thesis
Abstract
The construction of classifiers is a key issue in Machine Learning. There exist different classifier learning paradigms, whose main differences lie in the learning procedure and in the type of model inferred, that is, its interpretability and how it stores the extracted knowledge. However, regardless of the paradigm used to build the classifier, the combination of classifiers usually leads to systems with better accuracy, measured as the percentage of correctly classified examples among those not used in the learning phase. Systems combining several classifiers are referred to as ensembles or multi-classifiers. The ability of ensembles of classifiers to increase accuracy over systems based on a single classifier has been proved in several applications. In these systems the aggregation phase, that is, how the outputs of the base classifiers are combined in order to predict the final class, is a key factor.

In this dissertation, we focus on studying the usage of ensembles of classifiers in two different fields where they have shown to be beneficial, as well as on another problem that arises from their use in some models.

Classification problems with multiple classes: Classification problems can be divided into two types, depending on the number of classes in the problem: binary and multi-class problems. In general, it is easier to build a classifier that distinguishes only between two classes than one that considers more than two, since the decision boundaries in the former case can be simpler. This is why binarization techniques arose: they deal with multi-class problems by dividing the original problem into several easier-to-solve binary classification problems, each addressed by an independent binary classifier. In order to classify a new instance, the outputs of all the classifiers in the ensemble are combined to decide the class used to label the instance.
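The One-vs-One decomposition described above can be sketched in a few lines: one binary classifier is trained per pair of classes, and a new instance is labelled by majority voting over the pairwise predictions. This is only a minimal illustration of the general scheme; the function and parameter names (`ovo_train`, `train_binary`, etc.) are illustrative, not taken from the thesis.

```python
from itertools import combinations

def ovo_train(X, y, train_binary):
    """One-vs-One decomposition: train one binary classifier per pair of
    classes, using only the examples belonging to that pair.
    train_binary(Xs, ys) is any binary learner returning a predict function."""
    classes = sorted(set(y))
    models = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, label in enumerate(y) if label in (a, b)]
        models[(a, b)] = train_binary([X[i] for i in idx], [y[i] for i in idx])
    return classes, models

def ovo_predict(x, classes, models):
    """Aggregation phase: each pairwise classifier votes for one of its two
    classes; the class with the most votes labels the instance."""
    votes = {c: 0 for c in classes}
    for clf in models.values():
        votes[clf(x)] += 1
    return max(votes, key=votes.get)
```

Any binary learner can be plugged in as `train_binary`; the aggregation here is plain majority voting, the simplest of the combination strategies studied in the thesis.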
The usage of ensembles in multi-class classification problems makes it possible to improve the results obtained when a single classifier is used to distinguish all the classes at the same time, due to the simplification of the initial problem. In this context, we have first studied the different aggregations for the One-vs-One and One-vs-All strategies, and we have proposed a novel methodology based on dynamic classifier selection techniques in order to improve classification in the One-vs-One scheme by avoiding non-competent classifiers.

The class imbalance problem: The class imbalance problem refers to data-sets in which the number of instances of the different classes differs greatly, that is, their presence in the data-set is not as balanced as the classifier expects. This problem is usually studied in binary problems, where one of the classes, usually the class of interest (the positive or minority class), is under-represented in the data-set. This class is usually much more difficult to distinguish, and classifiers achieve low classification rates over its examples. The usage of ensembles in combination with techniques commonly applied to the class imbalance problem (such as data pre-processing or cost-sensitive techniques) has shown its ability to improve accuracy on data-sets suffering from this problem, enhancing the results obtained by single classifiers using the previously mentioned techniques alone. We have proposed a new taxonomy in order to classify these approaches and to study the most robust methods in the literature. Afterwards, we have proposed a novel ensemble method which outperforms the previous ones by combining the evolutionary undersampling method with a diversity promotion mechanism.

The problem of difficult classes: The difficult classes problem is more general than the class imbalance problem.
A class is said to be difficult whenever the classification accuracy that classifiers achieve over it is much lower than that over the other classes, which can lead to it being ignored. Despite its importance, this problem has not received much attention in the specialized literature. There exist several real-world problems where an equal recognition rate over all classes is much more important than being accurate over only some of them, for example, the classification of traffic signs, author identification, cancer diagnosis, pattern detection in videos, or texture classification. We have studied this problem in the One-vs-One strategy, both theoretically and empirically, and we have presented a new aggregation strategy which accounts for the difficult classes problem, trying to balance the classification rate over all classes without needing to alter the underlying base classifiers.
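One generic way to bias an aggregation towards difficult classes, in the spirit of the discussion above, is to rescale each class's accumulated One-vs-One scores by the inverse of its recall on a validation set, so that poorly recognized classes receive a compensating boost. The sketch below shows only this generic idea; the thesis defines its own aggregation strategy, and every name here is illustrative.

```python
def class_weighted_vote(score_matrix, classes, class_recall):
    """Illustrative difficult-class-aware aggregation: sum each class's
    pairwise scores (score_matrix[(c, other)] = confidence that c beats
    other), then divide by that class's validation recall so that classes
    the ensemble recognizes poorly get proportionally stronger votes.
    NOT the thesis's aggregation; a generic reweighting sketch only."""
    weighted = {}
    for c in classes:
        total = sum(score_matrix.get((c, other), 0.0)
                    for other in classes if other != c)
        weighted[c] = total / max(class_recall[c], 1e-9)  # avoid zero division
    return max(weighted, key=weighted.get)
```

Note that, like the aggregation proposed in the thesis, this operates purely on the classifier outputs: the underlying base classifiers are left untouched.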