Evaluation of predictive clustering quality
2016
ALAOUI ISMAILI, Oumaima | Lemaire, Vincent | Cornuéjols, Antoine
Predictive clustering [1] is a new supervised learning framework derived from traditional clustering. These algorithms start by identifying clusters that are pure in terms of classes and that have a high probability density. Based on the information given by the clusters, they can then predict the class of new instances. Compared to supervised classification, predictive clustering can discover the internal structure of the target class. It thus allows users to find the different reasons behind the same prediction: two heterogeneous instances could have the same predicted label. By its nature, predictive clustering incorporates the characteristics of both supervised classification and clustering. Thus, in the evaluation of predictive clustering results, three points should be taken into account: a high intra-cluster similarity, a low inter-cluster similarity and a good prediction rate. A predictive clustering quality criterion must balance these three points. In this work, we propose a new criterion for measuring predictive clustering quality. This criterion calculates the compactness and the separability of clusters using a new supervised similarity measure. This measure exploits the information given by the target class in such a way that two instances are considered similar if and only if the distance between them is small and they belong to the same class. Conversely, they are considered heterogeneous if and only if the distance between them is large and they belong to different classes. The results obtained on different simulated datasets show that the proposed criterion consistently gives the optimal number of clusters. To our knowledge, there is no analytic criterion in the state of the art that is able to measure the quality of the results generated by predictive clustering algorithms (the trade-off mentioned above) and that could therefore be compared with our proposed criterion.
Therefore, to compare our results, we use the well-known unsupervised criterion (Davies-Bouldin) [2] and two supervised criteria (Adjusted Rand Index [3] and Variation of Information [4]), and we examine whether our criterion finds a good trade-off.
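The abstract does not give the formula of the supervised similarity measure, so the following is only a minimal sketch of the idea it describes, not the authors' criterion. The function `supervised_similarity`, its `scale` parameter and the exponential decay are hypothetical choices made for illustration: the similarity is high only when two instances are close and share the same class, and it is simply set to zero when the classes differ. The second function computes the Variation of Information between two labelings from its standard definition, VI = H(A) + H(B) - 2 I(A; B), as one of the comparison criteria named above.

```python
import math
from collections import Counter


def supervised_similarity(x, y, label_x, label_y, scale=1.0):
    """Hypothetical class-aware similarity in [0, 1] (illustration only).

    Assumption: instances from different classes get similarity 0;
    instances from the same class get exp(-distance / scale).
    This is NOT the measure defined in the paper.
    """
    if label_x != label_y:
        return 0.0
    distance = math.dist(x, y)  # Euclidean distance between the two points
    return math.exp(-distance / scale)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def variation_of_information(labels_a, labels_b):
    """Variation of Information: H(A) + H(B) - 2 * I(A; B)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_information = sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )
    return entropy(labels_a) + entropy(labels_b) - 2.0 * mutual_information


if __name__ == "__main__":
    # Same class and close together: similarity near 1.
    print(supervised_similarity((0.0, 0.0), (0.1, 0.0), "a", "a"))
    # Close together but different classes: similarity 0 under this sketch.
    print(supervised_similarity((0.0, 0.0), (0.1, 0.0), "a", "b"))
    # Clusters that match the classes up to relabeling: VI is (numerically) 0.
    print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))
```

A lower Variation of Information means the clustering agrees more closely with the class labels, which is why it can serve as a supervised reference when checking the number of clusters a quality criterion selects.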
This bibliographic record was provided by Institut national de la recherche agronomique