Cahiers du CEREMADE

Unité Mixte de Recherche du C.N.R.S. N°7534
Abstract: The $k$-nearest-neighbour procedure is a well-known deterministic method used in supervised classification. Although it has been superseded by more recent machine-learning methods, it remains an essential baseline for classifiers. This paper proposes a reassessment of this approach as a statistical technique derived from a proper probabilistic model; in particular, we modify the assessment made in a previous analysis of this method by Holmes and Adams (2002, 2003), where the underlying probabilistic model is not completely well-defined. Once clear probabilistic foundations for the $k$-nearest-neighbour procedure are established, we derive practical computational tools for conducting Bayesian inference on the parameters of the corresponding model. In particular, we assess the difficulties inherent in pseudo-likelihood and path-sampling approximations of a missing normalising constant, and propose a perfect-sampling strategy to implement a correct MCMC sampler associated with our model. Illustrations of the performance of the corresponding Bayesian classifier are provided for two benchmark datasets, demonstrating in particular the limitations of the pseudo-likelihood approximation in this set-up.
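For readers unfamiliar with the baseline method, the classical (deterministic, non-Bayesian) $k$-nearest-neighbour rule classifies a point by majority vote among the labels of its $k$ closest training points. The sketch below is illustrative only, using standard-library Python with a hypothetical helper name `knn_predict` and toy data; it is not the probabilistic model developed in the paper.

```python
from collections import Counter
import math

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training points.

    train  : list of feature tuples
    labels : list of class labels aligned with train
    x      : feature tuple to classify
    """
    # Euclidean distance from x to every training point, then sort
    dists = sorted((math.dist(p, x), y) for p, y in zip(train, labels))
    # Majority vote among the k closest labels
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-class example
train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, (0.2, 0.1), k=3))  # → a
```

The Bayesian reassessment in the paper replaces this deterministic vote with inference under an explicit probabilistic model for the class labels.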
A Bayesian reassessment of nearest-neighbour classification
Université de PARIS - DAUPHINE
Place du Maréchal de Lattre De Tassigny - 75775 PARIS CEDEX 16 - FRANCE
Telephone: +33 (0)1 44-05-49-23 - Fax: +33 (0)1 44-05-45-99