discriminative – Project X Research

January 14 2018

Efficient K-Nearest Neighbours

Sergey Kosov | Article | classification, DGM, discriminative

The K-nearest neighbours classifier (KNN) is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The KNN approach is thus among the simplest of all discriminative approaches, yet it remains especially effective for low-dimensional feature spaces. In practice, however, the KNN model is problematic because it is slow for large datasets represented in high-dimensional feature spaces and for a large number of neighbours K. In this article we address exactly this problem.

The input for the KNN algorithm consists of the K closest training samples in the feature space and the output is a class label l. An observation (or testing sample) y is classified by a majority vote of its neighbours, with the observation being labelled by the class most common among its K nearest neighbours (see figure below, center). In case of K = 1 the class of that single nearest neighbour is simply assigned to the observation y.
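The majority vote described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the DGM implementation; the function name `knn_classify` and the toy training set are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train, y, k):
    """Label observation y by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs; distances are Euclidean.
    """
    # Brute-force search: sort all training samples by distance to y.
    neighbours = sorted(train, key=lambda s: math.dist(s[0], y))[:k]
    votes = Counter(label for _, label in neighbours)
    # Pick the label with the most votes; ties go to the label seen first,
    # i.e. the one belonging to the nearer neighbour.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two classes in a 2-dimensional feature space.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
         ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
label_k1 = knn_classify(train, (0.2, 0.1), k=1)  # class of the single nearest sample
label_k3 = knn_classify(train, (0.8, 0.9), k=3)  # majority among three neighbours
```

With K = 1 the function simply returns the class of the single nearest training sample, as stated in the text.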

Figure (left to right): the original distributions of 160'000 samples from the dataset; the resulting k-Nearest Neighbors decision map; the k-Nearest Neighbors classifier.

In order to estimate the potentials, we consider the class of every neighbour as a vote for the most likely class of the observation. If K_l is the number of neighbours having class l, we can define the probability of the association potentials as (see figure above, right):

$p(x=l\,|\,y)=\frac{K_l}{K}$

It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1 / r, where r is the distance to the neighbor. For our weighting scheme we modify this idea as follows: let r be the Euclidean distance from the test sample to the nearest training sample in feature space and r_i the Euclidean distance to the i-th found neighbor. Then we can rewrite the previous equation with a weighting coefficient:

$p(x=l\,|\,y)=\frac{1}{K}\sum_i{\frac{1_l}{(1+r_i-r)^2}},$

where 1_l means 1 if the class of the training sample is l and 0 otherwise.
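The weighted potentials can be illustrated with a short Python sketch. It uses brute-force search and a hypothetical one-dimensional training set; the function name `weighted_potentials` is an assumption for this example and does not come from the DGM library.

```python
import math

def weighted_potentials(train, y, k, labels):
    """Return {label: p(x = label | y)} using the weights 1 / (1 + r_i - r)^2."""
    # Distances to all training samples, nearest first, truncated to k neighbours.
    dists = sorted(((math.dist(f, y), lab) for f, lab in train),
                   key=lambda t: t[0])[:k]
    r = dists[0][0]  # distance to the nearest training sample
    p = {lab: 0.0 for lab in labels}
    for r_i, lab in dists:
        # The indicator 1_l is realised by adding the weight only to the
        # neighbour's own class.
        p[lab] += 1.0 / (1.0 + r_i - r) ** 2
    return {lab: v / k for lab, v in p.items()}

# Hypothetical data: neighbours at distances 1, 2 and 4 from y = 0.
train = [((1.0,), "a"), ((2.0,), "b"), ((4.0,), "b")]
p = weighted_potentials(train, (0.0,), k=3, labels=["a", "b"])
# Weights are 1, 1/4 and 1/16, so p["a"] = 1/3 and p["b"] = (1/4 + 1/16) / 3.
```

Because every weight is at most 1, the resulting values need not sum to one; they coincide with K_l / K only when all K neighbours lie at the same distance r.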

Optimization

The search algorithm usually aims to find exactly K nearest neighbors. However, distant neighbors may not affect the probability p(x = l|y) much. For example, the nearest neighbor, with r_i = r, contributes a value of 1 / K to the probability, while a neighbor twice as distant from the testing sample (r_i = 2r) contributes only 1 / (K(1 + r)²). For optimization purposes we stop the search once the distance from the test sample to the next nearest neighbor exceeds 2r. Thus, only the K’ ≤ K neighbors in the area enclosed between two spheroids of radii r and 2r are considered (see figure below) and weighted according to the equation: p(x = l|y) = K_l / K’.

Illustration of the nearest neighbors screening: if the distance to the nearest neighbor is r, we take into consideration only those neighbors that lie closer than 2r.
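The screening rule can be sketched as follows. For simplicity this sketch finds the K nearest neighbours first and filters them afterwards, whereas an implementation that stops the search early would avoid visiting the rejected neighbours at all. The function name `screened_potentials` and the `factor` parameter (2 for the 2r rule) are assumptions introduced for illustration.

```python
import math

def screened_potentials(train, y, k, labels, factor=2.0):
    """Return ({label: K_l / K'}, K') using only neighbours with r_i <= factor * r."""
    dists = sorted(((math.dist(f, y), lab) for f, lab in train),
                   key=lambda t: t[0])[:k]
    r = dists[0][0]  # distance to the nearest neighbour
    # Keep the neighbours lying between the spheres of radii r and factor * r.
    # The nearest neighbour always passes the test when factor >= 1, so K' >= 1.
    kept = [lab for r_i, lab in dists if r_i <= factor * r]
    k_prime = len(kept)
    return {lab: kept.count(lab) / k_prime for lab in labels}, k_prime

# Hypothetical data: neighbours at distances 1.0, 1.5, 2.5 and 4.0 from y = 0.
train = [((1.0,), "a"), ((1.5,), "a"), ((2.5,), "b"), ((4.0,), "b")]
p2, k2 = screened_potentials(train, (0.0,), k=4, labels=["a", "b"])              # 2r rule
p4, k4 = screened_potentials(train, (0.0,), k=4, labels=["a", "b"], factor=4.0)  # 4r rule
```

In this toy case the 2r rule keeps only the two neighbours of class "a", while the 4r rule keeps all four and splits the potentials evenly.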

The neighbors are taken from a set of objects for which the class is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity of the KNN algorithm is that it is sensitive to the local structure of the data.

Evaluation

Our implementation of the KNN model in the DGM C++ library is based on the KD-tree data structure, which is used to store points in k-dimensional space. Leaves of the KD-tree store feature vectors with the corresponding ground truth, and every such feature vector is stored in one and only one leaf. Tree nodes correspond to axis-oriented splits of the space. Each split divides the space and the dataset into two distinct parts. Subsequent splits from the root node to one of the leaves remove parts of the dataset until only a small part of the dataset (a single feature vector) is left.

KD-trees allow efficient searches of the kind “K nearest neighbors among N samples”. Considering the number of dimensions k fixed and a dataset of N training samples, the time complexity of building a KD-tree is O(N · logN) and of finding K nearest neighbors is close to O(K · logN). However, the efficiency decreases as the dimensionality k grows, and in high-dimensional spaces KD-trees give no performance advantage over naive O(N) linear search.
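To make the data structure concrete, here is a minimal KD-tree sketch in Python. It is a textbook variant that stores one sample in every node, not only in the leaves as the DGM implementation does, and it sorts at every level, so building costs somewhat more than O(N · logN). All names (`build`, `knn_search`, `k_nearest`) are assumptions for this example.

```python
import heapq
import math
import random

def build(samples, depth=0):
    """Build a KD-tree from (point, label) pairs, splitting at the median."""
    if not samples:
        return None
    axis = depth % len(samples[0][0])          # cycle through the axes
    samples = sorted(samples, key=lambda s: s[0][axis])
    mid = len(samples) // 2
    point, label = samples[mid]
    return (point, label, axis,
            build(samples[:mid], depth + 1),
            build(samples[mid + 1:], depth + 1))

def knn_search(node, y, k, heap):
    """Collect the k nearest samples in a max-heap of (-distance, point, label)."""
    if node is None:
        return
    point, label, axis, left, right = node
    d = math.dist(point, y)
    if len(heap) < k:
        heapq.heappush(heap, (-d, point, label))
    elif d < -heap[0][0]:                      # closer than the current k-th neighbour
        heapq.heapreplace(heap, (-d, point, label))
    diff = y[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn_search(near, y, k, heap)
    # The far side can only help if the splitting plane is closer than the
    # current k-th neighbour (or if fewer than k neighbours were found so far).
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn_search(far, y, k, heap)

def k_nearest(tree, y, k):
    """Return the k nearest neighbours of y as sorted (distance, label) pairs."""
    heap = []
    knn_search(tree, y, k, heap)
    return sorted((-neg_d, label) for neg_d, _, label in heap)

# Hypothetical data: 200 random 2-dimensional samples with 3 class labels.
random.seed(0)
samples = [((random.random(), random.random()), i % 3) for i in range(200)]
tree = build(samples)
query = (0.5, 0.5)
found = k_nearest(tree, query, k=5)
brute = sorted((math.dist(p, query), lab) for p, lab in samples)[:5]
```

The pruning test in `knn_search` is what lets the tree skip most of the dataset in low dimensions. As the dimensionality grows the test succeeds less often, which is consistent with the loss of efficiency noted above.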

In order to evaluate the performance of our KNN model, we perform a number of experiments. In the 2r-KNN, 4r-KNN, 8r-KNN, 16r-KNN and 32r-KNN models, only the nearest neighbors enclosed between two spheroids of radii r and 2r (4r, 8r, 16r and 32r, respectively) are taken into account. In the ∞r-KNN experiment all K neighbors are considered. Finally, the KNN experiment, listed as CvKNN, is the OpenCV implementation of KNN based on linear search. The overall accuracies and the timings for all 7 experiments are given in the table below:

	2r-KNN	4r-KNN	8r-KNN	16r-KNN	32r-KNN	∞r-KNN	CvKNN
Training:	4659 sec	4659 sec	4659 sec	4659 sec	4659 sec	4659 sec	102 sec
Classification:	8.3 sec	22.2 sec	52.8 sec	97.2 sec	134.9 sec	216.1 sec	45.3 sec
Accuracy:	81.39 %	81.65 %	81.97 %	82.11 %	82.33 %	82.42 %	82.36 %

Accuracies and timings measured on an Intel® Core™ i7-4820K CPU at 3.70 GHz for training on 1016 scenes and classification of 1 scene.

Our 2r-KNN model gives almost the same overall accuracy as the reference KNN model, but needs almost 5.5 times less time. The training of the xr-KNN models, which includes the building of the KD-tree, takes 78 minutes, which is much slower than the 1.7 minutes for KNN training. However, in practical applications the training is performed only once and can be done offline, whereas the classification time is more critical for the performance of the whole classification engine. In the table above we can also observe an almost linear increase of the classification time as the outer spheroid radius increases to 4r, 8r, etc. The figure below shows the classification results for the experiments 2r-KNN to ∞r-KNN.

Figure panels: 2r-KNN, 4r-KNN, 8r-KNN, 16r-KNN, 32r-KNN, ∞r-KNN.
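The ratios quoted in the evaluation can be re-derived from the table with a few lines of arithmetic. The numbers are copied from the table; the variable names are chosen for this example only.

```python
# Classification and training times in seconds, copied from the table.
classification_sec = {"2r-KNN": 8.3, "4r-KNN": 22.2, "8r-KNN": 52.8,
                      "16r-KNN": 97.2, "32r-KNN": 134.9, "inf-r-KNN": 216.1,
                      "CvKNN": 45.3}
training_sec = {"xr-KNN": 4659, "CvKNN": 102}
accuracy_pct = {"2r-KNN": 81.39, "CvKNN": 82.36}

# How many times faster 2r-KNN classifies compared with the OpenCV reference.
speedup = classification_sec["CvKNN"] / classification_sec["2r-KNN"]

# Training times converted to minutes.
kd_training_min = training_sec["xr-KNN"] / 60
cv_training_min = training_sec["CvKNN"] / 60

# Accuracy given up by 2r-KNN relative to the reference, in percentage points.
accuracy_gap = accuracy_pct["CvKNN"] - accuracy_pct["2r-KNN"]

print(f"speed-up: {speedup:.2f}x, training: {kd_training_min:.1f} vs "
      f"{cv_training_min:.1f} min, accuracy gap: {accuracy_gap:.2f} points")
```

The speed-up of roughly 5.5 therefore comes at the cost of slightly under one percentage point of overall accuracy.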

October 22 2017

Training Statistical Models

Sergey Kosov | Article | classification, DGM, discriminative, generative, unary model

There is a wide variety of statistical models which may be applied to semantic segmentation tasks. Let us now illustrate their impact on the computation of the label maps. For this purpose we use the synthetic Green Field dataset with 3 classes, described by two features. If we quantize all the features to 8 bits, we can map the whole dataset to the 2-dimensional 256 x 256 feature space. If we accumulate the sample points in such a representation, the result will correspond to the probability densities. For the visualization we mark these densities, belonging to different classes, with three different colors: red, green and blue (see Figure below).
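The accumulation step can be sketched as a per-class 256 x 256 histogram. The sample values below are hypothetical and are not taken from the Green Field dataset; the helper names `quantize`, `accumulate` and `density`, as well as the assumed feature range [0, 1], are illustrative.

```python
def quantize(value, lo=0.0, hi=1.0):
    """Map a real feature value in [lo, hi] to an 8-bit integer in 0..255."""
    q = int((value - lo) / (hi - lo) * 255 + 0.5)
    return max(0, min(255, q))  # clamp values that fall outside the range

def accumulate(samples, n_classes):
    """Count samples per class on a 256 x 256 grid; returns hist[c][f1][f2]."""
    hist = [[[0] * 256 for _ in range(256)] for _ in range(n_classes)]
    for (f1, f2), c in samples:
        hist[c][quantize(f1)][quantize(f2)] += 1
    return hist

def density(hist, c, q1, q2):
    """Empirical probability of the cell (q1, q2) for class c."""
    total = sum(sum(row) for row in hist[c])
    return hist[c][q1][q2] / total if total else 0.0

# Hypothetical samples: ((feature 1, feature 2), class index).
samples = [((0.0, 0.0), 0), ((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.5, 0.5), 2)]
hist = accumulate(samples, n_classes=3)
d0 = density(hist, 0, 0, 0)  # both class-0 samples fall into cell (0, 0)
```

For a visualization in the spirit of the text, the three class histograms could be written to the red, green and blue channels of a 256 x 256 image.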

Figure panels: the original distributions of 160'000 samples from the dataset; Naïve Bayes model; Gaussian model (the distribution is approximated with a single Gaussian per class); Sequential Gaussian Mixture Model; Gaussian Mixture Model, estimated with the help of the Expectation-Maximization algorithm; k-Nearest Neighbors classifier; Support Vector Machines classifier; Random Forest classifier; Artificial Neural Networks classifier.

As we can observe from the Figure above, the generative models (Bayes and Gaussian mixtures) try to reproduce the original distributions. In order to do this precisely, a method needs to remember all the samples from the Green Field dataset: 160'000 parameters. Or, in general, restricting ourselves to 8-bit features, a method needs to remember k·256^m values, where k is the number of categories and m is the number of features. The main idea of the generative models is to rebuild the original distribution using far fewer parameters and therefore to generalize the model to samples that were not observed during training. The Bayes model approximates the distribution using only k·256·m parameters, and the Gaussian mixture model k·G·m·(m + 1) parameters, where G is the number of Gaussians in the mixture.
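To get a feeling for these parameter counts, the three formulas can be evaluated for the setting of this article (k = 3 classes, m = 2 features). The number of Gaussians G = 16 is a hypothetical value chosen only for illustration; the text does not state which G was used.

```python
def full_table_params(k, m):
    """Values needed to store the full 8-bit distributions: k * 256^m."""
    return k * 256 ** m

def bayes_params(k, m):
    """Parameters of the Bayes model: k * 256 * m."""
    return k * 256 * m

def gmm_params(k, m, G):
    """Parameters of the Gaussian mixture model: k * G * m * (m + 1)."""
    return k * G * m * (m + 1)

k, m = 3, 2   # classes and features, as in the text
G = 16        # hypothetical number of Gaussians per mixture

n_full = full_table_params(k, m)   # 3 * 65536
n_bayes = bayes_params(k, m)       # 3 * 256 * 2
n_gmm = gmm_params(k, m, G)        # 3 * 16 * 2 * 3
print(n_full, n_bayes, n_gmm)
```

Under these assumptions the full table needs 196608 values, the Bayes model 1536 and the mixture model 288.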

As opposed to the generative models, the discriminative models (Neural Networks, Random Forests, Support Vector Machines and k-Nearest Neighbors) do not approximate the original distributions, but provide direct predictions for all testing samples. This grants the discriminative models more generalization power: in the areas where hardly any training samples were met (bottom-left and top-right corners of the initial distribution image in the Figure), all the generative models show black areas with almost zero potentials, while all the discriminative models show a high confidence about the class labels for these areas.
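The contrast can be reproduced on a toy example. The sketch below fits a single Gaussian per class (a generative model) and compares its densities with KNN vote fractions K_l / K (a discriminative model) at a point far from all training data. The one-dimensional data, the query point and K = 3 are hypothetical choices for illustration.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-dimensional Gaussian with mean mu and std. deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical 1-dimensional training data for two classes.
data = {"a": [1.0, 1.2, 0.8, 1.1, 0.9], "b": [3.0, 3.2, 2.8, 3.1, 2.9]}
x_far = 10.0  # a query point far away from every training sample

# Generative: one Gaussian per class, evaluated at the query point.
generative = {c: gaussian_pdf(x_far, statistics.mean(v), statistics.stdev(v))
              for c, v in data.items()}

# Discriminative: KNN vote fractions K_l / K among the K = 3 nearest samples.
K = 3
neighbours = sorted((abs(x - x_far), c) for c, v in data.items() for x in v)[:K]
discriminative = {c: sum(1 for _, lab in neighbours if lab == c) / K for c in data}
```

Both Gaussian densities are practically zero at the query point, while the KNN potentials assign it to class "b" with full confidence, which mirrors the black areas and the confident areas described above.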