The *K-nearest neighbours* (KNN) classifier is a type of instance-based learning, or lazy learning, where the function is approximated only locally and all computation is deferred until classification. The KNN approach is thus among the simplest of all discriminative approaches, yet it is still especially effective for low-dimensional feature spaces. In practice, however, applying the KNN model is problematic because of its slow performance on large datasets represented in high-dimensional feature spaces and for large numbers of neighbours *K*. In this article we address exactly this problem of the KNN model.

The input to the KNN algorithm consists of the *K* closest training samples in the feature space, and the output is a class label *l*. An observation (or testing sample) *y* is classified by a majority vote of its neighbours: it is assigned the class most common among its *K* nearest neighbours (see figure below, center). In the case *K = 1*, the class of the single nearest neighbour is simply assigned to the observation *y*.
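As a minimal illustration of this voting rule (a brute-force sketch with illustrative names such as `Sample` and `knnClassify`, not DGM's actual API), the majority vote over the *K* closest training samples might look like this:

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

// A toy 2-D training sample: feature vector + class label.
struct Sample { float x, y; int label; };

// Classify a test point by a majority vote among its K nearest
// training samples (brute force: sort the whole set by distance).
int knnClassify(const std::vector<Sample> &train, float tx, float ty, size_t K)
{
    std::vector<Sample> sorted(train);
    std::sort(sorted.begin(), sorted.end(),
        [tx, ty](const Sample &a, const Sample &b) {
            return std::hypot(a.x - tx, a.y - ty) < std::hypot(b.x - tx, b.y - ty);
        });

    // Count the votes of the K nearest neighbours.
    std::map<int, int> votes;
    for (size_t i = 0; i < K && i < sorted.size(); ++i)
        ++votes[sorted[i].label];

    // Return the most common class label among them.
    return std::max_element(votes.begin(), votes.end(),
        [](const auto &a, const auto &b) { return a.second < b.second; })->first;
}
```

For *K = 1* the loop degenerates to assigning the class of the single nearest neighbour, as described above.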

In order to estimate the potentials, we consider the class of every neighbour as a vote for the most likely class of the observation. If the number of neighbours having class *l* is *K _{l}*, we can define the probability of the association potentials as (see figure above, right):

*p(x = l* | *y) = K _{l} / K*

It can be useful to weight the contributions of the neighbours, so that nearer neighbours contribute more to the vote than more distant ones. For example, a common weighting scheme gives each neighbour a weight of *1 / r*, where *r* is the distance to the neighbour. For our weighting scheme we modify this idea as follows: let *r* be the Euclidean distance from the test sample to the nearest training sample in feature space, and *r _{i}* the Euclidean distance to every found neighbour. Then we can rewrite the previous equation with the weighting coefficients *r / r _{i}*:

*p(x = l* | *y) = (1 / K) · Σ _{i} (r / r _{i}) · 1 _{l}(x _{i})*

where *1 _{l}(x _{i})* means 1 if the class of the *i*-th training sample is *l*, and 0 otherwise.

The search algorithm usually aims to find exactly *K* nearest neighbours. However, it may happen that distant neighbours hardly affect the probability *p(x = l*|*y)*. For example, the nearest neighbour with *r _{i} = r* contributes the value *r / r = 1* to the sum, whereas a neighbour at distance *r _{i} = 2r* contributes only *1 / 2*, so very distant neighbours may be safely discarded.

The neighbors are taken from a set of objects for which the class is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity of the KNN algorithm is that it is sensitive to the local structure of the data.

Our implementation of the KNN model in the DGM C++ library is based on the KD-tree data structure, which is used to store points in *k*-dimensional space. The leaves of the KD-tree store feature vectors with the corresponding ground truth, and every such feature vector is stored in one and only one leaf. Tree nodes correspond to axis-aligned splits of the space. Each split divides the space and the dataset into two distinct parts. Subsequent splits from the root node down to one of the leaves remove parts of the dataset until only a small part of it (a single feature vector) is left.
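A minimal 2-D sketch of such a tree (illustrative names, not the DGM implementation): internal nodes hold axis-aligned splits, every leaf holds exactly one feature vector with its ground-truth label, and the search prunes half-spaces that cannot contain a closer point:

```cpp
#include <algorithm>
#include <cmath>
#include <memory>
#include <vector>

// A training sample: 2-D feature vector + ground-truth label.
struct Point { float f[2]; int label; };

// A node is either a split (axis + threshold) or a leaf holding one sample.
struct Node {
    int   axis = -1;                // -1 marks a leaf
    float thresh = 0;
    std::unique_ptr<Node> left, right;
    Point pt{};                     // valid only in a leaf
};

// Recursively split the dataset along alternating axes (median split)
// until a single feature vector is left.
std::unique_ptr<Node> build(std::vector<Point> pts, int depth = 0)
{
    auto node = std::make_unique<Node>();
    if (pts.size() == 1) { node->pt = pts[0]; return node; }
    int    axis = depth % 2;
    size_t mid  = pts.size() / 2;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
        [axis](const Point &a, const Point &b) { return a.f[axis] < b.f[axis]; });
    node->axis   = axis;
    node->thresh = pts[mid].f[axis];
    node->left   = build(std::vector<Point>(pts.begin(), pts.begin() + mid), depth + 1);
    node->right  = build(std::vector<Point>(pts.begin() + mid, pts.end()),   depth + 1);
    return node;
}

// Exact nearest-neighbour search with branch pruning.
void nearest(const Node *n, const float q[2], const Point *&best, float &bestD)
{
    if (n->axis < 0) {              // leaf: update the best candidate
        float d = std::hypot(n->pt.f[0] - q[0], n->pt.f[1] - q[1]);
        if (d < bestD) { bestD = d; best = &n->pt; }
        return;
    }
    float diff = q[n->axis] - n->thresh;
    const Node *first  = diff < 0 ? n->left.get()  : n->right.get();
    const Node *second = diff < 0 ? n->right.get() : n->left.get();
    nearest(first, q, best, bestD);
    if (std::fabs(diff) < bestD)    // the other half-space may hold a closer point
        nearest(second, q, best, bestD);
}
```

The same descend-then-backtrack search generalizes to collecting the *K* nearest leaves instead of a single best one.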

KD-trees allow us to perform *“K nearest neighbours of y”* searches efficiently. Considering the number of dimensions *k* fixed and the dataset size of *N* training samples, the time complexity for building a KD-tree is *O(N · log N)*, and for finding the *K* nearest neighbours it is close to *O(K · log N)*. However, its efficiency decreases as the dimensionality *k* grows, and in high-dimensional spaces KD-trees give no performance advantage over a naive *O(N)* linear search.

In order to evaluate the performance of our KNN model, we perform a number of experiments: the 2r-KNN, 4r-KNN, 8r-KNN, 16r-KNN and 32r-KNN models, in which only the nearest neighbours enclosed between two spheroids of radii *r* and *2r* (*4r*, *8r*, *16r* and *32r*, respectively) are taken into account. In the ∞r-KNN experiment all *K* neighbours were considered. Finally, the KNN experiment is the *OpenCV* implementation of KNN (**CvKNN**), based on linear search. The overall accuracies and timings for all 7 experiments are given in the table below:

| | 2r-KNN | 4r-KNN | 8r-KNN | 16r-KNN | 32r-KNN | ∞r-KNN | CvKNN |
|---|---|---|---|---|---|---|---|
| Training | 4659 sec | 4659 sec | 4659 sec | 4659 sec | 4659 sec | 4659 sec | 102 sec |
| Classification | 8.3 sec | 22.2 sec | 52.8 sec | 97.2 sec | 134.9 sec | 216.1 sec | 45.3 sec |
| Accuracy | 81.39 % | 81.65 % | 81.97 % | 82.11 % | 82.33 % | 82.42 % | 82.36 % |

Accuracies and timings on an Intel® Core™ i7-4820K CPU @ 3.70 GHz, for training on 1016 scenes and the classification of 1 scene.

Our 2r-KNN model gives almost the same overall accuracy as the reference KNN model, but needs almost 5.5 times less time. The training of the xr-KNN models, which includes building the KD-tree, takes 78 minutes, which is much slower than the 1.7 minutes for KNN training. However, in practical applications training is performed only once and can be done offline, whereas the classification time is more critical for the performance of the whole classification engine. In the table above we can also observe an almost linear increase of the classification time as the outer spheroid radius grows to 4r, 8r, *etc*. The figure below shows the classification results for the experiments 2r-KNN – ∞r-KNN.

The post Efficient K-Nearest Neighbours appeared first on Project X Research.

The post DGM library v.1.5.3 has been just released appeared first on Project X Research.

- OpenCV Artificial Neural Network
- OpenCV *k*-Nearest Neighbors
- OpenCV Support Vector Machine

and from now on uses unit testing based on the Google Test framework.


The post Training Statistical Models appeared first on Project X Research.

As we can observe from the Figure above, the generative models (Bayes and Gaussian mixtures) try to reproduce the original distributions. In order to do this precisely, a method needs to remember all the samples from the Green Field dataset – 160’000 parameters. Or, in general, restricting ourselves to 8-bit features, a method needs to remember *k·256^m* values, where *k* is the number of categories and *m* is the number of features. The main idea of the generative models is to rebuild the original distribution using far fewer parameters and therefore to generalize the model to samples that were not observed during training. The Bayes model approximates the distribution using only *k·256·m* parameters, and the Gaussian mixture model – *k·G·m·(m + 1)* parameters, where *G* is the number of Gaussians in the mixture.
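The parameter counts above can be checked with a few lines of code (a toy illustration; the function names are ours, not part of any library):

```cpp
#include <cstdint>

// Values a full joint histogram over m 8-bit features and k classes
// must store: k * 256^m.
std::uint64_t fullHistogram(std::uint64_t k, std::uint64_t m) {
    std::uint64_t n = k;
    for (std::uint64_t i = 0; i < m; ++i) n *= 256;
    return n;
}

// Bayes model (independent per-feature histograms): k * 256 * m.
std::uint64_t bayesParams(std::uint64_t k, std::uint64_t m) { return k * 256 * m; }

// Gaussian mixture model with G Gaussians: k * G * m * (m + 1).
std::uint64_t gmmParams(std::uint64_t k, std::uint64_t G, std::uint64_t m) {
    return k * G * m * (m + 1);
}
```

For example, with *k = 2* classes and *m = 2* features the full histogram already needs 131 072 values, while the Bayes model needs only 1024.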

As opposed to the generative models, the discriminative models (Neural Networks, Random Forests, Support Vector Machines and k-Nearest Neighbors) do not approximate the original distributions, but provide direct predictions for all testing samples. This grants the discriminative models more generalization power: in the areas where hardly any training sample was encountered (left bottom and right top corners of the initial distribution image on the Figure), all the generative models show black areas with almost zero potentials, while all the discriminative models show a high confidence about the class labels for these areas.


The post DGM library v.1.5.2 will support GPU computing appeared first on Project X Research.

The next version of the DGM library will make use of fast parallel computing on the *Graphics Processing Units* (GPU) of graphics cards supporting DirectX. The current version of the DGM library supports the *Parallel Patterns Library* (PPL), which allows for parallel multi-core computing on the *Central Processing Unit* (CPU). The *Accelerated Massive Parallelism* (AMP) library takes advantage of the data-parallel hardware that is commonly present as a GPU on a discrete graphics card. The internal mechanisms of the AMP library may decide where to execute the C++ code: on the CPU or on the GPU.

We have made a performance test using the task of sparse coding dictionary learning. The bottleneck of this function is the algorithm calculating the matrix product, which accounts for 90%–95% of the processing time, so parallelising this algorithm parallelises almost the entire function. Our implementation of the matrix product may be found here.

Our test was performed using two systems:

- Intel® Core i7-4820K @ 3.70 GHz + NVIDIA GeForce GTX 780
- Intel® Xeon® X5450 @ 3.00 GHz + NVIDIA GeForce 210

The speed-up when using the PPL and AMP libraries on the first system may be seen in the following figure:

Here, **1.00** corresponds to the time needed for 2 iterations of the training algorithm (2180 seconds). Accordingly, for AMP on the GPU this time is **15.21** times smaller (142 seconds).

The speed-up when using the PPL and AMP libraries on the second system may be seen in the figure:

Here, the AMP library decided to run the code on the CPU instead of the GPU, which led to a significant performance drop. Again, **1.00** corresponds to the time needed for 2 iterations of the training algorithm (4437 seconds on the X5450). Parallel computing on the CPU with the PPL library took **3.65** times less time, while parallel computing on the CPU with the AMP library took even more time – 4819 seconds.

On systems with powerful graphics cards, please build the DGM library with the **ENABLE_AMP** option in CMake. Otherwise, please use only the **ENABLE_PPL** option.
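Assuming a standard out-of-source CMake build (the option names come from the text above; the rest of the invocation is generic), this would look like:

```shell
# Configure DGM with GPU (AMP) support; on systems without a powerful
# graphics card, use -DENABLE_PPL=ON instead.
cmake -DENABLE_AMP=ON ..
cmake --build .
```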


The post hCell library v.1.1.1 has been just released appeared first on Project X Research.

which allows representing raster images via hexagonal picture elements instead of the classical square ones:
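hCell's internal representation is not shown here; as a generic sketch of how a raster-image point can be mapped to the hexagonal cell containing it, the standard axial-coordinate construction with cube rounding looks like this (illustrative code, pointy-top cells, not the hCell API):

```cpp
#include <cmath>

// Map a raster point (x, y) to the axial coordinates (q, r) of the
// pointy-top hexagonal cell of "radius" size that contains it.
void pixelToHex(float x, float y, float size, int &q, int &r)
{
    // fractional axial coordinates of the point
    float fq = (std::sqrt(3.0f) / 3.0f * x - 1.0f / 3.0f * y) / size;
    float fr = (2.0f / 3.0f * y) / size;
    float fs = -fq - fr;                    // cube constraint: q + r + s = 0

    // round each cube coordinate, then fix the one with the largest
    // rounding error so the constraint still holds
    float rq = std::round(fq), rr = std::round(fr), rs = std::round(fs);
    float dq = std::fabs(rq - fq), dr = std::fabs(rr - fr), ds = std::fabs(rs - fs);
    if (dq > dr && dq > ds) rq = -rr - rs;
    else if (dr > ds)       rr = -rq - rs;

    q = (int)rq;
    r = (int)rr;
}
```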

Click here for more details… http://research.project-10.de/hcell/


The post Hyperlapse Stabilization appeared first on Project X Research.

Hannover in Summer Motion from Sergey Kosov on Vimeo.


The post DGM library v.1.5.2 has been just released appeared first on Project X Research.


The post DGM library v.1.5.1 has been just released appeared first on Project X Research.


The post hCell library is now available on GitHub appeared first on Project X Research.

