This blog post provides an overview of our paper “Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks” by Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner, which was published at the IEEE International Conference on Robotics and Automation (ICRA) 2017.

For a quick overview you can take a look at our video:

Why do we care?

In recent years, computer vision techniques for processing 2D images have undergone a revolution through the use of convolutional neural networks (CNNs). Building upon the availability of large datasets and fast GPU implementations, CNNs are dominating most of the popular benchmarks, such as the ImageNet challenge.

In robotics applications, 3D point clouds obtained from LIDAR sensors are commonly used to sense a robot’s environment. 3D data is highly informative. After all, we live in a 3D world and, unlike in images, depth information is directly available in point clouds. Nevertheless, with regard to practical applications, processing 3D point clouds with CNNs has not yet seen a breakthrough comparable to the one we have been experiencing in 2D vision.

What is the reason for this?

Under the hood, CNNs convolve small filter kernels spatially across the input data in a hierarchical fashion. For images, this corresponds to a convolution in two spatial dimensions. Supposing we discretise each point cloud into a 3D grid with a fixed resolution, we now need to convolve our filters across three spatial dimensions. For full-sized point cloud frames, say 60m-by-60m-by-5m and at a reasonable grid resolution that gives us a good enough accuracy, say 20cm, this leads to a prohibitive increase in computation time which renders a naive transfer of CNNs to 3D point clouds infeasible.
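To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It uses the example extents and resolution quoted above; the 3-by-3-by-3 kernel, the single filter, and the single input channel are assumptions made purely for illustration, and the 2D comparison simply drops the height axis.

```python
# Illustrative arithmetic only: the extents and resolution are the example
# figures from the text; the 3x3x3 kernel is an assumed filter size.
EXTENT_CM = (6000, 6000, 500)   # 60 m x 60 m x 5 m
RESOLUTION_CM = 20              # 20 cm grid cells


def grid_dims(extent_cm, resolution_cm):
    """Number of cells along each axis of the dense grid."""
    return tuple(e // resolution_cm for e in extent_cm)


def num_cells(dims):
    """Total number of cells in a grid with the given dimensions."""
    total = 1
    for d in dims:
        total *= d
    return total


dims_3d = grid_dims(EXTENT_CM, RESOLUTION_CM)
cells_3d = num_cells(dims_3d)
cells_2d = num_cells(dims_3d[:2])   # same footprint, height axis dropped

# Multiply-adds for one dense filter applied to one input channel.
macs_3d = cells_3d * 3 ** 3
macs_2d = cells_2d * 3 ** 2

print(dims_3d)             # (300, 300, 25)
print(cells_3d)            # 2250000
print(macs_3d // macs_2d)  # 75
```

Under these assumptions, a single dense 3D filter already costs 75 times as many multiply-adds as its 2D counterpart over the same footprint, before accounting for multiple filters, input channels, or layers.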

How can this be mitigated?

As opposed to dense images where each pixel has a value, 3D point clouds are inherently sparse as most physical space is empty. Consequently, when performing a dense convolution across sparse input data, most of the computation time is wasted by multiplying the convolutional filter weights with zeros. We therefore know that informative computation is only going to take place at the location of non-empty grid cells.

Previous work on Vote3D proposed a novel feature-centric voting algorithm to perform sparse convolutions across point clouds for sliding window detection with linear SVMs. Because a point cloud is discretised into a grid, a variable number of points can fall into a single grid cell. Rather than only using a binary occupancy value as the input to the model, a small feature vector representing the statistics of the points that fall into the cell is used instead.
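The discretisation step can be sketched as follows. This is a minimal illustration, not the feature extraction used in the paper: the per-point scalar `intensity`, the function names, and the particular statistics (an occupancy flag plus the mean and variance of the intensity) are all hypothetical stand-ins for "a small feature vector representing the statistics of the points".

```python
import math
from collections import defaultdict


def voxelise(points, resolution=0.2):
    """Group (x, y, z, intensity) points by the grid cell they fall into.

    Only occupied cells are stored, so the grid stays sparse.
    """
    cells = defaultdict(list)
    for x, y, z, intensity in points:
        key = (math.floor(x / resolution),
               math.floor(y / resolution),
               math.floor(z / resolution))
        cells[key].append(intensity)
    return cells


def cell_features(cells):
    """Summarise each occupied cell with a small, illustrative feature vector."""
    features = {}
    for key, values in cells.items():
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        features[key] = (1.0, mean, variance)  # (occupancy, mean, variance)
    return features


# Three hypothetical points: two share a cell, one lies about 1.3 m away.
points = [
    (0.05, 0.05, 0.05, 0.2),
    (0.15, 0.10, 0.05, 0.4),
    (1.30, 0.05, 0.05, 0.9),
]
features = cell_features(voxelise(points))
print(sorted(features))  # [(0, 0, 0), (6, 0, 0)]
```

Storing the grid as a dictionary keyed by cell index means that empty space costs nothing, which is the property the voting scheme relies on.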

A convolution can be reformulated as a feature-centric voting operation, as illustrated in this 2D example for a single feature map. Flipping the convolutional weights along each spatial dimension returns the corresponding voting weights. We only need to multiply the two non-zero input features by the voting weights. The feature of value “1” multiplied by the voting weights results in the top-left green square in the result. Similarly, the feature of value “0.5” results in the bottom-right green square in the result. Votes that fall into the same cell are added together. The computed result is mathematically equivalent to a traditional dense convolution. Note that we ignore the bias term here for simplicity.
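The equivalence can be checked with a small plain-Python sketch for a single 2D feature map. The grid size, the weights, and the positions of the two non-zero features are made up for illustration, and the dense version uses the zero-padded, stride-1 convolution that is common in CNNs; the bias is ignored, as in the example above.

```python
def dense_conv2d(grid, weights):
    """Zero-padded, stride-1 convolution (CNN convention) over a dense grid."""
    h, w = len(grid), len(grid[0])
    k = len(weights)
    c = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(k):
                for b in range(k):
                    y, x = i + a - c, j + b - c
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += weights[a][b] * grid[y][x]
    return out


def voting_conv2d(sparse, shape, weights):
    """Same result, computed by letting each non-zero feature cast votes."""
    h, w = shape
    k = len(weights)
    c = k // 2
    # Voting weights: the convolutional weights flipped along both axes.
    voting = [row[::-1] for row in weights[::-1]]
    out = [[0.0] * w for _ in range(h)]
    for (p, q), value in sparse.items():
        for a in range(k):
            for b in range(k):
                y, x = p + a - c, q + b - c
                if 0 <= y < h and 0 <= x < w:
                    out[y][x] += voting[a][b] * value  # votes accumulate
    return out


shape = (5, 5)
sparse = {(1, 1): 1.0, (3, 3): 0.5}           # only two non-zero features
weights = [[1.0, 2.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.5, 0.0, 3.0]]

grid = [[sparse.get((i, j), 0.0) for j in range(shape[1])]
        for i in range(shape[0])]

dense_result = dense_conv2d(grid, weights)
voting_result = voting_conv2d(sparse, shape, weights)
print(voting_result == dense_result)  # True
print(voting_result[2][2])            # 2.5, where the two sets of votes overlap
```

The dense version visits all 25 cells regardless of content, whereas the voting version only loops over the two occupied cells, so its cost scales with the number of non-zero features rather than with the size of the grid.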

How is Vote3Deep different?

Vote3Deep stacks several layers of feature-centric voting to build efficient, yet highly expressive CNNs. To the best of our knowledge, Vote3Deep is the first approach that processes full point cloud frames showing large scenes with CNNs natively in 3D at fast processing times. As opposed to our work, previous CNN-based methods for processing point clouds either first project the 3D points into a 2D map or only consider comparatively small 3D crops.

We showcase this on the popular KITTI Object Detection Benchmark where we outperform previous state-of-the-art methods at the time of writing by considerable margins, while still maintaining a detection speed of about one second per frame on a commercial laptop CPU.

After discretising a point cloud into a sparse 3D grid, we leverage sparse convolutions via voting to build efficient CNNs so that we can process full point clouds natively in 3D. Here, detection scores are predicted for cars (red), pedestrians (blue), and cyclists (magenta).

How does this work?

Each non-empty cell casts votes to several cells in the next layer, as specified by the kernel size of the filters in that layer. This means that sparsity is expected to decrease after each successive convolutional layer. To counteract this behaviour, we make several critical choices that maintain sparsity in the intermediate representations, so that efficient sparse convolutions via voting can be leveraged throughout the entire CNN stack.

In particular, we make use of ReLU non-linearities in between layers and restrict filter biases to be non-positive, as a single positive bias value would completely eliminate sparsity. Moreover, we propose an L1 sparsity penalty on the post-ReLU activations in order to encourage the model to learn intermediate representations with higher sparsity by discarding features that are not truly relevant for the task to be solved. We demonstrate that this improves the detection speed by almost 40% in some cases, at a minimal trade-off in detection accuracy.
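The effect of the bias sign is easy to see in a toy sketch. The one-dimensional grid of ten cells, the vote values, the bias values, and the function names below are all hypothetical; the point is only that cells which received no votes stay at zero after the ReLU when the bias is non-positive, and that the L1 penalty is simply a weighted sum of the absolute post-ReLU activations.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def apply_bias_and_relu(votes, num_cells, bias):
    """Return the non-zero post-ReLU activations as {cell_index: value}.

    `votes` holds the summed votes per cell; cells missing from it received
    no votes. All cells are visited here purely to make the effect of the
    bias visible; a sparse implementation would only visit voted cells.
    """
    active = {}
    for cell in range(num_cells):
        value = relu(votes.get(cell, 0.0) + bias)
        if value > 0.0:
            active[cell] = value
    return active


def l1_penalty(active, weight):
    """L1 penalty on the post-ReLU activations, scaled by a chosen weight."""
    return weight * sum(abs(v) for v in active.values())


votes = {2: 0.75, 3: 0.125, 7: 1.25}   # only three of ten cells got votes

sparse_out = apply_bias_and_relu(votes, num_cells=10, bias=-0.25)
dense_out = apply_bias_and_relu(votes, num_cells=10, bias=0.25)

print(sparse_out)                          # {2: 0.5, 7: 1.0}
print(len(dense_out))                      # 10
print(l1_penalty(sparse_out, weight=1.0))  # 1.5
```

With the negative bias, the weakly supported cell 3 is switched off and the empty cells stay empty; with the positive bias, every one of the ten cells becomes non-zero and the next layer would have to process all of them.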

More detailed information as well as ideas for future research directions can be found in our paper.

This illustration shows a network with three convolutional layers. The input (green) is obtained by discretising a point cloud into a sparse 3D grid where each occupied cell holds a feature vector of length 6. In the first two layers, eight convolutional filters with a kernel size of 3-by-3-by-3, each followed by a ReLU layer, compute the intermediate representations (blue). The kernel size of the final layer is set so as to achieve the desired total receptive field size for detecting a certain class, e.g. a larger total receptive field is needed for detecting cars as opposed to pedestrians due to the difference in physical size.
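How the final kernel size follows from a desired receptive field can be sketched with a little arithmetic. The sketch assumes stride-1 convolutions with no pooling or dilation, which the text above does not state explicitly; the target of 11 cells and the 20 cm cell size are illustrative values, not the settings used in the paper.

```python
def receptive_field(kernel_sizes):
    """Receptive field (in cells, per axis) of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def final_kernel_size(target_rf, earlier_kernels):
    """Kernel size the last layer needs so the stack reaches `target_rf`."""
    k = target_rf - receptive_field(earlier_kernels) + 1
    if k < 1:
        raise ValueError("earlier layers already exceed the target")
    return k


CELL_SIZE_CM = 20  # illustrative grid resolution

# Two 3x3x3 layers followed by a third 3x3x3 layer.
cells = receptive_field([3, 3, 3])
print(cells, cells * CELL_SIZE_CM)    # 7 140

# Hypothetical target of 11 cells along one axis.
print(final_kernel_size(11, [3, 3]))  # 7
```

Because the first two layers are fixed, only the last kernel has to change between classes, and it can be sized independently along each axis to match the physical extent of the object.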