In this blog post, we’ll take a look at our paper “Meshed Up: Learnt Error Correction in 3D Reconstructions”, which will be presented at ICRA 2018, Brisbane, Australia. [arXiv]
A lot of our work looks at how to build better maps. There are many ways in which maps can be represented, for example:
- a graph of images related by the transforms between them, like in Dub4 [YouTube];
- pointclouds with or without colours, used for localisation [pdf];
- dense 3D models (meshes), like in BOR2G [YouTube].
Building large-scale dense reconstructions from a moving platform (such as a surveying vehicle or a self-driving car) is very hard. The biggest problem at scale is that there often simply isn’t enough data to create detailed models. Furthermore, sensors that operate at such scales tend to be noisier than those designed for small-scale scenarios: an RGB-D camera produces very good depth-maps but has a range of only about 4 m, while a stereo camera has a range an order of magnitude higher but produces noisier depth-maps. The scans obtained from lidars have less noise than depth-maps, but are a lot sparser. Another challenge is that you often can’t get close enough to things, or look at them from multiple angles, while driving down the street in an autonomous car, so you have to be able to deal with missing data.
BOR2G (Tanner et al. 2015, [pdf]) does a great job at handling these kinds of scenarios, and it is able to integrate 3D data from many different sources into the same model. Below is an example reconstruction of Broad Street in Oxford that was built with BOR2G.
While it’s hard to infer some of the larger missing surfaces in the above reconstruction (such as the roof of the Sheldonian Theatre in the middle of the image), there are smaller, obvious errors. A human with no particular knowledge of the scene, but with a bit of common sense (knowing that buildings and roads don’t normally have holes in them, that cars come in a roughly known set of shapes, etc.), could suggest fixes to this reconstruction.
In our paper, we present a method to learn just that: what do good reconstructions look like? The main idea is to use a convolutional neural network (CNN) to learn the difference between high-quality and low-quality models. That way, after a few expensive, detailed surveys, you can start using cheaper sensors to create better reconstructions than would otherwise be possible.
To train the CNN, we start with two BOR2G reconstructions of the same scene: a detailed, high-quality one and one of lower quality. For example, in our work we use laser and stereo-camera reconstructions of the KITTI-VO sequences as the high-quality and low-quality reconstructions, respectively. We virtually “fly” through each reconstruction and generate pairs of 2D feature images (depth-maps, normals, etc.). We train our CNN to predict the residual error of the depth-map from the low-quality reconstruction with respect to the high-quality one. At run time, we subtract the predicted error from the depth-map to obtain a better one, which can then be used to build a new, better reconstruction.
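The residual formulation above can be sketched in a few lines. This is a minimal illustration, not the paper’s actual pipeline: the function names are our own, the CNN is omitted entirely, and we simply pretend the network’s prediction is perfect to show how the training target and the run-time correction relate.

```python
import numpy as np

def residual_target(d_low, d_high):
    """Training target: the per-pixel residual error of the low-quality
    depth-map with respect to the high-quality one."""
    return d_low - d_high

def correct_depth(d_low, predicted_residual):
    """Run-time correction: subtract the predicted residual from the
    low-quality depth-map to obtain an improved depth-map."""
    return d_low - predicted_residual

# Toy example (values are made up): a 2x2 depth-map where the cheap
# sensor overestimates every depth by a constant 0.5 m.
d_high = np.array([[4.0, 4.2],
                   [3.9, 4.1]])
d_low = d_high + 0.5

r = residual_target(d_low, d_high)   # what the CNN is trained to predict
d_corrected = correct_depth(d_low, r)  # a perfect prediction recovers d_high
```

In practice the residual is predicted by the CNN from rendered feature images (depth-maps, normals, etc.), so the correction is only as good as the network’s estimate; the point of the sketch is just the subtract-the-residual structure.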
The following video gives an overview of our work, and more technical details can be found in the paper.