What’s the problem?
This is a blog post aimed at non-technical readers who want to understand the basics of localization. We will keep code to a minimum and instead walk through examples of how these systems work and what makes them tick.
Localization is an essential part of any autonomous vehicle. Today, a self-driving car uses a combination of AI-driven tools to work out where it is on a previously obtained map.
But how do you obtain a map? Using sensors, of course – all autonomous cars carry multiple cameras and lasers that record the environment around them. Think of the result as a collection of pictures, like the one on the left, or a collection of points in space (right). These pictures or point clouds are stored and processed on each vehicle’s on-board computers.
Okay, we have the map, but don’t we already have GPS to tell us where we are? Yes, but in addition to being fairly inaccurate, GPS is unreliable in covered areas. Have you ever tried using Google Maps in Venice, or in a dense forest? You’ll notice your position jumps from one path to the next.
Imagine you are in the middle of Spain, travelling from Barcelona to Madrid, without any “smart” navigation to guide you. You have just passed an ambiguous sign reading “Terraga” and you see three plausible right turns. Your friend sits next to you with a map open in her lap; she counts the right turns on the map, compares them with the streets she sees through her side window, works out where the car currently is, and then decides where you need to turn. In the maps above we want to do the same – determine exactly where we are – but in much more challenging scenarios: forests, deserts, glaciers, even other planets.
Since localizing in urban scenarios is relatively easy for a human being, we want to outline the challenges from the perspective of a Computer Vision algorithm running inside the self-driving vehicle. In forest environments, however, even humans struggle to recognise where they are without specific training.
Here’s a list of challenges:
- Appearance alteration: seasons change the appearance of vegetation.
- Occlusion: tree branches overlap, and even a bit of wind changes what is visible.
- Clutter: trees standing close to large bushes are easily mistaken for them because of their similar branch structure.
- Dynamic scenes: moving cars occlude highly recognisable landmarks.
- Viewpoint variance: a car looks different from behind and from the side.
In this post we focus only on laser (LiDAR) localization. Let’s dive into the methodology.
How do we solve it?
There are three prominent approaches to the localization problem.
The first relies on the Iterative Closest Point (ICP) algorithm. If you’re unsure what that is, here’s an example:
In the above image think of the black collection of points (point cloud) as the original static object. In reality this black point cloud is a portion of the map. The red point cloud is the currently observed cloud that undergoes transformation to precisely overlap with that portion of the map.
The method’s name comes from the fact that it tries to find, for each point in the red cloud, a corresponding point in the black cloud. This lets it determine exactly how much the red point cloud (the current observation) is offset from the black one (the map), measured as a rotation and a translation (distance) – and that offset gives the exact location in the map. In reality, however, places look very different between when the map was captured and now, so it is not always possible to associate similar objects.
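For readers who like to peek under the hood, here is a minimal 2D sketch of ICP in Python with NumPy. It is an illustration under simplifying assumptions (perfect overlap, brute-force nearest neighbours), not a production implementation:

```python
import numpy as np

def icp_2d(source, target, iterations=30):
    """Minimal 2D ICP sketch: align `source` points to `target` points.

    Returns a rotation R (2x2) and translation t (2,) such that
    source @ R.T + t approximately overlaps the target cloud.
    """
    R = np.eye(2)
    t = np.zeros(2)
    src = source.copy()
    for _ in range(iterations):
        # 1. Correspondence: for every source point, find the closest target point.
        dists = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        matched = target[dists.argmin(axis=1)]
        # 2. Best rigid transform for these pairs (Kabsch / SVD method).
        src_c = src - src.mean(axis=0)
        tgt_c = matched - matched.mean(axis=0)
        U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:  # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = matched.mean(axis=0) - src.mean(axis=0) @ R_step.T
        # 3. Apply the update and accumulate the overall offset.
        src = src @ R_step.T + t_step
        R = R_step @ R
        t = R_step @ t + t_step
    return R, t
```

The accumulated `R` and `t` are exactly the offset described above: how far the current observation must be rotated and shifted to land on the map.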
Instead of finding a correspondence for every point, we can be smarter and find correspondences only for the points that are “interesting”. These are called points of interest, or keypoints. For this to work, we first need to define what counts as interesting. Keypoints must satisfy two properties – distinctiveness and repeatability. This means that 1. they are easy to find in our data and 2. if we shift the point of view we can still find them. In an image these would be the corners of windows, doors, etc. An example is shown below.
Once we have these areas of interest, we can compare between them. In order to do that we “describe” the area around these keypoints. That simply means that we store mathematical properties for these points. For instance, the angles between the surfaces of three walls adjacent to each other comprising a corner. These are called local descriptors that essentially shrink the interest point into a collection of numbers (vector) containing different properties (features).
We can then match these interest points by comparing their feature vectors: two keypoints match when the difference between their properties is small.
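A tiny sketch makes this concrete. The descriptor names and numbers below are entirely made up for illustration; real descriptors have many more dimensions:

```python
import numpy as np

# Hypothetical toy descriptors: each keypoint is summarised by a small
# vector of features (all values here are invented for illustration).
map_descriptors = {
    "window_corner_A": np.array([0.91, 0.12, 0.40]),
    "door_corner_B":   np.array([0.15, 0.88, 0.33]),
    "wall_corner_C":   np.array([0.50, 0.49, 0.95]),
}

def match_keypoint(observed, database, max_distance=0.2):
    """Return the name of the closest database keypoint, or None.

    Two keypoints "match" when the difference between their feature
    vectors (Euclidean distance here) is small enough.
    """
    best_name, best_dist = None, np.inf
    for name, descriptor in database.items():
        dist = np.linalg.norm(observed - descriptor)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```

An observed descriptor close to one in the database, such as `[0.90, 0.14, 0.38]`, matches `"window_corner_A"`, while a vector far from everything returns `None`.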
Another option is to compare entire point clouds by describing each one globally. We extract mathematical properties – such as the volume of the point cloud, or the average height of all its points – and compare these across different point clouds to find the matching one.
The third way combines local and global descriptors. First, both the map and the currently observed cloud are segmented into objects – the equivalent of keypoints, or objects of interest in this case. Second, each object is “described” to retrieve its properties. These properties, along with each object’s position in the map, are stored in a database. The objects of the currently observed scene (source cloud) are then matched against the map (target map) in the database. An illustration of this approach is available in the following video:
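The segment-and-describe pipeline above can be sketched as follows. All object names, positions, and descriptor features are invented for illustration:

```python
import numpy as np

def describe_segment(points):
    """Toy per-object descriptor: the object's bounding-box size plus its
    point count. Real systems use much richer segment features."""
    extent = points.max(axis=0) - points.min(axis=0)
    return np.append(extent, len(points))

def match_segments(observed_segments, segment_db, max_distance=1.0):
    """Match each observed object against a database of map objects.

    `segment_db` maps an object id to (descriptor, position_in_map).
    Returns (observed_index, object_id, position_in_map) triples.
    """
    matches = []
    for i, segment in enumerate(observed_segments):
        desc = describe_segment(segment)
        obj_id, (db_desc, pos) = min(
            segment_db.items(),
            key=lambda item: np.linalg.norm(desc - item[1][0]))
        if np.linalg.norm(desc - db_desc) <= max_distance:
            matches.append((i, obj_id, pos))
    return matches
```

Each match ties an object seen right now to a known position in the map, and those tied positions are what ultimately pin down where the vehicle is.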