Accurate scene understanding is paramount to the deployment of autonomous vehicles in real-world traffic. They need to perceive and fully understand their environment to accomplish their navigation tasks in a natural and safe manner.
To accomplish this, we recently introduced a hierarchical framework for describing urban scenes [1]. These scene graphs combine the generalisation power of deep networks with domain knowledge and probabilistic reasoning. They can be used for decision making and planning, while providing the explanations that deep learning frameworks so often fail to give.
Scene graphs are built up from a top-down view of the environment, since it is more natural to reason about the road from that perspective. From a bird’s-eye perspective, objects have the same (pixel) shape and size everywhere in the image. Our scene graphs take detected road markings and curbs in this perspective as an input. However, we cannot assume that every car will have a drone flying above it in the future, so we need to acquire this top-down view from the cameras mounted on the vehicle.
The transformation from front-facing images towards bird’s-eye view is commonly referred to as Inverse Perspective Mapping (IPM). IPM takes the frontal view as input, applies a homography transformation, and produces a top-down view of the scene by mapping the pixels to a different 2D-coordinate frame. In practice, this works well in the immediate proximity of the vehicle (assuming the road surface is planar). However, the geometric properties of objects in the distance are affected unnaturally by this transformation due to the limited resolution of the camera.
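To make the homography step concrete, here is a minimal numpy sketch of how a planar IPM maps pixels between the two coordinate frames, and why distant rows get stretched. The function name and the example matrix values are illustrative only, not the calibration used in the actual system:

```python
import numpy as np

def apply_homography(H, pts):
    """Map Nx2 pixel coordinates through a 3x3 homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # perspective divide

# A toy homography whose third row depends on the image row y: pixels
# closer to the horizon receive a smaller divisor, so they are stretched
# over a much larger ground area than pixels near the vehicle.
H = np.array([[1.0, 0.0,   0.0],
              [0.0, 1.0,   0.0],
              [0.0, -0.01, 1.0]])

near, far = np.array([[100.0, 50.0]]), np.array([[100.0, 90.0]])
print(apply_homography(H, near))  # a 40 px gap in the image ...
print(apply_homography(H, far))   # ... spans 800 px on the ground plane
```

This perspective divide is exactly why the near field looks fine while a handful of distant pixels get smeared over a large top-down region.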
Unfortunately, this effect has direct implications for the accuracy of our detected road markings and, consequently, for the quality of our scene description. For example, the image below shows that the zigzag markings at further distance are stretched and blurred to such an extent that they can no longer be detected by a deep network trained to detect road markings. Crucially, these markings indicate an upcoming pedestrian junction, so small transformation errors can lead to significant safety issues.
Adversarial Learning for Improved IPM
Luckily, the transformation to perfect (think Google Maps, GTA, etc.) IPM in which the same object does indeed have equivalent (pixel) shape and size everywhere in the top-down view is not a random transformation. It is a highly-complex, non-linear mapping, but this does mean that we should be able to learn it in a generative deep learning framework.
Recently, we presented such a framework, which acquires what we call boosted IPM, at the Intelligent Vehicles Symposium 2019 in Paris. It consists of three sequential stages, which are discussed in more detail in the following paragraphs:
- Generating a top-down view training label, which is a more accurate representation of the real world than the homography IPM, by using a sequence of images and visual odometry in a self-supervised way.
- Training a special type of generator (Incremental Spatial Transformer GAN) which can learn a mapping between the front-facing and top-down view.
- Deploying the trained generator in real time to retrieve boosted IPM for the benefit of scene understanding tasks.
Generating Boosted IPM Labels
We generate an accurate top-down view label from a sequence of images and visual odometry.
More specifically, we step through the sequence of images and map every image into bird’s-eye view using the homography transformation (acquired from the sensor calibrations). We exploit the fact that this transformation is fairly accurate in the immediate proximity of the vehicle and stitch the top-down pixels into the right place of the label using the localisation retrieved from visual odometry. As we “drive” (i.e. step through the images) further into the distance, we get a much sharper view of what is going on there, and thereby generate a top-down view that is closer to the ground truth than the homography mapping. This label is then paired with the first frontal image of the sequence to create a training pair.

The Oxford RobotCar dataset contains traversals of many different traffic environments under varying weather and lighting conditions, so we are able to generate thousands of boosted IPM labels in a self-supervised manner, as shown below. The initial frame that is paired with the finished stitched label is displayed in the red border.

However, this stitching technique does have consequences for the quality of the labels. Firstly, during nighttime it is hard to stitch labels without including a small part of the headlights of the ego vehicle. Secondly, and more importantly, dynamic objects have moved by the time we reach the position at which they appeared in the initial frame. Consequently, dynamic objects appear at different places in the label than you would expect given the initial frame. This means that we lose the one-to-one relationship between the pixels of the front-facing image and the label. Therefore, we refrain from using a pixel-wise loss during the training of the GAN.
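The stitching loop above can be sketched as follows. This is a simplified illustration, not the actual pipeline: it assumes the ego-motion from visual odometry has already been reduced to integer (row, column) offsets on the label canvas, whereas a real implementation would warp each patch by the full vehicle pose. The function and variable names are hypothetical:

```python
import numpy as np

def stitch_label(patches, offsets, canvas_shape):
    """Accumulate trusted near-field IPM strips into one top-down label.

    patches      -- list of HxW arrays: the near-field strip of each frame's
                    homography IPM, where the flat-ground assumption holds
    offsets      -- list of (row, col) canvas positions of each strip,
                    derived from visual odometry
    canvas_shape -- shape of the final stitched label
    """
    canvas = np.zeros(canvas_shape)
    count = np.zeros(canvas_shape)
    for patch, (r, c) in zip(patches, offsets):
        h, w = patch.shape
        canvas[r:r + h, c:c + w] += patch   # paste strip at its odometry pose
        count[r:r + h, c:c + w] += 1        # track overlap for averaging
    # average overlapping strips; leave never-covered pixels at zero
    return np.divide(canvas, count, out=np.zeros_like(canvas), where=count > 0)
```

Averaging in the overlap regions is one simple way to blend consecutive strips; the key idea is that every part of the label was, at some point in the sequence, observed from close range.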
Training for Boosted IPM
The task of transforming the frontal view into a top-down view can be seen as an image-to-image translation task. A state-of-the-art solution for this is the Pix2PixHD framework [2], which is able to transform semantic labels into RGB images. This is relatively easy nowadays as long as the two image domains share the same perspective. However, for our task this assumption does not hold, and we therefore introduce a different version of the generator: the Incremental Spatial Transformer GAN.
We start with the generator of the Pix2PixHD framework, but we alter the bottleneck of the network. There, we add a series of incremental spatial transformers, each of which rotates the feature maps down by a few degrees until we reach IPM. Because these transformations slightly blur and stretch the feature maps, we place a ResNet block immediately behind each of them to sharpen the features. This process is shown below as a conceptual visualisation using just 3 feature maps (RGB); the actual generator contains many more feature maps at its bottleneck.
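The geometric intuition behind the incremental transformers can be illustrated with plain numpy: a homography induced by a pure pitch rotation factors exactly into N smaller pitch steps, so a chain of small warps can in principle compose into the full front-to-top-down view change, while the interleaved ResNet blocks re-sharpen the features after each step. The intrinsics below are made-up values for illustration:

```python
import numpy as np

def pitch_homography(theta, f=500.0, cx=320.0, cy=240.0):
    """Homography induced by pitching the camera by theta about its x-axis,
    for an assumed pinhole camera: H = K @ Rx(theta) @ K^-1."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    c, s = np.cos(theta), np.sin(theta)
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, c, -s],
                   [0.0, s, c]])
    return K @ Rx @ np.linalg.inv(K)

# Composing N incremental warps of theta/N recovers the single full warp,
# because rotations about the same axis compose additively.
theta, N = np.deg2rad(60.0), 6
H_full = pitch_homography(theta)
H_incremental = np.linalg.multi_dot([pitch_homography(theta / N)] * N)
print(np.allclose(H_full, H_incremental))  # True (up to floating point)
```

Each single large warp would severely blur distant features in one go; splitting it into small steps gives the network a chance to restore detail between warps.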
Now that we have generated boosted IPM labels and trained an altered network, we can deploy it in real time.
Front-facing camera images are fed into the generator, which outputs boosted IPM. Below are some videos that show the difference between (common) homography-based IPM and boosted IPM. Three important observations can be made:
- Boosted IPM contains sharper features (e.g. road markings) at further distance. This makes it easier to detect the road markings and thereby improves scene understanding tasks.
- Boosted IPM generally contains more homogeneous illumination, for instance in cases where the homography-based IPM is overexposed.
- Boosted IPM reveals the underlying road layout behind occlusions (e.g. dynamic objects). Because of the way that we stitch our labels, we get a view “behind” dynamic objects in our boosted IPM labels. This gives the network the ability to learn to remove dynamic objects and show the underlying road layout, which is again beneficial for scene understanding tasks.
The framework can also be deployed under more difficult weather and lighting conditions. Below we show results during nighttime. This is a difficult task, since there is a lot of artificial lighting and motion blur in the front-facing images. However, similarly to the dynamic object and overexposure removal during daytime, the network learns from the context of the scene what the road layout should look like underneath the headlights of the car. The results are not perfect, but they provide a first approach towards learned IPM even when the input data is far from ideal.
Improved Scene Understanding through Boosted IPM
Now that we have a way to retrieve boosted IPM in real time, let’s look back at the example we started with. Does boosted IPM actually resolve the issue of undetected road markings?
The same scene is displayed again below, this time as a video. We can see that boosted IPM (on the right) does indeed lead to sharper road markings at a distance. Consequently, the zigzag markings, indicating a pedestrian junction, are detected earlier and the vehicle can prepare for this scenario in time. Interestingly, boosted IPM also predicts the continuation of the double yellow boundaries on the right through the van, indicating that it is illegally parked there!
Looking at the generated scene graph (based on the detected road markings), there is a substantial amount of information (nodes inside the blue box) added to the description just by employing boosted IPM. From experiments, we have concluded that boosted IPM leads to richer and more robust scene understanding in urban environments.
[1] L. Kunze, T. Bruls, T. Suleymanov, and P. Newman, “Reading between the lanes: Road layout reconstruction from partially segmented scenes,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Nov 2018, pp. 401–408.
[2] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional GANs,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp. 8798–8807.
For more details and results, please find the accompanying paper here or watch the video below in high quality.