Workshop Description and Motivation

One of the major success stories of Computer Vision over the 1990s and 2000s was the development of systems that can simultaneously build a map of the environment and localize a camera with respect to that environment. Within a batch framework this is usually known as structure from motion or multi-view reconstruction, and systems that can reconstruct city-scale (or even wider-scale) environments with accurate geometry are now common, whether from video or from a less well-ordered set of images. Within the robotics and real-time vision communities this problem is often referred to as visual SLAM (Simultaneous Localization and Mapping). The current state-of-the-art visual SLAM systems can reconstruct large areas either densely or semi-densely (i.e. with a depth estimate at all or many pixels) with high accuracy, in real-time, using just a single camera.

As impressive as these systems and algorithms are, they have no understanding of the scenes they observe: at best they provide a dense geometric point cloud, falling well short of the human visual system's ability to assimilate high-level visual information. In contrast to much of the work in multi-view analysis, those working with single images have made significant progress in applications such as segmentation, object recognition and place recognition, so that an image can be labelled with high-level designations of its content, explaining not just the geometry but also the semantic content of the image. Nevertheless, these algorithms are often slow, require offline learning that does not necessarily adapt or transfer between environments, and, importantly, have no temporal component.

Within the last few years a number of researchers in computer vision and robotics have recognised the benefits of applying this semantic-level analysis to image sequences acquired by a moving camera, proposing a shift from purely geometric reconstruction and mapping to semantic-level descriptions of scenes involving objects, surfaces, attributes and scene relations that together capture an understanding of the scene. This shift will enable recognition of long-term change through maps that are more versatile, informative and compact, and it is likely that advances in this area will yield the greatest gains in the quest for long-term autonomy.

This workshop will bring together researchers in multi-view geometry, scene understanding and robotic vision with the common goals of discussing and presenting the state of the art in the use of semantics in structure from motion and robotic vision, and of identifying the most fruitful and challenging directions for the development of the field. The workshop will comprise invited talks and a limited number of submitted talks, selected for presentation by the organising committee. Topics to be addressed will include:

Important dates


Invited speakers

The following are confirmed invited speakers:


The program below is tentative but indicative.

Time   Item
9:00   Welcome and Introduction
9:15   Invited talk: Martial Hebert, Semantically-referenced navigation
9:50   Invited talk: Raquel Urtasun, Exploiting the Web for Reconstruction, Recognition and Self-localization
10:25  Arsalan Mousavian and Jana Kosecka, Semantically Aware Bag-of-Words for Localization
10:40  Coffee Break
11:00  Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent and Roberto Cipolla, SynthCam: Semantic Understanding With Synthetic Indoor Scenes
11:15  Yinda Zhang, Shuran Song, Ping Tan and Jianxiong Xiao, PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding
11:30  Invited talk: John Leonard, Challenges for Life-Long Visual Mapping and Navigation
12:05  Invited talk: Ashutosh Saxena, RoboBrain: Large-Scale Knowledge Engine for Robots
13:40  Cancelled: Richard Newcombe, What do we hope to achieve with visual SLAM?
14:05  Fisher Yu, Jianxiong Xiao and Thomas Funkhouser, Semantic Alignment of LiDAR Data at City Scale
14:20  Hayko Riemenschneider, Andras Bodis-Szomoru, Andelo Martinovic, Julien Weissenberg and Luc Van Gool, What is needed for Multi-view Semantic Segmentation?
14:50  Coffee break
15:10  Jinglu Wang, Showei Li, Jongbo Liu, Honghui Zhang, Tian Fang, Siyu Zhu, Punze Zhang, Shengnan Cai and Long Quan, Semantic Segmentation of Large-scale Urban 3D data with Low Annotation Cost
15:25  Invited talk: Marc Pollefeys, Semantic 3D Reconstruction
16:00  Discussion / wrap-up