Can an autonomous agent navigate in a new environment without ever building an explicit map? For the task of PointGoal navigation ("Go to $\Delta x$, $\Delta y$") under idealized settings (no RGB-D or actuation noise, perfect GPS+Compass), the answer is a clear "yes": map-less neural models composed of task-agnostic components (CNNs and RNNs) and trained with large-scale reinforcement learning achieve 100% Success on a standard dataset (Gibson). However, for PointNav in a realistic setting (RGB-D and actuation noise, no GPS+Compass), this is an open question, and it is the one we tackle in this paper. The strongest published result for this task is 71.7% Success [1]. First, we identify the main (perhaps only) cause of the drop in performance: the absence of GPS+Compass. An agent with perfect GPS+Compass that faces RGB-D sensing and actuation noise achieves 99.8% Success (Gibson-v2 val). This suggests that (to paraphrase a meme) robust visual odometry is all we need for realistic PointNav; if we can achieve that, we can ignore the sensing and actuation noise. With that as our operating hypothesis, we develop human-annotation-free data-augmentation techniques to train neural models for visual odometry. Taken together with our other proposed methods, these advance the state of the art on the Habitat Realistic PointNav Challenge, improving SPL by 40% (relative), from 53 to 74, and Success by 31% (relative), from 71 to 94. While our approach does not saturate or "solve" this dataset, this strong improvement provides evidence consistent with the hypothesis that explicit mapping may not be necessary for navigation, even in a realistic setting.
[1] According to the PointNav benchmark of the Habitat Challenge 2020 (the challenge is held annually). A concurrent, as-yet-unpublished result has reported 91% Success on the 2021 benchmark, but we are unable to comment on the details because an associated report is not available.
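To make concrete how a visual odometry estimate can stand in for GPS+Compass, here is a minimal sketch of re-expressing the goal in the agent's own frame after each step. It is an illustration under stated assumptions, not the implementation used in this work: the function name `update_goal`, the 2D (planar) simplification, the axis convention (x forward, y to the left, counter-clockwise rotation positive), and the example step sizes are all hypothetical.

```python
import math


def update_goal(goal, egomotion):
    """Re-express a 2D goal in the agent's frame after one step.

    goal:      (x, y) of the goal in the agent's previous frame.
    egomotion: (dx, dy, dtheta), the estimated translation, expressed in the
               previous frame, and the counter-clockwise rotation in radians.
               In a realistic setting this would come from a visual odometry
               model (assumption: a planar rigid-body motion).
    """
    gx, gy = goal
    dx, dy, dtheta = egomotion
    # Move the origin to the agent's new position.
    rx, ry = gx - dx, gy - dy
    # Rotate by -dtheta to express the vector in the agent's new heading.
    c, s = math.cos(dtheta), math.sin(dtheta)
    return (c * rx + s * ry, -s * rx + c * ry)


# Illustrative values only: the goal starts 2 m straight ahead.
goal = (2.0, 0.0)
goal = update_goal(goal, (0.25, 0.0, 0.0))          # step forward by 0.25 m
goal = update_goal(goal, (0.0, 0.0, math.pi / 2))   # turn left by 90 degrees
```

Because each update composes with the previous one, any error in the egomotion estimate accumulates over the episode, which is why the accuracy of the odometry estimate matters so much in this setting.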
The agent is asked to navigate from the blue square to the green square. The color of each trajectory changes from dark to light over time (cv2.COLORMAP_WINTER for the agent's trajectory, cv2.COLORMAP_AUTUMN for the agent's estimate of its trajectory).
The navigation videos and top-down maps were generated in two separate agent runs, each with different samples of actuation and sensing noise, so the trajectory in a video may differ slightly from the one in the corresponding image.
To see the navigation metrics for episodes from the playlists above, open the video in a separate tab and check the video description.