With the development of sensors and data processing units, we are increasingly seeing new applications for mobile robots executing tasks that are repetitive, tedious or dangerous for humans. Nowadays, mobile robots might inspect sewers or chemical facilities, replace people in last-mile delivery, or provide security services. The number of possible applications is enormous. The capability central to all these jobs is the robot's ability to move and operate in its environment.
The autonomy stack might be called a system of systems, including localization, mapping, navigation, and planning, to name only a few. As we explained in our previous post, Simultaneous Localization and Mapping (SLAM) is a core component of such an autonomy stack because it enables the robot to see the world, localize itself in the environment, and build an internal representation of that world. However, for an average person (or even for a software developer), it might be troublesome to pick the algorithm that best suits a specific application. The SLAM literature is full of benchmarks and comparisons, which should be helpful for anyone interested in implementing such a system. Yet these methods are all large, independent software frameworks that may operate on different sensors and robotic platforms and expose many parameters to tune, which does not simplify things.
In the following post, we compare some vision-based SLAM methods, i.e., the ones that use cameras to localize the robot and map the environment. We reviewed the localization accuracy on the EuRoC dataset as reported in multiple papers. The chosen methods do not require sophisticated sensors such as Light Detection and Ranging (LiDAR) systems, only regular RGB cameras. In the comparison, we also preferred algorithms that utilize inertial measurements because inertial sensors are typically cheap yet powerful components of onboard systems. We intend to provide condensed knowledge about available SLAM systems with our own comments for software developers, so please do not treat the following post as a source of scientific facts. We hope our work will shorten the developer's path between researching the available methods and an actual implementation.
The EuRoC MAV dataset consists of eleven visual-inertial sequences collected onboard a Micro Aerial Vehicle. It contains stereo images captured with an Aptina MT9V034 sensor at 20 FPS, synchronized with IMU measurements from an ADIS16448 at 200 Hz, together with ground truth. The dataset is divided into two batches: the first provides millimeter-accuracy ground-truth positions obtained with a laser tracking system (Leica MS50 laser tracker) and is designed to evaluate visual-inertial localization algorithms; the second contains a 3D scan of the environment (Leica MS50 3D structure scan) and ground-truth poses recorded with a motion capture system (Vicon). All sequences include spatiotemporally aligned and raw measurements, camera-IMU extrinsic and camera intrinsic calibrations, and data for custom calibration.
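If you want to work with the dataset directly, the ground truth is distributed as plain CSV files with nanosecond timestamps, a position and an orientation quaternion per row. A minimal pure-Python loader sketch, assuming the standard EuRoC column layout, might look like this (the sample rows below are illustrative values, not copied from the dataset):

```python
import csv
import io

def load_euroc_groundtruth(csv_text):
    """Parse EuRoC-style ground-truth CSV into (timestamp_s, position, quaternion).

    Assumes the standard EuRoC layout: nanosecond timestamp, then position
    p_RS_R_{x,y,z} in meters, then orientation quaternion q_RS_{w,x,y,z}.
    """
    poses = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row[0].startswith("#"):
            continue  # skip the comment/header line
        t_ns = int(row[0])
        position = tuple(float(v) for v in row[1:4])
        quat_wxyz = tuple(float(v) for v in row[4:8])
        poses.append((t_ns * 1e-9, position, quat_wxyz))
    return poses

# Illustrative sample, formatted like the real files.
sample = """#timestamp, p_RS_R_x [m], p_RS_R_y [m], p_RS_R_z [m], q_RS_w [], q_RS_x [], q_RS_y [], q_RS_z []
1403636580838555648,4.688,-1.786,0.783,0.534,-0.153,-0.827,-0.082
1403636580843555328,4.689,-1.786,0.787,0.535,-0.153,-0.826,-0.082"""

poses = load_euroc_groundtruth(sample)
print(len(poses))   # → 2
print(poses[0][1])  # → (4.688, -1.786, 0.783)
```

In a real pipeline you would pass the contents of `state_groundtruth_estimate0/data.csv` from a sequence folder; the function name and tuple layout here are our own choices.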
RTAB-Map (Real-Time Appearance-Based Mapping) For many robotics engineers, this is the best starting point for VSLAM because it is already distributed as an open-source library in the Robot Operating System (ROS) - a popular platform for robotics researchers, engineers and enthusiasts. RTAB-Map is the oldest method in this post, with roots reaching back to 2013, yet it is still under active development by the open-source community. It can handle many visual data formats - monocular and stereo images, depth maps, or fish-eye cameras - all with IMU support. RTAB-Map is also an excellent alternative to localization and mapping methods that work in 2D only. Notably, the method is available for commercial use without a fee, as long as you do not use licensed third-party software.
ORB-SLAM3 This is the third version of the famous SLAM framework developed by the robotics group at the University of Zaragoza, led by Prof. Tardós. For keypoint localization, the method finds ORB (Oriented FAST and Rotated BRIEF) features in the stereo images and performs Bundle Adjustment to find the relative motion of the camera in 3D space. A map is built upon the visual features themselves. The framework is fast, reliable, and accurate, as confirmed by numerous benchmarks and existing industrial applications. It is free to use for non-commercial projects only. In the newest version, ORB-SLAM introduces support for inertial sensors, which pushes the method's robustness even further.
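ORB descriptors are binary strings compared by Hamming distance rather than Euclidean distance, which is what makes the matching step so fast. A toy brute-force matcher sketch, using short integer bitstrings instead of real 256-bit ORB descriptors (all names here are our own, not ORB-SLAM3's API):

```python
def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(d1 ^ d2).count("1")

def match_descriptors(query, train, max_distance=64):
    """Brute-force nearest-neighbour matching of binary descriptors,
    in the spirit of OpenCV's BFMatcher with the Hamming norm."""
    matches = []
    for qi, qd in enumerate(query):
        best_ti, best_dist = min(
            ((ti, hamming(qd, td)) for ti, td in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if best_dist <= max_distance:  # reject weak matches
            matches.append((qi, best_ti, best_dist))
    return matches

# Toy 8-bit "descriptors" for illustration; real ORB uses 256 bits.
query = [0b10110100, 0b01111000]
train = [0b10110110, 0b00001111]
print(match_descriptors(query, train, max_distance=2))  # → [(0, 0, 1)]
```

The second query descriptor finds no train descriptor within the distance threshold, so only one match survives - the same filtering idea that keeps outliers out of the subsequent Bundle Adjustment.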
Basalt This is an optimization-based system providing tools for visual-inertial calibration, odometry and mapping, plus a simulated environment for testing different system components. It uses a Non-Linear Factor Recovery (NFR) approach to estimate the most likely pose graph from non-linear measurements, performing keypoint-based bundle adjustment with inertial and short-term visual tracking through NFR.
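To give an intuition of what pose-graph estimation means (without reproducing Basalt's actual NFR machinery), here is a toy 1D pose-graph least-squares sketch: each measurement constrains the difference between two poses, and a loop-closure constraint pulls the drifting odometry chain back into agreement. Everything here, including the solver choice, is our own simplification:

```python
def optimize_pose_graph_1d(measurements, n_poses, iters=200):
    """Tiny 1D pose-graph least squares via Gauss-Seidel relaxation.

    measurements: list of (i, j, z), meaning x_j - x_i ≈ z.
    Pose 0 is fixed at 0 to anchor the graph.
    """
    x = [0.0] * n_poses
    for _ in range(iters):
        for k in range(1, n_poses):
            # Optimal x_k, holding all other poses fixed, is the mean
            # of the positions each constraint predicts for it.
            num, den = 0.0, 0.0
            for i, j, z in measurements:
                if j == k:
                    num += x[i] + z
                    den += 1.0
                elif i == k:
                    num += x[j] - z
                    den += 1.0
            if den:
                x[k] = num / den
    return x

# Odometry x0→x1 and x1→x2, plus a "loop closure" x0→x2 that disagrees
# slightly with the accumulated odometry - the optimizer splits the error.
meas = [(0, 1, 1.1), (1, 2, 0.9), (0, 2, 1.9)]
x = optimize_pose_graph_1d(meas, 3)
print([round(v, 3) for v in x])  # → [0.0, 1.067, 1.933]
```

Real systems such as Basalt solve the same kind of problem over full SE(3) poses with robust cost functions and sparse linear algebra; the relaxation loop above is only meant to show why conflicting constraints yield a compromise estimate.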
DROID-SLAM Unsurprisingly, deep learning methods have also found their place in the SLAM community, achieving state-of-the-art results. DROID-SLAM is an example of such a solution, with robustness and accuracy confirmed by independent benchmarks. The method recursively updates the agent's state using the well-known Bundle Adjustment approach; however, the hand-crafted calculations are replaced by a differentiable Dense Bundle Adjustment layer that outputs camera poses together with the most probable pixel-wise depth maps. The major drawback of the method is its hardware requirements: running the demo requires a GPU with at least 11 GB of memory, while training your own DROID-SLAM requires a GPU with at least 24 GB.
Kimera This method represents a specific group of techniques called metric-semantic visual-inertial SLAM. The semantic part is the exciting one, as it connects a typical computer-vision task, instance segmentation, with visual-inertial SLAM - in addition to geometric cues, Kimera also uses semantic information for the mapping assignment. The method was released as an open-source C++ library and works in real time on a CPU, while the ROS implementation further eases development. It is worth appreciating that the authors' code includes both the localization method with feature-based mapping and an entire mapping framework that outputs dense meshes.
OpenVSLAM is another ORB-feature-based implementation of graph SLAM. It supports many camera models, is licensed under BSD-2.0, and was developed mainly for practical applications and scalability rather than research. Currently, the main repository is archived due to the risk of copyright infringement with respect to ORB-SLAM, and possible legal actions are still under discussion. It's worth mentioning that IMU integration is not supported in this framework.
The following section shows localization errors on EuRoC as reported in the papers describing the researched methods. Both tables in this section report the mean Absolute Trajectory Error (ATE), which measures how far the estimated camera trajectory deviates from the reference trajectory. To perform such an analysis, it is typically necessary to associate the estimated trajectory with the ground truth using timestamps (which lets us find the correspondences between poses). All results are in meters.
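As a sketch of what such an evaluation does, the snippet below associates estimated poses with ground truth by nearest timestamp and computes the mean ATE over the matched pairs. Real evaluation tools also align the two trajectories with an SE(3)/Sim(3) fit first; this toy version assumes the frames are already aligned, and all names are our own:

```python
import math

def associate(est, gt, max_dt=0.02):
    """Match estimated poses to ground truth by nearest timestamp.

    est, gt: dicts mapping timestamp in seconds -> (x, y, z) position.
    Pairs farther apart in time than max_dt are discarded.
    """
    pairs = []
    for t_est in sorted(est):
        t_gt = min(gt, key=lambda t: abs(t - t_est))
        if abs(t_gt - t_est) <= max_dt:
            pairs.append((est[t_est], gt[t_gt]))
    return pairs

def mean_ate(pairs):
    """Mean Absolute Trajectory Error over associated position pairs.

    Assumes both trajectories are expressed in a common, aligned frame.
    """
    errors = [math.dist(p, q) for p, q in pairs]
    return sum(errors) / len(errors)

# Tiny made-up example: the second estimated pose is 10 cm off.
est = {0.00: (0.0, 0.0, 0.0), 0.10: (1.0, 0.1, 0.0)}
gt  = {0.01: (0.0, 0.0, 0.0), 0.11: (1.0, 0.0, 0.0)}
print(round(mean_ate(associate(est, gt)), 3))  # → 0.05
```

Note that the reported numbers in the papers come from the authors' own evaluation pipelines, which may differ in association thresholds and alignment details.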
The most accurate method on the industrial-facility part of the EuRoC dataset (Machine Hall 01 - 05) was DROID-SLAM, achieving an error of 0.027 m. The authors state that deep-learning-based SLAM methods are generally more robust, while classical ones are more accurate. That may be the case for Kimera, but DROID-SLAM appears robust and precise at the same time. However, it is the most computationally expensive algorithm in the comparison because it requires at least two RTX 3090 GPUs to run in real time, while training involves even more resources. These hardware requirements are the most significant limitation of DROID-SLAM and might be a deal-breaker for battery-powered onboard processing units.
The video shows the localization and feature-based mapping process of Basalt.
However, from the software developer's point of view, accuracy is not the only factor to consider before starting a SLAM project. Ease of use and the size of the community around a method might be among the most critical factors, especially for developers without a robotics background, since they determine how quickly the project can get off the ground. RTAB-Map and Basalt outperform the other methods in this respect because they work out of the box: the installation process is easy and automated, both are integrated into ROS and well documented, and their communities actively work on improvements. Still, we think RTAB-Map might be better suited for less experienced roboticists because of its larger community and more detailed documentation with many tutorials, while Basalt is a newer method that still needs some improvement in that area. A reasonable choice for a VSLAM project might also be ORB-SLAM3, considered one of the best SLAM systems in the industry. Its undeniable advantage is its track record - it has a confirmed history of successful real-world deployments and already works in numerous robotics applications, which significantly increases the probability that it will work again in another project. However, the code is distributed under a GPLv3 license, so if you need a closed-source solution, you will have to pay the authors a corresponding fee.
The table above shows the results of the tested SLAM methods on the EuRoC - Vicon Room (V) sequences. Again, DROID-SLAM outperformed the other methods, achieving an average ATE of 0.019 m - of course, at the highest computational cost. Surprisingly, the results reported for Kimera were the worst among the selected methods. However, the true value of that framework lies in its many components for metric-semantic SLAM rather than in pushing forward the already impressive benchmark results reported by other groups.
SLAM is one of the most challenging parts of autonomous robotics, and it is hard to compare these methods directly. We should think of SLAM methods more as large robotics frameworks with multiple setups available that require tedious parameter tuning (much like deep learning models) and that can work on different robotic platforms, with numerous camera types, different resolutions and different sources of odometry.
It is true that they all follow a similar structure and solve localization and mapping in similar ways, but the configuration space of these methods is enormous. That is why this post presents only a glimpse of the results your VSLAM might achieve. At the end of the day, remember that SLAM has to work accurately and robustly on your data in your environment, so sticking too closely to benchmark results can lead you astray. We aimed to shorten the path between research on VSLAM methods and an actual implementation, so if you are wondering which way suits you best, feel free to follow our tongue-in-cheek decision guide presented below.