MIT Marine Robotics Group

SEO-SLAM

MIT CSAIL
{jungseok,ranch0,jleonard}@mit.edu


A sample semantic mapping result from SEO-SLAM. The plot shows a 3D map of the scene in the RGB image, including landmark positions and the camera trajectory; numbers and colors correspond to the class labels in the legend. Each shoe is assigned descriptive labels, generated by our proposed MLLM agents, that capture its visual characteristics.


News

[March 2025] Our test data has been released!
[March 2025] We released the SEO-SLAM project page.

Abstract

Object Simultaneous Localization and Mapping (SLAM) systems struggle to correctly associate semantically similar objects in close proximity, especially in cluttered indoor environments and when scenes change. We present Semantic Enhancement for Object SLAM (SEO-SLAM), a novel framework that enhances semantic mapping by integrating heterogeneous multimodal large language model (MLLM) agents. Our method enables scene adaptation while maintaining a semantically rich map. To improve computational efficiency, we propose an asynchronous processing scheme that significantly reduces the agents' inference time without compromising semantic accuracy or SLAM performance. Additionally, we introduce a multi-data association strategy using a cost matrix that combines semantic and Mahalanobis distances, formulating the problem as a Linear Assignment Problem (LAP) to alleviate perceptual aliasing. Experimental results demonstrate that SEO-SLAM consistently achieves higher semantic accuracy and reduces false positives compared to baselines, while our asynchronous MLLM agents significantly improve processing efficiency over synchronous setups. We also demonstrate that SEO-SLAM has the potential to improve downstream tasks such as robotic assistance.
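The multi-data association strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal weighting `alpha`, the dictionary-based landmark/measurement structures, and the use of cosine distance between label embeddings as the semantic term are all assumptions made for the example; only the idea of combining Mahalanobis and semantic distances in one cost matrix and solving the resulting LAP comes from the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis(x, mu, cov):
    # Mahalanobis distance of point x from a landmark with mean mu, covariance cov.
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def semantic_dist(e1, e2):
    # Cosine distance between label embeddings (0 = identical, 1 = orthogonal).
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return 1.0 - float(e1 @ e2)

def associate(measurements, landmarks, alpha=0.5):
    # Build the combined cost matrix and solve the Linear Assignment Problem.
    # alpha trades off geometric vs. semantic cost (illustrative value).
    C = np.zeros((len(measurements), len(landmarks)))
    for i, m in enumerate(measurements):
        for j, l in enumerate(landmarks):
            C[i, j] = (alpha * mahalanobis(m["pos"], l["pos"], l["cov"])
                       + (1 - alpha) * semantic_dist(m["emb"], l["emb"]))
    rows, cols = linear_sum_assignment(C)  # Hungarian-style optimal assignment
    return list(zip(rows.tolist(), cols.tolist()))
```

Solving the assignment jointly over all measurement-landmark pairs, rather than greedily per measurement, is what helps disambiguate semantically similar objects in close proximity.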




Method


Overview of the SEO-SLAM system: The pipeline begins with an RGBD image input, from which odometry is derived. Object measurements from an open-vocabulary detector are fed into a factor graph, and the maximum a posteriori (MAP) estimate is obtained through factor graph optimization. Landmarks from the current map are projected onto the camera frame and overlaid on the image. This composite image is used as input for the MLLM agents, which evaluate each landmark based on the current scene. For example, if the agents report that \( l_{j+1} \) is an erroneous landmark, the factors related to \( l_{j+1} \) are removed. The yellow-colored region represents the primary SLAM pipeline, while the green-colored region denotes an asynchronous process that does not impact MAP runtime.
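The asynchronous scheme above can be sketched with a background worker thread: the SLAM loop hands composite images to the agents without blocking, and erroneous-landmark reports are consumed whenever they arrive. This is a hypothetical sketch, not the paper's code; the function names (`mllm_worker`, `run_slam`), the single-slot queue, and the frame-skipping policy when the agents are busy are all assumptions made for illustration.

```python
import queue
import threading

def mllm_worker(in_q, out_q, evaluate):
    # Background agent loop: consume composite images, emit erroneous landmark ids.
    while True:
        item = in_q.get()
        if item is None:          # sentinel: shut down
            break
        frame_id, composite = item
        out_q.put((frame_id, evaluate(composite)))

def run_slam(frames, evaluate):
    # Illustrative main loop; `evaluate` stands in for the MLLM agents.
    in_q, out_q = queue.Queue(maxsize=1), queue.Queue()
    worker = threading.Thread(target=mllm_worker, args=(in_q, out_q, evaluate),
                              daemon=True)
    worker.start()
    removed = []
    for i, frame in enumerate(frames):
        # ... primary pipeline: odometry, detection, factor-graph MAP update ...
        try:
            in_q.put_nowait((i, frame))   # hand off; never block the SLAM loop
        except queue.Full:
            pass                          # agents still busy: skip this frame
        while not out_q.empty():
            _, bad_ids = out_q.get_nowait()
            removed.extend(bad_ids)       # remove factors for these landmarks
    in_q.put(None)                        # signal shutdown, wait for worker
    worker.join()
    while not out_q.empty():              # drain any remaining reports
        removed.extend(out_q.get_nowait()[1])
    return removed
```

Because agent results are applied whenever they become available, the slow MLLM inference never sits on the critical path of the factor-graph optimization, which is the point of the green asynchronous region in the figure.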

Paper


SEO-SLAM

Jungseok Hong, Ran Choi, and John J. Leonard

Data
This webpage template was recycled from here.