MIT CSAIL · MIT Marine Robotics Group

SEO-SLAM

Jungseok Hong, Ran Choi, John J. Leonard
MIT CSAIL
{jungseok,ranch0,jleonard}@mit.edu


Demonstration of SEO-SLAM's semantic mapping capabilities. (a) Initial mapping with generic labels. (b) Detections with descriptive labels from MLLM feedback, producing maps with more landmarks. (c) Estimated semantic map with all of the shoes associated successfully. (d) Updated semantic map after a scene change (the white shoe is removed). Top row: object detection results; middle row: estimated semantic maps; bottom row: landmarks projected onto the camera frames, used as the MLLM's input. This sequence illustrates SEO-SLAM's ability to refine object labels, update maps in cluttered environments, and adapt to scene changes.


News

[November 2024] Our test data is released!
[November 2024] We uploaded our paper to arXiv!
[September 2024] We released the SEO-SLAM project page.

Abstract

Semantic Simultaneous Localization and Mapping (SLAM) systems struggle to map semantically similar objects in close proximity, especially in cluttered indoor environments. We introduce Semantic Enhancement for Object SLAM (SEO-SLAM), a novel SLAM system that leverages Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) to enhance object-level semantic mapping in such environments. SEO-SLAM tackles existing challenges by (1) generating more specific and descriptive open-vocabulary object labels using MLLMs, (2) simultaneously correcting factors causing erroneous landmarks, and (3) dynamically updating a multiclass confusion matrix to mitigate object detector biases. Our approach enables more precise distinctions between similar objects and maintains map coherence by reflecting scene changes through MLLM feedback. We evaluate SEO-SLAM on our challenging dataset, demonstrating enhanced accuracy and robustness in environments with multiple similar objects. Our system outperforms existing approaches in terms of landmark matching accuracy and semantic consistency. Results show that MLLM feedback improves object-centric semantic mapping.
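
One of the mechanisms above, the dynamically updated multiclass confusion matrix, can be pictured with a small self-contained sketch: counts of (detected label, feedback-confirmed label) pairs are accumulated, and a row-normalized reliability score can then down-weight classes the detector tends to confuse. The class name, update rule, and weighting scheme below are illustrative assumptions, not the paper's exact formulation.

from collections import defaultdict

# Toy sketch of a dynamically updated multiclass confusion matrix used to
# mitigate detector bias. The update rule and the way the reliability score
# would be used are assumptions for illustration only.

class ConfusionMatrix:
    def __init__(self):
        # counts[detected_label][label_confirmed_by_feedback] = #occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, detected: str, confirmed: str) -> None:
        """Record one (detector output, MLLM-confirmed label) pair."""
        self.counts[detected][confirmed] += 1

    def reliability(self, detected: str) -> float:
        """Estimate P(confirmed == detected | detected); low values flag
        classes the detector systematically confuses."""
        row = self.counts[detected]
        total = sum(row.values())
        return row[detected] / total if total else 1.0  # uninformed prior

cm = ConfusionMatrix()
cm.update("shoe", "sandal")   # detector said "shoe", feedback said "sandal"
cm.update("shoe", "shoe")
cm.update("cup", "cup")
print(cm.reliability("shoe"))  # 0.5 -> "shoe" detections get less weight
print(cm.reliability("cup"))   # 1.0

Such a score could, for instance, scale the noise model of semantic measurement factors, though how SEO-SLAM actually incorporates the matrix is described in the paper, not here.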




Method


The pipeline begins with an RGBD image input, from which odometry is derived. RAM processes the RGB image to generate object tag lists. GroundingDINO and SAM then localize and segment objects based on these tags. Geometric and semantic information from the RGBD image and odometry are fed into a factor graph, and the MAP estimate is obtained through factor-graph optimization. Landmarks from the current map are projected onto the camera frame and overlaid on the image. This composite image is used as input to the MLLM, which provides feedback on each landmark based on the current scene. The MLLM's feedback is used to update: (1) the semantic label database for GroundingDINO, (2) the multiclass prediction confusion matrix, and (3) the list of erroneous factors. For example, if the feedback indicates that l_{j+1} is an erroneous landmark, the factors related to l_{j+1} are removed. A minimal sketch of this per-frame loop is given below.
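
The sketch below assumes hypothetical wrapper interfaces for RAM, GroundingDINO+SAM, the factor graph, and the MLLM; none of the names (tagger, detector, graph, mllm, fb.erroneous, fb.descriptive_label, ...) are the real APIs of those tools or the authors' implementation. It only shows how the stages described above could fit together in one iteration.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Landmark:
    lid: int                                   # landmark id (l_j in the text)
    label: str                                 # current open-vocabulary label
    factor_ids: List[int] = field(default_factory=list)  # factors touching it

def process_frame(rgb, depth, odom,
                  tagger: Callable,            # assumed RAM-style image tagger
                  detector: Callable,          # assumed GroundingDINO + SAM wrapper
                  graph,                       # assumed factor-graph front end
                  mllm,                        # assumed multimodal LLM client
                  label_db: set,
                  confusion: Dict[str, Dict[str, int]]) -> Dict[int, Landmark]:
    # 1. Generate object tags from the RGB image.
    tags = tagger(rgb, extra_labels=label_db)

    # 2. Localize and segment objects conditioned on those tags.
    detections = detector(rgb, tags)

    # 3. Feed odometry and geometric/semantic measurements into the factor
    #    graph, then run MAP estimation via factor-graph optimization.
    graph.add_odometry(odom)
    for det in detections:
        graph.add_object_measurement(det, depth, confusion)
    landmarks = graph.optimize()

    # 4. Project the current landmarks onto the camera frame; the composite
    #    image is the MLLM's input for per-landmark feedback.
    composite = graph.render_overlay(rgb, landmarks)
    feedback = mllm.review(composite, landmarks)

    # 5. Apply the feedback: refine the label database, update the multiclass
    #    confusion matrix, and remove factors tied to erroneous landmarks.
    for fb in feedback:
        lm = landmarks[fb.lid]
        if fb.erroneous:
            graph.remove_factors(lm.factor_ids)
            del landmarks[fb.lid]
        else:
            label_db.add(fb.descriptive_label)
            row = confusion.setdefault(lm.label, {})
            row[fb.descriptive_label] = row.get(fb.descriptive_label, 0) + 1
            lm.label = fb.descriptive_label
    return landmarks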

Paper


Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models

Jungseok Hong, Ran Choi, and John J. Leonard

arXiv version
BibTeX
Data

Citation


@inproceedings{hong2024seoslam,
    title={Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models},
    author={Jungseok Hong and Ran Choi and John J. Leonard},
    booktitle={arXiv preprint},
    year={2024}
}
This webpage template was recycled from here.