A sample semantic mapping result from SEO-SLAM. The plot shows a 3D map of the scene in the RGB image, including landmark positions and the camera trajectory; numbers and colors denote the class labels listed in the legend. Each shoe is assigned a descriptive label, generated by our proposed MLLM agents, that captures its visual characteristics.
Object Simultaneous Localization and Mapping (SLAM) systems struggle to correctly associate semantically similar objects in close proximity, especially in cluttered indoor environments and when scenes change. We present Semantic Enhancement for Object SLAM (SEO-SLAM), a novel framework that enhances semantic mapping by integrating heterogeneous multimodal large language model (MLLM) agents. Our method enables scene adaptation while maintaining a semantically rich map. To improve computational efficiency, we propose an asynchronous processing scheme that significantly reduces the agents' inference time without compromising semantic accuracy or SLAM performance. Additionally, we introduce a multi-data association strategy using a cost matrix that combines semantic and Mahalanobis distances, formulating the problem as a Linear Assignment Problem (LAP) to alleviate perceptual aliasing. Experimental results demonstrate that SEO-SLAM consistently achieves higher semantic accuracy and reduces false positives compared to baselines, while our asynchronous MLLM agents significantly improve processing efficiency over synchronous setups. We also demonstrate that SEO-SLAM has the potential to improve downstream tasks such as robotic assistance.
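To make the data association idea concrete, the following is a minimal sketch of a cost matrix that combines a semantic distance with a Mahalanobis distance and is then solved as a Linear Assignment Problem. It is an illustration under stated assumptions, not the SEO-SLAM implementation: the equal weighting `w_sem`, the cosine distance over toy label embeddings, the diagonal covariance, and all function names are hypothetical. The assignment is found by exhaustive enumeration, which is exact but only practical for small problems; a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would normally be used instead.

```python
import math
from itertools import permutations


def mahalanobis_diag(z, mean, var):
    """Mahalanobis distance assuming a diagonal covariance (a simplification)."""
    return math.sqrt(sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mean, var)))


def cosine_distance(a, b):
    """Semantic distance between two embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_cost_matrix(measurements, landmarks, w_sem=0.5):
    """Rows are measurements, columns are landmarks. w_sem is a hypothetical weight."""
    return [
        [
            w_sem * cosine_distance(m["emb"], l["emb"])
            + (1.0 - w_sem) * mahalanobis_diag(m["pos"], l["mean"], l["var"])
            for l in landmarks
        ]
        for m in measurements
    ]


def solve_lap(cost):
    """Exact one-to-one assignment by enumeration (needs rows <= columns)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best_total, best_cols = math.inf, None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    return list(best_cols), best_total


# Two similar objects 20 cm apart: landmark 0 (e.g. a red sneaker) and
# landmark 1 (e.g. a black boot), with toy 2D label embeddings.
landmarks = [
    {"mean": (0.0, 0.0, 1.0), "var": (0.04, 0.04, 0.04), "emb": (1.0, 0.0)},
    {"mean": (0.2, 0.0, 1.0), "var": (0.04, 0.04, 0.04), "emb": (0.0, 1.0)},
]
# Both measurements lie geometrically closer to landmark 1.
measurements = [
    {"pos": (0.11, 0.0, 1.0), "emb": (0.1, 0.9)},  # looks like the boot
    {"pos": (0.12, 0.0, 1.0), "emb": (0.9, 0.1)},  # looks like the sneaker
]

# Geometry-only nearest neighbour maps both measurements to the same landmark.
nearest_only = [
    min(range(len(landmarks)),
        key=lambda j: mahalanobis_diag(m["pos"], landmarks[j]["mean"], landmarks[j]["var"]))
    for m in measurements
]

cost = build_cost_matrix(measurements, landmarks)
assignment, total_cost = solve_lap(cost)
print("nearest neighbour:", nearest_only)
print("LAP assignment:   ", assignment)
```

In this toy scene the geometry-only association collapses both measurements onto one landmark, while the combined cost with a one-to-one constraint separates them, which is the kind of perceptual aliasing the strategy targets.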
Overview of the SEO-SLAM system. The pipeline begins with an RGBD image input, from which odometry is derived. Object measurements from an open-vocabulary detector are added to a factor graph, and a maximum a posteriori (MAP) estimate is obtained through factor graph optimization. Landmarks from the current map are projected onto the camera frame and overlaid on the image. This composite image is the input to the MLLM agents, which evaluate each landmark against the current scene. For example, if the agents report that \( l_{j+1} \) is an erroneous landmark, the factors related to \( l_{j+1} \) are removed. The yellow region represents the primary SLAM pipeline, while the green region denotes an asynchronous process that does not affect MAP runtime.
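The asynchronous feedback loop described in this overview can be sketched as follows. This is a toy illustration under assumptions, not the authors' code: `evaluate_landmarks` is a hypothetical stand-in for the MLLM agents (it flags landmarks with a simple rule instead of inspecting a composite image), the factor graph is a plain dictionary, and the scheduling policy of one outstanding request at a time is an assumption. The point it shows is that the main loop never waits for the agents; it only applies their verdict once it is available.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_landmarks(snapshot):
    """Hypothetical stand-in for the MLLM agents: return ids judged erroneous."""
    return [lid for lid, info in snapshot.items() if not info["visible"]]


def remove_landmark(graph, lid):
    """Drop a landmark and every factor that refers to it."""
    graph["landmarks"].pop(lid, None)
    graph["factors"] = [f for f in graph["factors"] if f["landmark"] != lid]


def run(frames):
    graph = {"landmarks": {}, "factors": []}
    pending = None  # at most one outstanding agent request (assumed policy)
    with ThreadPoolExecutor(max_workers=1) as pool:
        for k, detections in enumerate(frames):
            # Primary SLAM pipeline: add object measurements as factors.
            for lid, visible in detections:
                graph["landmarks"][lid] = {"visible": visible}
                graph["factors"].append({"pose": k, "landmark": lid})

            # Apply agent feedback only if it is ready; never block the loop.
            if pending is not None and pending.done():
                for lid in pending.result():
                    remove_landmark(graph, lid)
                pending = None

            # Hand a copy of the current map to the background evaluation.
            if pending is None:
                snapshot = {lid: dict(info) for lid, info in graph["landmarks"].items()}
                pending = pool.submit(evaluate_landmarks, snapshot)

        # Drain the last request so the example ends in a deterministic state.
        if pending is not None:
            for lid in pending.result():
                remove_landmark(graph, lid)
    return graph


# Landmark 1 is a spurious detection seen only in the first frame.
frames = [
    [(0, True), (1, False)],
    [(0, True)],
    [(0, True), (2, True)],
]
final_graph = run(frames)
print(sorted(final_graph["landmarks"]))
print(len(final_graph["factors"]))
```

Passing a copy of the map to the worker keeps the background evaluation from reading state that the main loop is still modifying, at the cost of the agents occasionally judging a slightly stale map.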