SparseLoc: Sparse Open-Set
Landmark-based Global Localization
for Autonomous Navigation

¹IIIT Hyderabad, ²University of Texas at Austin, ³FAIR, Meta, ⁴Ati Motors

SparseLoc localizes in city-scale environments using sparse maps. The system achieves sparsity by augmenting metric maps with semantic features from vision-language foundation models, following the idea of retaining only a sufficient number of informative points for the downstream tasks of localization and navigation. SparseLoc then runs a Monte Carlo localization scheme that uses the same semantic features for data association. The system is evaluated primarily on the KITTI dataset and achieves an average localization error of 5 m and 2° while using <1% of the dense map.

Abstract

SparseLoc is a global localization framework for city-scale scenarios that leverages vision-language foundation models to generate sparse semantic topometric maps in a zero-shot manner. It combines this map representation with a Monte Carlo localization scheme enhanced by a novel late optimization strategy, yielding improved pose estimation.

By constructing compact yet highly discriminative maps and refining localization through a carefully designed optimization schedule, SparseLoc overcomes the limitations of existing topometric localization methods, offering a more practical solution for global localization at city scale. Our system achieves over a 5× improvement in localization accuracy compared to existing sparse mapping techniques. Despite using only 1/500th of the points of dense mapping methods, it achieves comparable performance, maintaining an average global localization error below 5 m and 2° on KITTI sequences.
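To make the sparse map concrete, here is a minimal sketch of the kind of semantic topometric map entry described above. The field names (position, label, feature) are illustrative assumptions, not the authors' actual data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticLandmark:
    position: np.ndarray  # 3D point in the global map frame (metres)
    label: str            # open-set landmark class, e.g. "traffic light"
    feature: np.ndarray   # vision-language embedding used for data association

# A city-scale map is then just a short list of such landmarks,
# roughly 1/500th the size of a dense point-cloud map.
sparse_map: list[SemanticLandmark] = []
```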

Video

Global Run

Global localization with SparseLoc demonstrates robust performance, maintaining consistent position tracking throughout the city-level map. Unlike other sparse localization techniques, which often struggle with perceptual aliasing and frequently lose track, our approach rapidly resolves ambiguities and maintains localization integrity across the entire trajectory, a critical capability for reliable autonomous navigation in urban environments.

Visualizing Correspondences

Correspondence visualization

SparseLoc localizes by associating detected landmarks with map points that have been augmented with foundation-model features. During localization, for each particle, the system extracts candidate landmarks from the map based on similarity scores and matches them with the detected landmarks.
The image below shows candidate landmarks extracted from the map for several particles, represented as red spheres, along with connecting lines for correspondences. The best matches are highlighted as green spheres.
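A hedged sketch of this data-association step is below: detected landmark features are matched to map features by cosine similarity. The threshold parameter and the greedy nearest-match strategy are assumptions for illustration; the actual system may score and match differently.

```python
import numpy as np

def match_landmarks(detected_feats, map_feats, threshold=0.8):
    """Associate detected landmarks with map points by cosine similarity.

    detected_feats: (n, d) array of features for detected landmarks.
    map_feats: (m, d) array of features for candidate map points.
    Returns a list of (detection_index, map_index) correspondences.
    """
    det = detected_feats / np.linalg.norm(detected_feats, axis=1, keepdims=True)
    mp = map_feats / np.linalg.norm(map_feats, axis=1, keepdims=True)
    sim = det @ mp.T                           # (n, m) similarity matrix
    best = sim.argmax(axis=1)                  # best map candidate per detection
    keep = sim[np.arange(len(det)), best] > threshold
    return [(i, j) for i, (j, ok) in enumerate(zip(best, keep)) if ok]
```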

Global (Re-)Localization

SparseLoc can recover from any kidnapped-robot scenario while maintaining stability after the particle filter converges. The figure shows SparseLoc's localization performance on KITTI Sequences 00 and 05. Since our framework is built on a particle filter, it can localize and relocalize after travelling only a few meters.
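As a rough illustration of how a particle filter recovers from kidnapping, the sketch below scatters particles uniformly over the map, the standard Monte Carlo localization reinitialization; the paper's exact recovery strategy may differ.

```python
import numpy as np

def reinitialize_particles(num_particles, map_bounds, rng=None):
    """Scatter particles uniformly over the map to recover from kidnapping.

    map_bounds = (x_min, x_max, y_min, y_max) in map coordinates.
    Returns (particles, weights) where particles is (N, 3): x, y, heading.
    """
    rng = rng or np.random.default_rng()
    x_min, x_max, y_min, y_max = map_bounds
    xs = rng.uniform(x_min, x_max, num_particles)
    ys = rng.uniform(y_min, y_max, num_particles)
    thetas = rng.uniform(-np.pi, np.pi, num_particles)
    weights = np.full(num_particles, 1.0 / num_particles)
    return np.stack([xs, ys, thetas], axis=1), weights
```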

Visualizing Convergence

The visualization begins with randomly distributed particles (red coordinate axes) scattered across the map. As the ego-vehicle (green car) starts to move, SparseLoc's observation likelihood refines the particle distribution based on correspondences between detected landmarks and map points.

Watch how quickly the particles converge toward the actual position, with the red car representing our system's estimated pose. This demonstrates SparseLoc's ability to efficiently narrow down location hypotheses using minimal semantic cues in a highly sparse map, ultimately achieving reliable global localization.
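The weight update driving this convergence can be sketched as follows. The exponential likelihood model built from per-particle match scores is an assumption for illustration, not necessarily the paper's exact observation model.

```python
import numpy as np

def update_weights(weights, match_scores):
    """Re-weight particles by how well their predicted landmarks match.

    match_scores[i] is an aggregate similarity score for particle i,
    e.g. the sum of cosine similarities over its accepted correspondences.
    """
    likelihood = np.exp(match_scores - match_scores.max())  # numerically stable
    new_weights = weights * likelihood
    return new_weights / new_weights.sum()
```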

Use the slider to visualize the particle filter's convergence process on KITTI Sequence 00.


Language-Landmark Database

Our framework taps into the capabilities of open-world perception models for localization through intuitive, zero-shot prompting to identify static landmarks in the scene. We use Llama-3.2-Vision to generate the language-landmark database. The VLM showed impressive semantic understanding, automatically producing a landmark database (shown in the image below) that proved both sufficient and distinctive, working effectively across multiple KITTI sequences without any changes.
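A hedged sketch of how such a database might be seeded by zero-shot prompting is shown below. The prompt wording and the query_vlm helper are hypothetical placeholders around the vision-language model; the paper's actual prompt to Llama-3.2-Vision may differ.

```python
# Illustrative prompt; the authors' actual prompt is not reproduced here.
PROMPT = (
    "List the static, permanently fixed landmarks visible in this street "
    "scene (e.g. traffic lights, poles, buildings). Exclude anything "
    "movable such as cars or pedestrians. Answer as a comma-separated list."
)

def build_landmark_database(images, query_vlm):
    """Aggregate zero-shot VLM answers into a deduplicated landmark set.

    query_vlm(image, prompt) -> str is a hypothetical wrapper around the
    vision-language model (e.g. Llama-3.2-Vision).
    """
    database = set()
    for image in images:
        answer = query_vlm(image, PROMPT)
        database.update(label.strip().lower() for label in answer.split(","))
    return sorted(database)
```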

The inherent sparsity of our landmark-based mapping and localization approach introduces a challenge: some landmarks, such as trees, appear with high frequency and dominate others. This uneven distribution leads to perceptual aliasing that continuously challenges the particle filter. Despite this, our framework achieves strong localization accuracy by exploiting the multi-hypothesis capability of the particle filter.

Creating the language-landmark database by querying a large vision-language model