LeGo-Drive: Language-enhanced Goal-oriented
Closed-Loop End-to-End Autonomous Driving

IIIT Hyderabad, University of Tartu

LeGo-Drive navigates to a language-based goal that is jointly optimized with the trajectory parameters. The goal predicted for a command like "Park near the bus stop on the front left" can fall at a non-desirable location (top-right: Green), which may lead to a collision-prone trajectory.
Since the trajectory is the only component that directly "interacts" with the environment, we propose making the perception module aware of the trajectory parameters, which refines the goal to a navigable location (bottom-right: Red).

Abstract

Existing Vision-Language Models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning, based on their rich scene comprehension. However, these estimations are coarse and are subject to the model's "world understanding", which may lead to sub-optimal decisions due to perception errors. In this paper, we introduce LeGo-Drive, which addresses this issue by estimating a goal location from the given language command as an intermediate representation in an end-to-end setting. The estimated goal may fall in a non-desirable region, such as on top of a car for a parking command, leading to inadequate planning. Hence, we propose to train the architecture in an end-to-end manner, resulting in iterative refinement of both the goal and the trajectory collectively. We validate the effectiveness of our method through comprehensive experiments conducted in diverse simulated environments. We report significant improvements in standard autonomous driving metrics, with a goal-reaching Success Rate of 81%. We further showcase the versatility of LeGo-Drive across different driving scenarios and linguistic inputs, underscoring its potential for practical deployment in autonomous vehicles and intelligent transportation systems.

The LeGo-Drive Architecture

End-to-End Approach

LeGo-Drive predicts a region-optimized goal location from the user-provided natural-language command, together with the trajectory parameters, while respecting scene constraints. This is achieved in two stages: first, a language-conditioned goal location is predicted within the predicted goal-region segmentation; this goal is then fed to the optimization-based downstream planner, which estimates the optimization parameters of the trajectory. Both modules are trained jointly by backpropagating the perception and planner losses in an end-to-end manner, as sketched below.
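The sketch below illustrates one way such a joint update could be wired in PyTorch. The module interfaces, batch keys, and loss terms are assumptions made for illustration, not the authors' released implementation; the key point is that the planner loss backpropagates into the goal prediction.

    # Minimal sketch of the end-to-end training step (assumed interfaces and losses).
    import torch
    import torch.nn.functional as F

    def joint_training_step(perception, planner, optimizer, batch,
                            w_perc=1.0, w_plan=1.0):
        """One end-to-end update: planner gradients flow back into the goal prediction."""
        # Perception: language-conditioned goal (and segmentation logits) from image + command.
        seg_logits, goal = perception(batch["image"], batch["command"])
        # Differentiable planner: trajectory conditioned on the predicted goal.
        trajectory = planner(goal, batch["ego_state"])

        # Perception loss: goal-region segmentation + goal regression (illustrative terms).
        loss_perc = (F.binary_cross_entropy_with_logits(seg_logits, batch["seg_mask"])
                     + F.mse_loss(goal, batch["goal_gt"]))
        # Planner loss: stands in for smoothness / collision / goal-reaching costs.
        loss_plan = F.mse_loss(trajectory, batch["traj_gt"])

        loss = w_perc * loss_perc + w_plan * loss_plan
        optimizer.zero_grad()
        loss.backward()   # backprop through the planner refines the goal location
        optimizer.step()
        return loss.detach()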


Case: Park-In

For a navigation command like "Park behind the bike on the front right", the initial goal prediction (in Green) from the perception module falls at a non-navigable location, i.e. on the curb edge. The model refines the goal to a reachable location (in Red) once the perception module is made "aware" of the downstream trajectory planner.

Case: Park-Out

The initial goal (in Green) generated for the input prompt "Park out while keeping a safe distance from the parked bike" may not yield a safe trajectory, potentially leading to collisions in the initial motion. However, when trained jointly with the planner, the system refines this goal (in Red) together with a differentiably optimized trajectory leading to it.

Case: Compound Command

The efficacy of the proposed approach is best seen in intricate cases where long-term motion is involved. A compound command like "Take a Right Turn at the intersection and Stop near the Food Stall" requires goal and trajectory estimation for each sub-task in sequence. This is carried out by breaking the input prompt into atomic commands, typically using an LLM, and executing them in order; the model is then queried once per atomic command, here twice. This further demonstrates the value of an interpretable intermediate representation within the end-to-end model, which helps mitigate the planner module's heavy dependence on perception, a dependence that could otherwise lead to impractical solutions stemming from perception inaccuracies. A sketch of this decomposition is given below.
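One possible realization of this decomposition is sketched below. The prompt wording and the function names (split_into_atomic_commands, lego_drive_step) are hypothetical; any instruction-following LLM could serve as the decomposer.

    # Hypothetical sketch of compound-command execution (names are illustrative,
    # not the released LeGo-Drive code).
    from typing import Callable, List

    def split_into_atomic_commands(compound_command: str,
                                   llm: Callable[[str], str]) -> List[str]:
        """Ask an LLM to decompose a compound instruction into ordered atomic commands."""
        prompt = ("Split the following driving instruction into an ordered list of "
                  "atomic commands, one per line:\n" + compound_command)
        response = llm(prompt)
        return [line.strip() for line in response.splitlines() if line.strip()]

    def execute_compound(compound_command: str, llm, lego_drive_step) -> None:
        """Query LeGo-Drive once per atomic command, in order."""
        for atomic in split_into_atomic_commands(compound_command, llm):
            goal, trajectory = lego_drive_step(atomic)  # predict goal + optimized trajectory
            # ... hand the trajectory to the controller and wait until the goal is reached ...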

Integration with LVLM

To extend the functionality toward generating high-level driving instructions tailored to the current scene, we employ GPT-4V and provide it with the front camera image and an engineered prompt that explains the driving setting and the available actions. GPT-4V generates a suggested instruction command based on its rationale, which is then forwarded to our pipeline for trajectory planning and execution. As illustrated, the vision-language model is able to determine the best course of action from a range of potential driving maneuvers. It accurately identifies an obstruction ahead and recommends "switching to the left lane" to continue moving forward. With this recommended action, our pipeline predicts a collision-free goal point and an optimized trajectory leading towards it. A sketch of such a query is given below.
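The sketch below shows how such a query could look with the OpenAI Python client. The model name, system prompt, and image handling are assumptions for illustration and not necessarily the exact setup used in the paper.

    # Sketch of querying a vision-language model with the front-camera image.
    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def suggest_driving_instruction(image_path: str) -> str:
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        system_prompt = ("You are assisting an autonomous vehicle. Given the front-camera "
                         "view, suggest one short driving instruction "
                         "(e.g. 'switch to the left lane').")
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed vision-capable model
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": "What should the vehicle do next?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ]},
            ],
        )
        # The returned instruction is forwarded to the goal prediction + planning pipeline.
        return response.choices[0].message.content.strip()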


BibTeX


@article{paul2024lego,
  title={LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving},
  author={Paul, Pranjal and Garg, Anant and Choudhary, Tushar and Singh, Arun Kumar and Krishna, K Madhava},
  journal={arXiv preprint arXiv:2403.20116},
  year={2024}
}