LeGo-Drive
Language-enhanced Goal-oriented
Closed-Loop End-to-End Autonomous Driving

¹IIIT Hyderabad, ²University of Tartu
*Co-first authors. Work done prior to current affiliation.

Accepted at IROS 2024
Abu Dhabi, UAE

LeGo-Drive navigates to a language-based goal that is jointly optimized with the trajectory parameters. The goal predicted for a command like "Park near the bus stop on the front left" can fall at an undesirable location (top-right: Green), which may lead to a collision-prone trajectory.
Since the trajectory is the only component that directly "interacts" with the environment, we propose making the perception aware of the trajectory parameters, which shifts the goal to a navigable location (bottom-right: Red).

Abstract

Existing Vision-Language Models (VLMs) produce long-term trajectory waypoints or direct control actions from their perception input and a language prompt. However, these VLMs are not explicitly aware of the constraints imposed by the scene or the kinematics of the vehicle. As a result, the generated trajectories or control inputs are likely to be unsafe and/or infeasible. In this paper, we introduce LeGo-Drive to address these issues. Our key idea is to use the VLM only to predict a goal location from the given language command and perception input, which is then fed to a downstream differentiable trajectory optimizer with learnable components. We train the VLM and the trajectory optimizer end-to-end with a loss function that captures the ego-vehicle's ability to reach the predicted goal while satisfying safety and kinematic constraints. The gradients during back-propagation flow through the optimization layer and make the VLM aware of the planner's capabilities, leading to more feasible goal predictions. We compare our end-to-end approach with a decoupled framework in which the planner is used only at inference time to drive to the VLM-predicted goal location, and report a goal-reaching Success Rate of 81%. We demonstrate the versatility of LeGo-Drive across various driving scenarios and navigation commands, highlighting its potential for practical deployment in autonomous vehicles.
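The key mechanism, gradients flowing through the optimization layer back into the goal prediction, can be illustrated with a minimal, self-contained sketch. Everything below (the unrolled inner loop, the toy smoothness and collision costs, all names) is an illustrative assumption, not the paper's actual solver.

import torch
import torch.nn as nn

def planner_cost(traj, obstacles, safe_dist=2.0):
    # Toy surrogate for the planner objective: smoothness + collision hinge.
    smooth = (traj[1:] - traj[:-1]).pow(2).sum()
    dist = torch.cdist(traj, obstacles)                  # (T, num_obstacles)
    collision = torch.relu(safe_dist - dist).pow(2).sum()
    return smooth + 10.0 * collision

class UnrolledPlanner(nn.Module):
    # Refines a straight-line initial guess with a few unrolled gradient
    # steps; create_graph=True keeps d(traj)/d(goal) available for backprop.
    def __init__(self, horizon=20, inner_steps=15, lr=0.05):
        super().__init__()
        self.horizon, self.inner_steps, self.lr = horizon, inner_steps, lr

    def forward(self, start, goal, obstacles):
        alpha = torch.linspace(0.0, 1.0, self.horizon).unsqueeze(-1)
        traj = start + alpha * (goal - start)            # (T, 2), a function of goal
        for _ in range(self.inner_steps):
            grad, = torch.autograd.grad(planner_cost(traj, obstacles),
                                        traj, create_graph=True)
            traj = traj - self.lr * grad                 # differentiable update
        return traj

# The trajectory cost yields a non-zero gradient at the predicted goal,
# i.e. the goal itself can be nudged toward a feasible location.
start = torch.zeros(2)
goal = torch.tensor([10.0, 4.0], requires_grad=True)    # stand-in for the VLM's goal head
obstacles = torch.tensor([[5.0, 2.0], [7.0, 3.0]])
traj = UnrolledPlanner()(start, goal, obstacles)
planner_cost(traj, obstacles).backward()
print(goal.grad)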

The LeGo-Drive Architecture

End-to-End Approach

LeGo-Drive predicts a region-optimized goal location from the user-provided natural-language command, jointly with the trajectory parameters, while respecting the scene constraints. This is achieved in two stages: first, a language-conditioned goal location is predicted within the predicted goal-region segmentation; this goal is then fed to the optimization-based downstream planner, which estimates the optimization parameters of the trajectory. Both modules are trained jointly by backpropagating the perception and planner losses in an end-to-end manner, as sketched below.
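A hedged sketch of one joint training step, reusing planner_cost and the planner from the sketch above; the module interfaces, batch keys, and unit loss weights are assumptions rather than the system's actual configuration.

import torch.nn.functional as F

def train_step(batch, perception, planner, optimizer):
    # Perception: goal-region segmentation logits and a goal location
    # predicted from the image and the language command.
    seg_logits, goal = perception(batch["image"], batch["command"])
    # Planner: differentiable trajectory to the predicted goal.
    traj = planner(batch["ego_state"], goal, batch["obstacles"])

    loss = (
        F.cross_entropy(seg_logits, batch["seg_gt"])     # segmentation supervision
        + F.mse_loss(goal, batch["goal_gt"])             # goal regression
        + planner_cost(traj, batch["obstacles"])         # safety/kinematics surrogate
    )
    optimizer.zero_grad()
    loss.backward()      # planner gradients also update the perception weights
    optimizer.step()
    return loss.item()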


Case: Park-In

For a navigation command like "Park behind the bike on the front right", the initial goal prediction (in Green) from the perception module falls at a non-navigable location, i.e., on the curb edge. Once the perception module is made "aware" of the downstream trajectory planner, the model improves the goal to a reachable location (in Red).

Case: Park-Out

The initial goal (in Green) generated for the input prompt "Park out while keeping a safe distance from the parked bike" might not ensure a safe trajectory, potentially leading to collisions at the start of the maneuver. When trained jointly with the planner, however, the system refines this goal (in Red) together with a differentiably optimized trajectory leading to it.

Case: Compound Command

The efficacy of the proposed approach is best seen in intricate cases involving long-term motion. A compound command like "Take a Right Turn at the intersection and Stop near the Food Stall" requires fine goal and trajectory estimation for each segment in turn. This is handled by breaking the input prompt into atomic commands, typically using an LLM, and executing them in order; the model is queried once per atomic command, here twice. This further demonstrates the value of an interpretable intermediate representation within the end-to-end model, which mitigates the planner module's heavy dependence on perception; such dependence could otherwise lead to impractical solutions stemming from perception inaccuracies.
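In code, the in-order execution could look like the following sketch; decompose_command stands in for the LLM-based splitter, and the ego-vehicle interface (front_camera, state, obstacles, follow) is hypothetical.

def decompose_command(prompt: str) -> list[str]:
    # Placeholder for an LLM-based splitter; a naive split suffices here.
    return [part.strip() for part in prompt.split(" and ")]

def execute_compound(prompt, perception, planner, ego):
    # Query the model once per atomic command; drive each leg to completion
    # before issuing the next one.
    for atomic in decompose_command(prompt):
        _, goal = perception(ego.front_camera(), atomic)
        traj = planner(ego.state(), goal, ego.obstacles())
        ego.follow(traj)

# "Take a Right Turn at the intersection and Stop near the Food Stall"
# -> ["Take a Right Turn at the intersection", "Stop near the Food Stall"]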

Integration with LVLM

To generate high-level driving instructions tailored to the current scene, we employ GPT-4V, providing it with the front-camera image and an engineered prompt that explains the driving setting and the available actions. GPT-4V generates a suggested instruction command based on its rationale, which is then forwarded to our pipeline for trajectory planning and execution. As illustrated, the vision-language model is able to determine the best course of action from a range of potential driving maneuvers: it accurately identifies an obstruction ahead and recommends "switching to the left lane" to continue moving forward. With this recommended action, our pipeline predicts a collision-free goal point and an optimized trajectory leading towards it.
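A minimal sketch of this integration using the OpenAI Python client; the prompt text, model choice, and function names are assumptions (a condensed stand-in for the engineered prompt), and the returned instruction would then be passed to the goal-prediction pipeline above.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are assisting an autonomous vehicle. Given a front-camera image, "
    "reply with one driving instruction, e.g. 'switch to the left lane', "
    "'keep straight', or 'stop behind the vehicle ahead'."
)

def suggest_instruction(image_path: str) -> str:
    # Encode the front-camera frame and ask for a single high-level action.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4 vision-capable model works here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Suggest the next driving instruction."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content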


BibTeX


@article{paul2024lego,
  title={LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving},
  author={Paul, Pranjal and Garg, Anant and Choudhary, Tushar and Singh, Arun Kumar and Krishna, K Madhava},
  journal={arXiv preprint arXiv:2403.20116},
  year={2024}
}