CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Yihan Cao1,*       Jiazhao Zhang2,*       Zhinan Yu1       Shuzhen Liu1 Zheng Qin3      Qin Zou4      Bo Du4      Kai Xu1,†
1College of Computer Science and Technology, National University of Defense Technology       2CFCS, School of Computer Science, Peking University       3Defense Innovation Institute, Academy of Military Sciences       4School of Computer Science, Wuhan University       *Indicates Equal Contribution, †Indicates Equal Advising.

Contributions

  • CogNav models the cognitive process of ObjectNav by exploiting the commonsense and spatial reasoning capabilities of LLMs.
  • CogNav defines fine-grained cognitive states and designs prompts that let an LLM perform grounded reasoning about state transitions.
  • CogNav builds a heterogeneous cognitive map that is constructed online and can be corrected by prompting an LLM to keep map accuracy high.
  • CogNav achieves state-of-the-art performance in simulated environments and exhibits strong generalizability in real-world environments.

Demos of CogNav in Real-world Environments

Abstract

Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling improves the success rate of ObjectNav by at least 14% (relative) over the state of the art.

Method Overview

CogNav

Overview of CogNav. Our method takes a posed RGB-D sequence as input and incrementally constructs an online cognitive map comprising a scene graph, a landmark graph, and an occupancy map. We then perform cognitive map prompting: the cognitive map information and the goal object are encoded into a text prompt, which is used to query the LLM for the next cognitive state. Based on that state, the LLM is queried again to select a landmark to guide the robot. A deterministic local planner then generates a path to the selected landmark.
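The cognitive map prompting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`SceneObject`, `Landmark`, `CognitiveMap`), the prompt wording, the 2 m neighbourhood radius, and the fallback rule are all assumptions made for illustration, and `query_llm` is a hypothetical callable standing in for whichever LLM interface is used. The occupancy map is stored but not serialized here, since it would mainly serve the local planner.

```python
import math
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneObject:
    label: str
    position: Tuple[float, float]  # (x, y) in metres, map frame (assumed convention)


@dataclass
class Landmark:
    landmark_id: int
    position: Tuple[float, float]


@dataclass
class CognitiveMap:
    # Hypothetical container for the three map components named in the overview.
    objects: List[SceneObject] = field(default_factory=list)   # scene graph nodes
    landmarks: List[Landmark] = field(default_factory=list)    # landmark graph nodes
    occupancy: List[List[int]] = field(default_factory=list)   # 0 free, 1 occupied, -1 unknown


def describe_landmark(cmap: CognitiveMap, lm: Landmark, radius: float = 2.0) -> str:
    """Summarize which object labels lie within `radius` metres of a landmark."""
    near = sorted({o.label for o in cmap.objects
                   if math.dist(o.position, lm.position) <= radius})
    seen = ", ".join(near) if near else "nothing observed yet"
    return (f"Landmark {lm.landmark_id} at "
            f"({lm.position[0]:.1f}, {lm.position[1]:.1f}): near {seen}")


def build_prompt(cmap: CognitiveMap, goal: str, state: str) -> str:
    """Encode the map, goal object, and current cognitive state as plain text."""
    lines = [f"Goal object: {goal}", f"Cognitive state: {state}", "Landmarks:"]
    lines += [describe_landmark(cmap, lm) for lm in cmap.landmarks]
    lines.append("Reply with the id of the landmark to visit next.")
    return "\n".join(lines)


def select_landmark(cmap: CognitiveMap, goal: str, state: str,
                    query_llm: Callable[[str], str]) -> Landmark:
    """Ask the (stubbed) LLM for a landmark id; fall back to the first landmark."""
    reply = query_llm(build_prompt(cmap, goal, state))
    match = re.search(r"\d+", reply)
    by_id = {lm.landmark_id: lm for lm in cmap.landmarks}
    if match and int(match.group()) in by_id:
        return by_id[int(match.group())]
    return cmap.landmarks[0]  # assumed fallback when the reply cannot be parsed
```

In this sketch the selected landmark would then be handed to a deterministic local planner, which is outside the scope of the example. Parsing the reply defensively matters because free-form LLM output may not contain a valid landmark id.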

Examples of Cognitive State Transition

CogNav
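As the abstract describes, the cognitive process is modeled as a finite state machine whose transitions are decided by an LLM. The following Python sketch shows one way such a transition step could be structured. The three state names, the table of allowed transitions, and the prompt format are hypothetical (the text only says the states range from exploration to identification), and `query_llm` is a stand-in callable rather than a real API.

```python
from enum import Enum
from typing import Callable, Dict, Set


class CognitiveState(Enum):
    # Illustrative state names only; the actual fine-grained states are not listed here.
    EXPLORE = "explore"
    INSPECT_CANDIDATE = "inspect_candidate"
    IDENTIFY = "identify"


# Assumed transition table: which states may follow each state.
ALLOWED: Dict[CognitiveState, Set[CognitiveState]] = {
    CognitiveState.EXPLORE: {CognitiveState.EXPLORE,
                             CognitiveState.INSPECT_CANDIDATE},
    CognitiveState.INSPECT_CANDIDATE: {CognitiveState.EXPLORE,
                                       CognitiveState.INSPECT_CANDIDATE,
                                       CognitiveState.IDENTIFY},
    CognitiveState.IDENTIFY: {CognitiveState.IDENTIFY},
}


def next_state(current: CognitiveState, map_summary: str, goal: str,
               query_llm: Callable[[str], str]) -> CognitiveState:
    """Query the (stubbed) LLM for the next state and validate its answer."""
    options = ", ".join(s.value for s in CognitiveState)
    prompt = (f"Goal: {goal}\n"
              f"Current state: {current.value}\n"
              f"Map: {map_summary}\n"
              f"Answer with one of: {options}")
    reply = query_llm(prompt).strip().lower()
    for state in CognitiveState:
        if state.value == reply and state in ALLOWED[current]:
            return state
    # Unparseable or disallowed replies keep the agent in its current state.
    return current
```

Validating the reply against a transition table is a design choice of this sketch: it keeps the state machine well defined even when the LLM returns an unexpected answer.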

Results of CogNav on HM3D

Results of CogNav on MP3D

BibTeX