CogNav

CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Yihan Cao¹*, Jiazhao Zhang²*, Zhinan Yu¹, Shuzhen Liu¹, Zheng Qin³, Qin Zou⁴, Bo Du⁴, Kai Xu¹†

¹ College of Computer Science and Technology, National University of Defense Technology
² CFCS, School of Computer Science, Peking University
³ Defense Innovation Institute, Academy of Military Sciences
⁴ School of Computer Science, Wuhan University
* Indicates equal contribution. † Indicates equal advising.

ICCV 2025

Abstract

Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the ObjectNav success rate, by at least 14% in relative terms over the state of the art.

Method Overview

Overview of CogNav. Our method takes a posed RGB-D sequence as input and incrementally constructs an online cognitive map comprising a scene graph, a landmark graph, and an occupancy map. We then perform cognitive map prompting: the cognitive information and the goal object are encoded into a text prompt, which is used to query the LLM for the next cognitive state. Based on that state, the LLM is queried again to select a landmark to guide the robot. A deterministic local planner then generates a path to the selected landmark.
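The two-stage querying described above (first the next cognitive state, then a landmark) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the state names, the `CognitiveMap` fields, the prompt wording, and the helpers `build_prompt`, `next_state`, `select_landmark`, and `stub_llm` are all hypothetical, and the LLM is replaced by a plain callable so the sketch runs without any external service.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class CognitiveState(Enum):
    # Hypothetical names: the source only says the states range
    # "from exploration to identification".
    EXPLORATION = "exploration"
    VERIFICATION = "verification"
    IDENTIFICATION = "identification"


@dataclass
class CognitiveMap:
    # Simplified stand-ins for the scene graph, landmark graph, and occupancy map.
    scene_graph: Dict[str, List[str]] = field(default_factory=dict)  # object -> related objects
    landmarks: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # name -> (x, y)
    explored_ratio: float = 0.0  # crude summary of the occupancy map


LLM = Callable[[str], str]  # any function mapping a prompt to a text reply


def build_prompt(cmap: CognitiveMap, goal: str, state: CognitiveState) -> str:
    """Encode the cognitive map and goal object as text (assumed format)."""
    objects = ", ".join(sorted(cmap.scene_graph)) or "none"
    landmarks = ", ".join(sorted(cmap.landmarks)) or "none"
    return (
        f"Goal object: {goal}\n"
        f"Current state: {state.value}\n"
        f"Observed objects: {objects}\n"
        f"Landmarks: {landmarks}\n"
        f"Explored fraction: {cmap.explored_ratio:.2f}\n"
    )


def next_state(llm: LLM, cmap: CognitiveMap, goal: str,
               state: CognitiveState) -> CognitiveState:
    """First query: ask the LLM for the next cognitive state."""
    options = ", ".join(s.value for s in CognitiveState)
    reply = llm(build_prompt(cmap, goal, state)
                + f"Reply with exactly one of: {options}").strip().lower()
    for candidate in CognitiveState:
        if candidate.value == reply:
            return candidate
    return state  # keep the current state if the reply is not a valid state


def select_landmark(llm: LLM, cmap: CognitiveMap, goal: str,
                    state: CognitiveState) -> str:
    """Second query: ask the LLM which landmark to navigate to."""
    if not cmap.landmarks:
        raise ValueError("no landmarks available")
    reply = llm(build_prompt(cmap, goal, state)
                + "Select one landmark by name.").strip()
    # Fall back to the first landmark (alphabetically) on an invalid reply.
    return reply if reply in cmap.landmarks else sorted(cmap.landmarks)[0]


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM, for illustration only."""
    if "Select one landmark" in prompt:
        return "sofa_area"
    observed = prompt.split("Observed objects: ")[1].splitlines()[0]
    return "verification" if "tv" in observed.split(", ") else "exploration"


demo_map = CognitiveMap(
    scene_graph={"sofa": ["tv"], "tv": ["sofa"]},
    landmarks={"sofa_area": (1.0, 2.0), "hallway": (4.0, 0.5)},
    explored_ratio=0.3,
)
demo_state = next_state(stub_llm, demo_map, "tv", CognitiveState.EXPLORATION)
demo_target = select_landmark(stub_llm, demo_map, "tv", demo_state)
```

In a full system the selected landmark would be handed to the deterministic local planner, and the loop would repeat as new RGB-D frames update the map. Falling back to the current state on an unparseable reply is a design choice of this sketch, made so that a malformed LLM answer cannot push the state machine into an undefined state.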

BibTeX

@article{cao2024cognav,
  title={CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs},
  author={Cao, Yihan and Zhang, Jiazhao and Yu, Zhinan and Liu, Shuzhen and Qin, Zheng and Zou, Qin and Du, Bo and Xu, Kai},
  journal={arXiv preprint arXiv:2412.10439},
  year={2024}
}
