Language and Perception for Robotics

Projects

Open-Set Semantic-State Estimation with Large Multimodal Models (Work in progress for RA-L 2025)

Problem: Identifying semantic states of the environment without predefined task domains
Challenge: Estimated states need to be relevant to the task and consistent for downstream applications
Solution: A graph-based prompting method that integrates the task's semantic-state space and the latest semantic state into the inference process of large multimodal models (LMMs)

Two-step approach: (1) deriving the task’s semantic-state space through LMM-based task planning and (2) using an LMM to estimate the scene’s state within that space
Chain-of-State prompting: A prompting method to guide LMMs in generating task plans by inferring sequential state transitions, inspired by classical planning
Affordance Graph: A structured representation of the semantic-state space, derived by processing the task plan
Scene Graph: A structured representation of the latest semantic state, constructed by processing estimated states
EBNF-formatted prompts: Formal metalanguage-based prompts enabling zero-shot inference in both steps
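
The two steps above can be illustrated with a minimal sketch. It is not the project's implementation: the class and function names (`AffordanceGraph`, `build_affordance_graph`, `ebnf_state_rule`, `build_prompt`, `parse_state`), the state and action labels, and the prompt wording are all hypothetical. The sketch assumes the task plan has already been parsed into (state before, action, state after) triples, reduces the scene graph to a single latest-state label, and leaves the LMM call itself out.

```python
from dataclasses import dataclass, field


@dataclass
class AffordanceGraph:
    """Hypothetical semantic-state space: states are nodes, actions label the edges."""
    # Maps each state to {action: next_state}.
    transitions: dict = field(default_factory=dict)

    def add_transition(self, before, action, after):
        self.transitions.setdefault(before, {})[action] = after
        # Register the target state too, even if it has no outgoing actions.
        self.transitions.setdefault(after, {})

    def states(self):
        return sorted(self.transitions)


def build_affordance_graph(plan):
    """Step 1 (assumed input): plan is a list of (before, action, after) triples."""
    graph = AffordanceGraph()
    for before, action, after in plan:
        graph.add_transition(before, action, after)
    return graph


def ebnf_state_rule(states):
    """Render the allowed states as one EBNF alternation rule."""
    return "state = " + " | ".join(f'"{s}"' for s in states) + " ;"


def build_prompt(graph, latest_state):
    """Step 2: combine the state space and the latest state into a text prompt."""
    return "\n".join([
        "Answer with a single string that matches this grammar:",
        ebnf_state_rule(graph.states()),
        f'Latest estimated state: "{latest_state}"',
        "Which state does the current image show?",
    ])


def parse_state(answer, graph):
    """Accept the model's answer only if it names a state in the graph."""
    candidate = answer.strip().strip('"')
    return candidate if candidate in graph.transitions else None


# Illustrative plan with made-up states and actions.
plan = [
    ("cup_on_table", "pick_up(cup)", "cup_in_hand"),
    ("cup_in_hand", "place(cup, shelf)", "cup_on_shelf"),
]
graph = build_affordance_graph(plan)
print(build_prompt(graph, latest_state="cup_on_table"))
```

In this sketch the grammar lists every state in the space rather than only the successors of the latest state, so an answer that moves backward (for example after a human interruption) is still accepted; the latest state is passed only as context.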

Statistical evaluation: Improvement in semantic-state estimation accuracy by up to 122% and in consistency by 70% on 30 real-world manipulation videos, compared with baselines and ablations
Real-world demonstrations: Validation of the approach's robustness in human-interruption scenarios, showcasing its effectiveness for adaptive task execution

Taxonomy Proposal: Categorized large language model (LLM) applications in robotics by core elements: communication, perception, planning, and control
Communication Section: Analyzed how LLMs enhance robots’ communication capabilities in language understanding (interpretation and grounding) and language generation (task-oriented and non-task communication)
Prompt Guidelines: Provided guidelines and an example of a conversation prompt to achieve interactive grounding

The capability of LLMs to interpret and process formal languages
The necessity of a semantic world model for grounding natural language in LLMs
The importance of auxiliary knowledge to complement the commonsense knowledge in LLMs