Language and Perception for Robotics

Open-Set Semantic-State Estimation with Large Multimodal Models (Work in progress for RA-L 2025)

Problem Formulation

  • Problem: Identifying semantic states of the environment without predefined task domains
  • Challenge: Estimated states must be relevant to the task and consistent enough for downstream applications
  • Solution: A graph-based prompting method that integrates the task’s semantic-state space and the latest semantic state into the inference process of Large Multimodal Models (LMMs)

Approach

  • Two-step approach: (1) Deriving the task’s semantic-state space through LMM-based task planning and (2) Using an LMM to estimate the scene’s state within that space
  • Chain-of-State prompting: A prompting method to guide LMMs in generating task plans by inferring sequential state transitions, inspired by classical planning
  • Affordance Graph: A structured representation of the semantic-state space, derived by processing the task plan
  • Scene Graph: A structured representation of the latest semantic state, constructed by processing the estimated states
  • EBNF-formatted prompts: Formal metalanguage-based prompts enabling zero-shot inference in both steps
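
To make the interplay of these components concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes that a task plan is a simple linear sequence of state labels, that the latest semantic state is a single label (a real scene graph would be richer), and that the prompt restricts the LMM to the latest state and its direct successors. The class AffordanceGraph, the functions ebnf_rule and build_prompt, the prompt wording, and the state labels are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AffordanceGraph:
    """Hypothetical container: nodes are semantic states, edges are allowed transitions."""

    transitions: dict[str, set[str]] = field(default_factory=dict)

    @classmethod
    def from_plan(cls, plan: list[str]) -> "AffordanceGraph":
        """Build the graph from a linear task plan (assumed format: ordered state labels)."""
        graph = cls()
        for state in plan:
            graph.transitions.setdefault(state, set())
        for current, nxt in zip(plan, plan[1:]):
            graph.transitions[current].add(nxt)
        return graph

    def candidates(self, latest: str | None) -> list[str]:
        """Return the latest state plus its direct successors.

        Falls back to the whole state space when the latest state is unknown.
        """
        if latest is None or latest not in self.transitions:
            return sorted(self.transitions)
        return sorted({latest} | self.transitions[latest])


def ebnf_rule(name: str, options: list[str]) -> str:
    """Format the allowed answers as a single EBNF rule of quoted alternatives."""
    alternatives = " | ".join(f'"{option}"' for option in options)
    return f"{name} = {alternatives} ;"


def build_prompt(graph: AffordanceGraph, latest: str | None) -> str:
    """Assemble an illustrative text prompt; the wording is an assumption."""
    rule = ebnf_rule("state", graph.candidates(latest))
    return (
        "Describe the current scene with exactly one string derived from this grammar:\n"
        f"{rule}\n"
        f"Latest known state: {latest or 'unknown'}"
    )


# Hypothetical three-step plan for a tidy-up task.
plan = ["cup_on_table", "cup_in_hand", "cup_in_sink"]
graph = AffordanceGraph.from_plan(plan)
print(build_prompt(graph, "cup_in_hand"))
```

Restricting the candidates to direct successors is a simplification. A setting with human interruptions would likely also need transitions back to earlier states, which this sketch does not model.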

Results

  • Statistical evaluation: Improved semantic-state estimation accuracy by up to 122% and consistency by 70% on 30 real-world manipulation videos, compared to baselines and ablations
  • Real-world demonstrations: Validated the approach’s robustness in human-interruption scenarios, showcasing its effectiveness for adaptive task execution

A Survey on Integration of Large Language Models with Intelligent Robots (ISR 2024)

Contribution

  • Taxonomy Proposal: Categorized large language model (LLM) applications in robotics by core elements: communication, perception, planning, and control
  • Communication Section: Analyzed how LLMs enhance robots’ communication capabilities in language understanding (interpretation and grounding) and language generation (task-oriented and non-task communication)
  • Prompt Guidelines: Provided guidelines and an example of a conversation prompt to achieve interactive grounding

Key Insights

  • The capability of LLMs to interpret and process formal languages
  • The necessity of a semantic world model for grounding natural language in LLMs
  • The importance of auxiliary knowledge to complement the commonsense knowledge in LLMs

