Vision-Language-Action Models (VLA): The Cognitive Layer
We have a body (ROS 2), a world (Gazebo), and a brain that can navigate (Nav2). Now we need a Mind—something that understands intent.
The Semantic Gap
- Traditional Robotics: Knows "Move to x=5, y=10". Does NOT know "Go to the kitchen."
- LLMs (Large Language Models): Know what a "kitchen" is, and that "food" is usually found there.
Bridging this gap is the goal of VLA.

sequenceDiagram
User->>+Whisper: "Find me a snack"
Whisper-->>-LLM: Text: "Find me a snack"
LLM->>+Scene: What objects do I see?
Scene-->>-LLM: [Apple, Book, Cup]
LLM->>LLM: Reasoning: Snack = Apple
LLM->>+Nav2: navigate_to("Apple")
Nav2-->>-LLM: Arrived
LLM->>Arm: pick_up("Apple")
Voice as an Interface
We start with speech. OpenAI Whisper is a widely used open-source model for Speech-to-Text (ASR), and its smaller checkpoints transcribe in near real time on a GPU.
import whisper

model = whisper.load_model("base")        # small multilingual checkpoint
result = model.transcribe("audio.mp3")    # language is auto-detected
print(result["text"])  # "Go to the kitchen and find me a snack."
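To connect this to ROS 2, one option is a small bridge node that transcribes the audio and publishes the text on /voice_command, the topic our Cognitive Node will subscribe to later. A minimal sketch, assuming a pre-recorded file; the node and topic names are illustrative:

# voice_node.py -- minimal sketch: transcribe a file, publish text on /voice_command
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import whisper

class VoiceNode(Node):
    def __init__(self):
        super().__init__('voice_node')
        self.pub = self.create_publisher(String, '/voice_command', 10)
        self.model = whisper.load_model("base")

    def transcribe_and_publish(self, audio_path):
        text = self.model.transcribe(audio_path)["text"]
        self.pub.publish(String(data=text))
        self.get_logger().info(f"Heard: {text}")

def main():
    rclpy.init()
    node = VoiceNode()
    node.transcribe_and_publish("audio.mp3")
    rclpy.shutdown()

if __name__ == '__main__':
    main()

A live system would swap the file for a microphone stream, but the publishing side stays the same.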
From Text to ROS Actions
How do we turn "Find me a snack" into cmd_vel?
1. The Prompt Engineering Approach
We can give an LLM (like GPT-4o or Llama 3) a list of available ROS Actions as tools.
System Prompt:
You are a robot assistant. You have the following tools:
- navigate_to(location_name)
- scan_for_objects()
- pick_up(object_name)
The user says: "Find me a snack." Plan a sequence of actions.
LLM Output:
navigate_to("kitchen")scan_for_objects()pick_up("apple")
2. End-to-End VLA Models (RT-2, PaLM-E)
Google's RT-2 (Robotics Transformer 2) is a model trained on both internet-scale text/image data AND robot trajectory data. It outputs robot actions directly, as discretized action tokens, instead of just text.
- Input: Image of a desk + Text "Pick up the extinct animal".
- Output: The robot arm moves to the plastic dinosaur.
This is "Embodied Reasoning". The model understands "extinct animal" = "dinosaur" and connects it to the visual pixels of the dinosaur toy.
Implementation Details
We will build a simple "Cognitive Node" in Python.
- Listener: Subscribes to /voice_command.
- Thinker: Sends text to an LLM API (OpenAI, or a local Llama via Ollama).
- Executor: Parses the LLM's JSON response and calls ROS 2 Action Servers.
# executor_node.py
def execute_plan(plan):
    # 'plan' is the JSON list returned by the LLM, e.g.
    # [{"action": "navigate", "target": "kitchen"}, {"action": "pick", "object": "apple"}]
    # nav_client / manipulation_client are action clients created elsewhere in the node.
    for step in plan:
        if step['action'] == 'navigate':
            nav_client.send_goal(step['target'])
        elif step['action'] == 'pick':
            manipulation_client.send_goal(step['object'])
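Wiring the three roles together could look like the sketch below. The plan() and execute_plan() helpers come from the earlier sketches, and all node and module names are illustrative:

# cognitive_node.py -- minimal sketch tying Listener, Thinker and Executor together
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from planner import plan                  # Thinker (LLM call, sketched earlier)
from executor_node import execute_plan    # Executor (above)

class CognitiveNode(Node):
    def __init__(self):
        super().__init__('cognitive_node')
        # Listener: transcribed speech arrives on /voice_command
        self.create_subscription(String, '/voice_command', self.on_command, 10)

    def on_command(self, msg):
        steps = plan(msg.data)   # Thinker: text -> JSON plan
        execute_plan(steps)      # Executor: plan -> ROS 2 action goals

def main():
    rclpy.init()
    rclpy.spin(CognitiveNode())

if __name__ == '__main__':
    main()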
Safety and Hallucination
LLMs hallucinate. You do NOT want a robot to hallucinate a command like "Jump out the window."
- Guardrails: Code layers that sanity-check the LLM's output.
- Grounding: Verifying that the object "apple" actually exists in the visual scene before trying to grab it; a minimal check is sketched below.
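A minimal guardrail/grounding sketch, assuming the plan format used above and a list of objects reported by the perception stack (function and variable names are illustrative):

# guardrails.py -- reject plans the robot cannot safely or physically execute
ALLOWED_ACTIONS = {'navigate', 'pick'}

def validate_plan(plan, visible_objects):
    for step in plan:
        # Guardrail: only actions we explicitly expose to the LLM are allowed
        if step.get('action') not in ALLOWED_ACTIONS:
            raise ValueError(f"Unknown action rejected: {step}")
        # Grounding: a 'pick' target must have been seen by the perception stack
        if step['action'] == 'pick' and step.get('object') not in visible_objects:
            raise ValueError(f"Object '{step.get('object')}' not found in the scene")
    return plan

In the cognitive node, the call then becomes execute_plan(validate_plan(steps, scene_objects)), so nothing reaches the actuators without passing both checks.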
In the capstone, we will put this all together.