Spotlight video showcasing our multi-critic approach for whole-body end-effector control

Abstract

Learning whole-body control for locomotion and arm motions in a single policy is challenging because the two tasks have conflicting goals. For instance, efficient locomotion typically favors a horizontal base orientation, while end-effector tracking may benefit from tilting the base to extend reachability. Additionally, current Reinforcement Learning (RL) approaches that use a pose-based task specification cannot directly control the end-effector velocity, which makes it very challenging to execute trajectories smoothly. To address these limitations, we propose an RL-based framework that allows for dynamic, velocity-aware whole-body end-effector control. Our method introduces a multi-critic actor architecture that decouples the reward signals for locomotion and manipulation, simplifying reward tuning and allowing the policy to resolve task conflicts more effectively. Furthermore, we design a twist-based end-effector task formulation that can track both discrete poses and motion trajectories. We validate our approach through a set of simulation and hardware experiments using a quadruped robot equipped with a robotic arm. The resulting controller can simultaneously walk and move its end-effector, and it exhibits emergent whole-body behaviors in which the base assists the arm in extending the workspace, despite the lack of an explicit formulation for this coordination.
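For intuition, a twist-based task specification can subsume pose tracking: a discrete goal pose is converted into a commanded twist via feedback on the pose error, while a trajectory directly supplies its velocity profile. The sketch below illustrates this idea; it is not the paper's exact formulation, and the gains, limits, and function name are our own assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_twist(p_cur, q_cur, p_des, q_des,
                  kp_lin=1.0, kp_ang=1.0, v_max=0.5, w_max=1.0):
    """Convert a pose error into a commanded end-effector twist (sketch).

    A discrete goal pose becomes a twist command via proportional feedback;
    a trajectory would instead supply (v, w) from its velocity profile.
    Gains and limits are illustrative assumptions, not values from the paper.
    Quaternions are in (x, y, z, w) order, as used by scipy.
    """
    # Linear part: proportional feedback on the position error.
    v = kp_lin * (p_des - p_cur)
    # Angular part: rotation error expressed as a rotation vector (axis * angle).
    rot_err = R.from_quat(q_des) * R.from_quat(q_cur).inv()
    w = kp_ang * rot_err.as_rotvec()
    # Saturate so commands stay within the range the policy was trained on.
    v = v * min(1.0, v_max / (np.linalg.norm(v) + 1e-9))
    w = w * min(1.0, w_max / (np.linalg.norm(w) + 1e-9))
    return v, w
```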

Model Architecture

Given a randomly sampled start and goal pose, the command generator provides the desired end-effector and base twist, along with the desired foot height, as commands to the policy. Rewards are grouped by category and consumed by separate critics, yielding individual value functions. The advantage is estimated per critic, normalized, and summed to form the total advantage. The teacher policy is then optimized with Proximal Policy Optimization (PPO).
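As a concrete sketch of the advantage aggregation described above, the snippet below estimates one advantage per critic, normalizes each, and sums them before the PPO update. It is a minimal illustration under our own assumptions (GAE for advantage estimation, per-batch normalization); tensor shapes, dictionary keys, and hyperparameters are hypothetical.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one critic.
    rewards, dones: tensors of shape (T, num_envs);
    values: shape (T + 1, num_envs), with one extra row for bootstrapping."""
    T = rewards.shape[0]
    adv = torch.zeros_like(rewards)
    last = torch.zeros(rewards.shape[1])
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last = delta + gamma * lam * not_done * last
        adv[t] = last
    return adv

def total_advantage(reward_groups, value_preds, dones):
    """Per-critic GAE, normalized per critic, then summed.
    reward_groups / value_preds: dicts keyed by reward category,
    e.g. {"locomotion": ..., "manipulation": ...} (keys are assumptions)."""
    total = 0.0
    for key in reward_groups:
        adv = gae(reward_groups[key], value_preds[key], dones)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # normalize per critic
        total = total + adv
    return total  # used as the advantage in the PPO surrogate loss
```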

Experimental Results

We evaluate our approach in both simulation and on a real quadruped robot with a robotic arm. The policy is trained in simulation using the proposed multi-critic architecture and twist-based end-effector control. We assess the performance of the learned controller in various tasks, including static pose tracking, dynamic trajectory following, and obstacle negotiation. The results demonstrate that our method achieves superior end-effector tracking accuracy while maintaining stable locomotion compared to baseline approaches. Additionally, we observe emergent behaviors where the robot utilizes its base to assist the arm in reaching targets beyond its nominal workspace.

Tracking Performance of Different Trajectory Types on Hardware

To assess the tracking performance of the policy, we evaluate its ability to track the following trajectory types (a minimal trajectory-generation sketch follows the list):

  • Straight line: The policy is commanded to move 30 cm along each axis.
  • Circle: The policy is commanded to move in a circle on the YZ plane with a radius of 20 cm at a fixed distance in front of the robot.
  • Semicircle around robot: To evaluate trajectories over an extended workspace, the policy is commanded to follow a semicircular trajectory with a 60 cm radius around the robot.
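The sketch below shows how such a reference could be generated as paired position and linear-velocity (twist) commands, here for the circle case; the sampling rate, speed, and frame conventions are our own assumptions rather than values from the paper.

```python
import numpy as np

def circle_trajectory(center, radius=0.20, speed=0.1, dt=0.02):
    """Circular reference in the YZ plane: positions plus matching
    linear-velocity commands. All numeric values are illustrative."""
    circumference = 2 * np.pi * radius
    n = int(circumference / (speed * dt))   # samples for one full revolution
    theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    omega = speed / radius                  # angular rate along the circle
    pos = center + radius * np.stack(
        [np.zeros_like(theta), np.cos(theta), np.sin(theta)], axis=-1)
    vel = radius * omega * np.stack(
        [np.zeros_like(theta), -np.sin(theta), np.cos(theta)], axis=-1)
    return pos, vel  # feed (pos, vel) to the policy as twist-based commands
```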

Hardware tracking videos: each trajectory type (straight line, circle, semicircle) is shown at slow, medium, and fast commanded velocities.

Loco-manipulation

We demonstrate precise end-effector control during simultaneous locomotion by tracking an end-effector trajectory while the robot walks.

Chicken head tracking

We demonstrate precise end-effector stabilization during locomotion with a chicken head control task, in which the end-effector is held fixed in the world frame while the base moves underneath it.
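For intuition, holding the end-effector fixed in the world frame means the commanded end-effector twist must cancel the velocity the base induces at the gripper. Below is a minimal sketch of that compensation under our own frame conventions (all quantities expressed in the world frame); it is an illustration, not the paper's controller.

```python
import numpy as np

def chicken_head_twist(v_base, w_base, r_base_to_ee):
    """Twist command that keeps the end-effector fixed in the world frame.

    v_base, w_base: base linear/angular velocity in the world frame.
    r_base_to_ee: vector from the base to the end-effector, world frame.
    A point rigidly attached to the base moves at v_base + w_base x r,
    so the command simply cancels that motion. Frames are assumptions."""
    v_cmd = -(v_base + np.cross(w_base, r_base_to_ee))
    w_cmd = -w_base
    return v_cmd, w_cmd
```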

BibTeX

@inproceedings{vijayan2025multi,
  author    = {Vijayan, Aravind Elanjimattathil and Cramariuc, Andrei and Risiglione, Mattia and Gehring, Christian and Hutter, Marco},
  title     = {Multi-critic Learning for Whole-body End-effector Twist Tracking},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2025},
}