Seizing Serendipity:
Exploiting the Value of Past Success in Off-Policy Actor-Critic

Anonymous Author(s)


Learning high-quality $Q$-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous works primarily focus on addressing the value overestimation issue, an outcome of adopting function approximators and off-policy learning. Deviating from the common viewpoint, we observe that $Q$-values are often underestimated in the latter stage of the RL training process, potentially hindering policy learning and reducing sample efficiency. We find that this long-neglected phenomenon is often related to the use of inferior actions from the current policy in Bellman updates, as compared to the more optimal action samples in the replay buffer. To address this issue, our insight is to incorporate sufficient exploitation of past successes while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates the $Q$-value using both historical best-performing actions and the current policy. Based on BEE, the resulting practical algorithm BAC outperforms state-of-the-art methods on over 50 continuous control tasks and achieves strong performance in failure-prone scenarios and real-world robot tasks.
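The idea above can be sketched as a convex blend of two Bellman backups: an exploitation backup that bootstraps from the value of the best-performing replay-buffer action at the next state, and a standard exploration backup that bootstraps from the current policy's action. The following is a minimal sketch, not the paper's exact formulation; the function name and the blending weight `lam` are illustrative.

```python
def blended_bellman_target(r, gamma, q_exploit, q_explore, lam, done):
    """Sketch of a Blended Exploitation and Exploration (BEE) target.

    r          -- reward for the current transition
    gamma      -- discount factor
    q_exploit  -- Q-estimate at the next state for the best action
                  found in the replay buffer (exploitation backup)
    q_explore  -- Q-estimate at the next state for the current
                  policy's action (standard exploration backup)
    lam        -- blending weight in [0, 1]; an assumed hyperparameter
    done       -- 1.0 if the transition is terminal, else 0.0
    """
    # Convex combination of the two bootstrap values.
    q_next = lam * q_exploit + (1.0 - lam) * q_explore
    # Standard one-step Bellman target, masking terminal states.
    return r + gamma * (1.0 - done) * q_next
```

For example, with `r=1.0`, `gamma=0.99`, `q_exploit=2.0`, `q_explore=1.0`, `lam=0.5`, and `done=0.0`, the target is `1.0 + 0.99 * 1.5 = 2.485`. Setting `lam=0` recovers the usual actor-critic backup, while `lam=1` bootstraps purely from past successes.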

Real-World Validation

Real-world validation of BAC on a cost-effective D'Kitty robot tasked with traversing complex terrains and reaching goal points. BAC outperforms both TD3 and SAC, mastering a stable, natural gait; TD3 adopts a low stance resembling a knee walk, whereas SAC exhibits more oscillatory gait patterns.

Terrains: Smooth Road, Rough Stone Road, Uphill Stone Road





BAC excels in the complex DogRun task with a 38-dimensional continuous action space, a challenge that stumps other methods. This is, to the best of our knowledge, the first documented result of a model-free method effectively tackling the challenging Dog tasks.



In the tough HumanoidStandup task, BAC achieves 2.1x the evaluation score of the strongest baseline, scoring ~280,000 at 2.5M steps and ~360,000 at 5M steps. BAC stands strong while others falter: SAC wobbles, DAC sits, and RRS rolls.


Behavior Visualizations of BAC in a Variety of Benchmark Tasks

Benchmark Results

We evaluate the BEE operator on over 50 diverse benchmark tasks from MuJoCo, DMControl, Meta-World, Adroit, MyoSuite, ManiSkill2, and Shadow Dexterous Hand. It excels in both locomotion and manipulation tasks. As a versatile plugin, it seamlessly enhances various policy optimization methods, shining in both model-based and model-free paradigms.

BAC in MuJoCo Benchmarks

We evaluate BAC on 4 continuous control tasks from MuJoCo. BAC outperforms all baselines in both final performance and sample efficiency.

MB-BAC in MuJoCo Benchmarks

We evaluate MB-BAC on 4 MuJoCo continuous control tasks. MB-BAC learns faster than MBRL baselines and achieves asymptotic performance comparable to model-free methods.

BAC & BEE-TD3 in DMControl Benchmark Tasks

We benchmark BAC and its variant, BEE-TD3, on 15 continuous control tasks from DMControl. BAC successfully solves many challenging tasks like HumanoidRun, DogWalk, and DogRun, where both SAC and TD3 fail. Also, BEE-TD3 boosts TD3's performance by a large margin.

BAC in Meta-World Benchmark Tasks

We also benchmark BAC on 14 goal-conditioned manipulation tasks from Meta-World. BAC consistently outperforms SAC and TD3 across all tasks.

BAC in Adroit Benchmark Tasks

BAC consistently outperforms the baselines on 3 dexterous hand manipulation tasks from Adroit.

BAC in MyoSuite Benchmark Tasks

Notably, BAC shines on 5 challenging muscle-and-bone control tasks from MyoSuite.

BAC in ManiSkill2 Benchmark Tasks

BAC outperforms the baselines on 5 manipulation tasks from ManiSkill2.

BAC in Shadow Dexterous Hand Benchmark Tasks

BAC consistently outperforms the baselines on 3 dexterous hand manipulation tasks from the Shadow Dexterous Hand suite.