Micheal's Substack
The Agentic Loop
MemRL and the Emerging Runtime-Learning Stack

AI That Learns Without Changing

Executive Summary

The landscape of Artificial Intelligence is shifting from static, “train-once-deploy-once” models toward systems capable of continuous adaptation and “runtime learning.” This evolution is characterized by two primary architectural approaches: model-memory decoupling (exemplified by MemRL), which keeps core model weights frozen while optimizing external memory retrieval, and asynchronous weight-updating RL (exemplified by AReaL), which focuses on scalable infrastructure for continuous model training.

Recent research demonstrates that treating memory retrieval as a value-driven decision process significantly improves agent performance across multi-step environments. However, this technical progress coincides with a tightening regulatory environment—specifically the EU AI Act—and increasing operational scrutiny. As agentic AI projects face high cancellation risks due to cost and governance concerns, the field is consolidating around formal research frameworks at major venues like ICLR 2026, aiming to bridge the gap between experimental self-improvement loops and reliable, compliant production systems.

--------------------------------------------------------------------------------

The MemRL Framework: Model-Memory Decoupling

MemRL (Memory-based Reinforcement Learning) proposes a fundamental shift in how AI agents learn post-deployment. The core strategy is model-memory decoupling, which maintains the stability of the backbone Large Language Model (LLM) while allowing the system’s behavior to remain plastic through an external episodic memory.

Core Architecture

The MemRL system organizes memory into Intent–Experience–Utility triplets:

  • Intent: The task or query embedding.

  • Experience: The stored trajectory or solution trace, often summarized for reuse.

  • Utility: A learned value signal (Q-value) based on historical success.

Two-Phase Retrieval Process

Unlike passive similarity lookups, MemRL treats retrieval as an “active decision” to filter out semantic distractors that lead to failure:

  1. Coarse Semantic Filter: Shortlists potential memory candidates based on similarity.

  2. Value-Aware Selection: Uses learned utilities to decide which specific experiences to inject into the LLM context.
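
The two phases above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the triplet layout (intent embedding, experience, utility) follows the paper's description, but the function names, shortlist sizes, and toy embeddings are assumptions.

```python
import math

def cosine(a, b):
    # Plain cosine similarity over raw embedding lists.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, memory, k_coarse=2, k_final=1):
    """Phase 1: coarse semantic filter. Phase 2: value-aware selection."""
    # Phase 1: shortlist by embedding similarity alone.
    shortlist = sorted(memory, key=lambda m: cosine(query_emb, m["intent"]),
                       reverse=True)[:k_coarse]
    # Phase 2: among semantically plausible candidates, prefer the ones
    # whose learned utility (historical success) is highest.
    return sorted(shortlist, key=lambda m: m["utility"], reverse=True)[:k_final]

memory = [
    {"intent": [1.0, 0.0], "experience": "trace A (often failed)", "utility": 0.2},
    {"intent": [0.9, 0.1], "experience": "trace B (reliable)",     "utility": 0.8},
    {"intent": [0.0, 1.0], "experience": "trace C (off-topic)",    "utility": 0.9},
]

selected = retrieve([1.0, 0.0], memory)
# Trace A is the closest semantic match (a "semantic distractor") but is
# filtered out in phase 2 because its utility is low; the high-utility but
# off-topic trace C never survives phase 1.
```

The point of the toy data: a pure similarity lookup would inject the historically unreliable trace A, while a pure utility ranking would inject the irrelevant trace C. Only the two-phase combination selects trace B.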

Performance Benchmarks

MemRL reports a consistent improvement over traditional memory baselines across multiple benchmarks, including BigCodeBench, Lifelong Agent Bench, ALFWorld, and HLE.

Metric                                   Baseline (MemP)   MemRL
Average Cumulative Success Rate (CSR)    0.760             0.798
ALFWorld Success Rate (GPT-4o-mini)      0.299             0.440
ALFWorld CSR (GPT-4o-mini)               N/A               0.680

The authors emphasize that MemRL’s primary advantage is making the retrieval policy “less noisy” over time, leading to smoother cumulative improvements and fewer performance regressions.

--------------------------------------------------------------------------------

The Lineage of Runtime Learning

MemRL represents a convergence of several research strands focused on learning without weight updates. These systems use different abstractions to manage long-term agent behavior.

  • Memento: Frames continual learning as a memory-augmented MDP, adapting the agent's policy through memory rather than through gradient updates to model weights. It achieved an 87.88% Pass@3 on the GAIA validation set.

  • Reflexion: An episodic buffer approach that stores self-reflective text. It reported a 91% pass@1 on HumanEval by iteratively retrying tasks with stored reflections.

  • Voyager: A non-parametric approach for embodied agents that builds an executable “skill library” (code). It emphasizes compositional reuse and interpretability.

  • ExpeL: Focuses on extracting reusable natural language “insights” from experience, specifically designed for API-only models that cannot be fine-tuned.

  • MemoryBank: Addresses the operational problem of memory relevance over time by incorporating a “forgetting curve” updating mechanism.
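
MemoryBank's "forgetting curve" idea can be sketched with an Ebbaus-style exponential decay in which retrieval reinforces an entry and resets its decay clock. This is a minimal sketch under that assumption; the decay law, reinforcement increment, and class names here are illustrative, not MemoryBank's published implementation.

```python
import math

def retention(strength, elapsed):
    """R = exp(-t / S): higher memory strength S means slower forgetting."""
    return math.exp(-elapsed / strength)

class MemoryEntry:
    def __init__(self, text, strength=1.0):
        self.text = text
        self.strength = strength
        self.last_access = 0.0

    def score(self, now):
        # Relevance decays with time since the entry was last used.
        return retention(self.strength, now - self.last_access)

    def recall(self, now):
        # Retrieval reinforces the memory and resets the decay clock,
        # so frequently used entries fade more slowly.
        self.strength += 1.0
        self.last_access = now

entry = MemoryEntry("user prefers concise answers")
fresh = entry.score(now=0.0)      # just stored: full relevance
stale = entry.score(now=3.0)      # decayed after idle time
entry.recall(now=3.0)
refreshed = entry.score(now=3.0)  # restored, and now decays more slowly
```

The design choice this models: relevance is a function of both age and usage history, so stale, never-reused memories gradually drop out of retrieval instead of accumulating forever.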

--------------------------------------------------------------------------------

AReaL: Asynchronous Weight-Update Infrastructure

While MemRL leaves the backbone fixed, the AReaL (Asynchronous Reinforcement Learning) project provides the infrastructure for continuous or repeated weight updates in agentic models.

Key Technical Features

  • Asynchronous Training: Developed by Tsinghua University and Ant Group, AReaL is a fully asynchronous RL training system built on ReaLHF.

  • Archon Engine: A PyTorch-native “5D parallel training engine” designed to maximize scaling and throughput.

  • OpenAI-Compatible Proxy: Allows existing agents built on LangChain, CAMEL-AI, or the OpenAI SDK to connect to an RL training loop by simply replacing a base_url.
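
The proxy pattern described above amounts to a one-argument change on the agent side. The sketch below uses a stand-in client class rather than the real OpenAI SDK, and the localhost proxy address is hypothetical; the actual endpoint would come from an AReaL deployment.

```python
# An agent keeps its existing OpenAI-style client code; only the base_url
# changes, so requests flow through the RL training loop instead of the
# public API.

PUBLIC_API = "https://api.openai.com/v1"
AREAL_PROXY = "http://localhost:8000/v1"   # hypothetical local AReaL proxy

class ChatClient:
    """Stand-in for an OpenAI-SDK-style client (base_url + api_key)."""
    def __init__(self, base_url, api_key="dummy"):
        self.base_url = base_url
        self.api_key = api_key

    def endpoint(self, path="chat/completions"):
        # All requests are routed relative to whatever base_url was given.
        return f"{self.base_url.rstrip('/')}/{path}"

# Production agent pointed at the public API.
prod = ChatClient(base_url=PUBLIC_API)
# The same agent code, rerouted into the RL loop by swapping one argument.
training = ChatClient(base_url=AREAL_PROXY)
```

Because the interception happens at the transport level, frameworks like LangChain or CAMEL-AI need no code changes beyond the configured base URL.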

Significant Claims

The AReaL repository reports the existence of AReaL-SEA, a self-evolving data synthesis engine. According to self-reported data, a 235B MoE model trained with this system “surpasses GPT-5” and is comparable to “Gemini 3.0 Pro” on certain benchmarks.

--------------------------------------------------------------------------------

Field Consolidation and Institutionalization

The transition of “runtime learning” from an engineering trick to a formal research discipline is marked by major academic events in early 2026.

ICLR 2026 Workshops

Dedicated venues at ICLR 2026 in Rio de Janeiro (April 26–27) highlight the shift toward lifelong learning:

  • Lifelong Agents Workshop: Focuses on adaptation, alignment, and evolution under real-world constraints. Featured speakers include Sergey Levine (UC Berkeley) and Azalia Mirhoseini (Stanford/Google DeepMind).

  • Recursive Self-Improvement (RSI) Workshop: Moves RSI from thought experiments to deployed systems, prioritizing the governance and evaluation of self-improvement loops.

--------------------------------------------------------------------------------

Governance, Regulation, and Operational Reality

As systems gain the ability to learn post-deployment, they fall under stricter regulatory and operational requirements.

The EU AI Act

The EU AI Act introduces specific mandates for “adaptive” systems:

  • Definition of Adaptiveness: Refers to self-learning capabilities that allow a system to change while in use.

  • Feedback Loop Management (Article 15(4)): Requires high-risk systems to reduce the risk of biased outputs becoming inputs for future operations.

  • Post-Market Monitoring (Article 72): Mandates a systematic plan to collect and analyze performance data throughout the system’s lifetime to ensure continuous compliance.

  • Timeline: The Act entered into force on August 1, 2024, and becomes fully applicable on August 2, 2026.

Operational Risks and Model Drift

The necessity for runtime learning is driven in part by the reality of model drift. Research comparing snapshots of GPT-4 between March and June 2023 showed:

  • Prime/composite identification dropped from 84% to 51%.

  • Significant reductions in compliance with chain-of-thought prompting.

  • Increased formatting mistakes in code generation.
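
A minimal drift check built from the prime/composite figures cited above: compare a current snapshot's accuracy against a baseline and alert when the drop exceeds a tolerance. The 5-point threshold is an illustrative assumption; a production monitor would use a proper statistical test over many metrics.

```python
def drift_alert(baseline_acc, current_acc, threshold=0.05):
    """Flag when accuracy drops more than `threshold` versus the baseline."""
    return (baseline_acc - current_acc) > threshold

# GPT-4 prime/composite identification, March vs. June 2023 snapshots
# (figures as reported in the drift study cited above).
march_acc, june_acc = 0.84, 0.51

alert = drift_alert(march_acc, june_acc)  # a 33-point drop trips the alarm
```

This is the kind of lightweight check that EU AI Act post-market monitoring (Article 72) effectively requires operators to run continuously rather than ad hoc.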

Market and Liability Pressure

  • Legal Accountability: A 2024 Canadian tribunal ruling in the Air Canada chatbot case established that companies are responsible for misinformation provided by their automated agents.

  • Project Viability: Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.

  • Risk Frameworks: Organizations are increasingly adopting the NIST AI Risk Management Framework (Govern, Map, Measure, Manage) to provide a “translation layer” between research prototypes and deployable systems.
