Introduction
“Autonomous agents are still very early. REAL-Agent is our first attempt at measuring what we believe matters most — autonomous task resolution, persistent memory, proactive execution, and security guardrails — across 50 real-world test cases spanning 9 professional roles. We expect this framework to evolve rapidly as the category matures.”
Published by the SureThing team · February 2026
Part 1: Why a New Benchmark?
The Problem with Existing Benchmarks
Current AI benchmarks measure how smart your AI is. We measure how useful your agent is. These are fundamentally different things.
The gap: none of them answers the question a real user actually cares about:
“If I give this AI agent access to my email, calendar, and tools — and walk away — will it actually get useful work done? Correctly? Safely? Without me babysitting?”
What Actually Matters for a Useful Autonomous Agent
We propose four evaluation dimensions:
1. Autonomous Resolution (Base Score)
Not “can the agent reason about step 3?” but “does the task get resolved from intent to result — and how autonomously?”
This is the foundation of the REAL Score. It measures both whether the task was completed and the quality of autonomous execution.
Real-world autonomous resolution requires:
- Account authorization (OAuth flows, API keys)
- Multi-system coordination (read email → check calendar → draft reply)
- Error recovery (API rate limits, auth expiry)
- Non-technical accessibility (the user says “handle my emails” not “call the Gmail API”)
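One of the bullets above, error recovery, is concrete enough to sketch. The snippet below shows exponential backoff with jitter around a rate-limited API call; `RateLimitError`, `with_backoff`, and `flaky_send` are hypothetical names for illustration, not part of any real provider SDK.

```python
import random
import time

class RateLimitError(Exception):
    """Raised by a (hypothetical) API client when the provider returns HTTP 429."""

def with_backoff(call, max_retries=4, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus a small random jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Sleep base_delay, 2*base_delay, 4*base_delay, ... plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Usage: an agent wrapping a send step that fails twice before succeeding.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "sent"
```

An agent that retries like this survives transient 429s without user intervention; auth expiry would be handled the same way, with a token refresh in place of the sleep.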
2. Memory Depth (Multiplier)
Not “can you recall fact X?” but “when you mention a task, does the agent automatically recall everything it knows about doing that task — the context, the preferences, the execution path?”
Three memory dimensions:
- Facts Memory: Client names, project details, deadlines
- Preference Memory: Writing voice, communication style, tone
- Procedure Memory: Remembers successful execution paths
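The three memory dimensions above can be pictured as a tiny store that returns everything relevant to a task in one recall, rather than answering fact-by-fact. This is an illustrative sketch; the class and field names are our own, not part of any agent's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Toy three-tier memory mirroring the dimensions above (illustrative only)."""
    facts: dict = field(default_factory=dict)        # client names, project details, deadlines
    preferences: dict = field(default_factory=dict)  # writing voice, tone
    procedures: dict = field(default_factory=dict)   # task -> previously successful step sequence

    def recall_for_task(self, task: str) -> dict:
        """Pull facts, preferences, and the known execution path for a task in one shot."""
        return {
            "facts": self.facts,
            "preferences": self.preferences,
            "procedure": self.procedures.get(task),
        }

memory = AgentMemory()
memory.facts["client"] = "Acme Corp"
memory.preferences["tone"] = "concise"
memory.procedures["weekly report"] = ["fetch metrics", "draft summary", "email client"]
```

When the user says "do the weekly report", a high-scoring agent behaves like `recall_for_task("weekly report")`: the context, the preferred tone, and the execution path all surface together.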
3. Proactive Agency (Multiplier)
Not “does it answer when asked?” but “does it act without being asked?”
Key characteristics:
- Monitors inbox overnight and flags priority items
- Detects calendar conflicts before you notice
- Follows up on unreplied emails
- Catches billing anomalies, renewal deadlines
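To make one of these behaviors concrete, here is a minimal sketch of calendar-conflict detection: sort events by start time and compare adjacent pairs. The event dicts and field names are assumptions for illustration, not a real calendar API.

```python
from datetime import datetime, timedelta

def find_calendar_conflicts(events):
    """Flag overlapping events by comparing adjacent pairs after sorting by start time."""
    events = sorted(events, key=lambda e: e["start"])
    conflicts = []
    for a, b in zip(events, events[1:]):
        if b["start"] < a["end"]:  # next event begins before the previous one ends
            conflicts.append((a["title"], b["title"]))
    return conflicts

day = datetime(2026, 2, 2)
events = [
    {"title": "Standup", "start": day + timedelta(hours=9), "end": day + timedelta(hours=10)},
    {"title": "Client call", "start": day + timedelta(hours=9, minutes=30),
     "end": day + timedelta(hours=10, minutes=30)},
    {"title": "Lunch", "start": day + timedelta(hours=12), "end": day + timedelta(hours=13)},
]
```

A proactive agent runs a check like this on a schedule and surfaces the conflict before the user opens their calendar; the same loop shape fits unreplied-email follow-ups and renewal-deadline alerts.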
4. Security & Guardrails (Multiplier)
Not “is the model safe?” but “is the agent's execution environment safe?”
Essential features:
- Sandboxed execution: All actions run in isolated environments
- OAuth-based account access: Standard authorization
- Human-in-the-loop by default: Irreversible actions require approval
- No arbitrary code execution: the agent cannot run unrestricted code on the user's machine, unlike computer-use agents
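The human-in-the-loop default above reduces to a simple gate: irreversible actions route through an approval callback before they execute. The action names and policy below are hypothetical, chosen only to show the shape of the control.

```python
# Actions the agent may not perform without sign-off (illustrative list).
IRREVERSIBLE = {"send_email", "delete_file", "make_payment"}

def execute(action, payload, approve):
    """Run an action, routing irreversible ones through a human approval callback."""
    if action in IRREVERSIBLE and not approve(action, payload):
        return {"status": "blocked", "action": action}
    return {"status": "executed", "action": action}

# A reviewer that auto-approves everything except payments of $100 or more.
def reviewer(action, payload):
    return action != "make_payment" or payload.get("amount", 0) < 100
```

Read-only actions pass straight through; anything on the irreversible list waits for a yes. In production the callback would surface a prompt to the user rather than apply a hard-coded policy.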
Part 2: REAL Score — Scoring Framework
The REAL Score Formula
REAL Score = Autonomous Resolution × (Memory Depth + Proactive Agency + Security & Guardrails)
This is a multiplicative model — not additive.
- Max per test case: 5 × (5 + 5 + 5) = 75
- REAL Score (%) = AVERAGE(per-test-case scores) / 75 × 100
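The formula above can be written out directly. This sketch computes the per-test-case score and the benchmark-level percentage; each input is a 0-5 rubric score as defined in Part 2.

```python
def real_score(resolution, memory, proactive, security):
    """Per-test-case REAL Score: base score times the sum of the three multipliers (max 75)."""
    for v in (resolution, memory, proactive, security):
        assert 0 <= v <= 5, "each dimension is scored 0-5"
    return resolution * (memory + proactive + security)

def real_score_pct(cases):
    """Benchmark-level REAL Score (%): average per-case score over the 75-point maximum."""
    scores = [real_score(*c) for c in cases]
    return sum(scores) / len(scores) / 75 * 100
```

Note that a zero in Autonomous Resolution zeroes the whole test case, whatever the multipliers are; that asymmetry is the point of the multiplicative design discussed next.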
Why Multiplicative?
Because Autonomous Resolution is a factor, an agent that resolves nothing scores zero no matter how safe it is, and weak multipliers drag down even strong task completion. An agent that can complete tasks but forgets context, never acts proactively, and has no safety guardrails isn't useful — it's dangerous.
Dimension Scoring Rubrics
Each dimension is scored 0–5. The base score (Autonomous Resolution) determines WHETHER value was delivered. The multipliers (Memory, Proactivity, Security) determine HOW WELL.
Autonomous Resolution (Base Score, 0–5)
Measures the degree to which an agent independently resolves a task from intent to outcome — both task completion AND quality of the autonomous process.
Memory Depth (Multiplier, 0–5)
Measures the agent's ability to retain, recall, and apply context from past interactions — facts, preferences, and procedures.
Proactive Agency (Multiplier, 0–5)
Measures the agent's ability to take initiative — acting without being asked, anticipating needs, and executing preemptively.
Security & Guardrails (Multiplier, 0–5)
Measures the safety and trustworthiness of the agent's execution environment — data isolation, permission enforcement, HITL controls, and error handling.
Part 3: Scorecard — Comparative Results
SureThing leads the overall REAL Score at 59.3%, driven by the highest Memory Depth (84.8%) and stronger Security & Guardrails (80.8%) among autonomous agents — reflecting deep contextual retention and enterprise-grade safety controls.
OpenClaw scores highest on raw Autonomous Resolution (83.6%) thanks to its open-source extensibility, but falls behind on memory and security, landing at 51.9% overall.
ChatGPT achieves a perfect Security score (100%) through its conservative, non-autonomous design, but scores lowest overall (15.1%) due to limited real-world task execution, minimal memory, and near-zero proactive agency.
The multiplicative scoring model reveals a key insight: safety without capability collapses to a near-zero score, and capability without memory, proactivity, or safety is heavily discounted — only agents that balance all four dimensions achieve meaningful REAL Scores.
| Metric | SureThing | ChatGPT | OpenClaw |
|---|---|---|---|
| Autonomous Resolution | 77.6 | 28.4 | 83.6 |
| Memory Depth | 84.8 | 40.0 | 60.0 |
| Proactive Agency | 65.2 | 20.0 | 65.6 |
| Security & Guardrails | 80.8 | 100.0 | 60.8 |
| REAL Score (Weighted Avg) | 59.3 | 15.1 | 51.9 |
*Note: All dimension scores are normalized to a 100-point scale, representing the percentage of points achieved relative to the maximum rubric score for each dimension.*
Part 4: Test Cases Dataset
50 test cases across 4 dimensions and 9 professional roles. Each test case defines the scenario, instruction, and success criteria.
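A test case as described above (scenario, instruction, success criteria, plus a role and a target dimension) might be recorded like this. The field names and the example record are our own assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """Minimal sketch of a REAL-Agent test case record (field names are assumptions)."""
    case_id: str
    role: str                  # one of the 9 professional roles
    dimension: str             # which of the 4 dimensions it primarily stresses
    scenario: str              # the situation the agent is dropped into
    instruction: str           # what the user actually says
    success_criteria: list[str]

# Hypothetical example record, not drawn from the real dataset.
tc = TestCase(
    case_id="EA-01",
    role="Executive Assistant",
    dimension="Autonomous Resolution",
    scenario="Inbox contains a meeting request that conflicts with an existing event.",
    instruction="Handle my meeting requests for this week.",
    success_criteria=["Conflict detected", "Reschedule proposed", "Reply drafted"],
)
```

Keeping success criteria as an explicit checklist makes grading reproducible: each criterion maps onto the 0-5 rubrics in Part 2.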
Part 5: An Open Invitation to Co-Build
Autonomous agents are a nascent category — we're all figuring this out together. REAL-Agent is far from definitive; it's a starting point, a first draft of what “useful” might look like for agents in the real world. We fully expect this framework to be challenged, extended, and outgrown.
If you're building an autonomous agent, we'd love for you to run these tests and publish your results — we'll link to them. If you're running a benchmarking platform, a research lab, or simply passionate about evaluation methodology, we'd welcome the collaboration. The more perspectives shaping this framework, the better it becomes for everyone.
We believe this category will grow fast — and the right benchmarks will accelerate that growth, pushing all of us to build better products. Let's iterate on this together.