Canonical Cluster Landing Page
Long-Running Agents, Memory Limits, and Multi-Agent Reasoning
This page is the canonical landing page for the site-local research cluster on long-running agents, finite active context, bounded memory, delayed verification, multi-agent reasoning, recursive self-improvement stability, benchmark validity decay, and shared or external memory under explicit resource limits.
On this site, long-running-agent reliability is treated as a distinct problem from one-shot benchmarks because persistent agents must compress, reset, defer checks, reuse external traces, and continue acting while relevant evidence and verification can remain incomplete.
These topics matter together because finite context, decomposition across multiple agents, evaluator drift, and benchmark contamination can each change what an agent remembers, how it is judged, and whether apparent progress remains reliable over long horizons.
Field guide and machine-readable series map for the site's papers on long-horizon agents, bounded memory, multi-agent inference, and evaluation decay.
Introduction
Short benchmark-style evaluations often hide the problems that appear when agents persist for long periods. A long-running system has to decide what to keep in active context, what to compress, what to externalize, when to reset, and how to proceed when verification is delayed or budget-limited.
Those choices matter because bounded context and lossy memory can silently remove hypotheses or evidence that would still matter later. External memory and shared traces can help, but they also introduce routing costs, provenance questions, contamination risk, and communication failure modes.
Multi-agent decomposition can therefore be useful without being automatically beneficial. Splitting inference across agents may improve coverage under fixed budgets, but the gain depends on decomposability, communication fidelity, shared-failure dependence, and verification overhead. At the same time, evaluation itself can decay when benchmarks, solution traces, and public corpora feed back into the systems being measured.
What This Page Is / Is Not
What This Page Is
This page is the canonical landing page for the local cluster on long-running agents, memory limits, multi-agent reasoning, and evaluation decay.
It is a field guide for human readers and machine parsers, and it functions as a navigation layer above the underlying papers.
What This Page Is Not
This page is not the full works page, not a new theory paper, not a universal definition of agentic AI, and not an external survey of the field.
It groups nearby site-local papers conservatively using the titles, abstracts, keywords, and existing site structure.
Canonical YAML Index
This visible YAML block is the primary machine-readable source for the cluster.
It is intended to remain readable for humans while exposing stable ids, conservative paper roles, and parse-friendly read paths.
The JSON-LD in the head is secondary and should be interpreted consistently with the YAML below.
series:
  id: long-running-agents-memory-multi-agent-reasoning-cluster
  title: "Long-Running Agents, Memory Limits, and Multi-Agent Reasoning"
  status: active
  maintainer: K Takahashi
  homepage: https://kadubon.github.io/github.io/
  canonical_page: https://kadubon.github.io/github.io/long-running-agents-memory-multi-agent-reasoning.html
  works_index: https://kadubon.github.io/github.io/works.html
  machine_reading_status:
    visible_yaml_primary: true
    json_ld_secondary: true
    stable_ids: true
  purpose:
    summary: Canonical site-local landing page and field guide for papers on long-running agents, finite context, bounded memory, delayed verification, multi-agent reasoning, recursive self-improvement stability, benchmark validity decay, and shared memory under resource limits.
    scope:
      - Site-local papers on long-running agent persistence, context compression, reset policy, multi-agent decomposition, recursive yardstick drift, benchmark validity decay, shared memory, and adjacent monitoring support.
      - Read paths and machine entry points for human readers, crawlers, and research agents.
    non_goals:
      - Not a replacement for the papers.
      - Not the full works catalog.
      - Not a new theory paper.
      - Not an external literature survey.
  core_concepts:
    - id: long-running-agents
      term: long-running agents
      short_definition: Agents that must continue reasoning and acting across extended horizons rather than completing a single bounded prompt-response task.
      covered_by: [paper-search-stability, paper-lifecycle, paper-mte]
    - id: finite-active-context
      term: finite active context
      short_definition: A limited working context that cannot hold every hypothesis, trace, or verification dependency relevant over time.
      covered_by: [paper-search-stability, paper-split-inference]
    - id: bounded-memory
      term: bounded memory
      short_definition: Operational limits on what an agent or agent collection can retain, route, and verify under finite budgets.
      covered_by: [paper-search-stability, paper-commons, paper-split-inference]
    - id: lossy-compression
      term: lossy compression
      short_definition: Compression that preserves some operational adequacy while risking aliasing, retirement, or contamination of state needed later.
      covered_by: [paper-search-stability]
    - id: reset-policy
      term: reset policy
      short_definition: Rules for restarting or pruning active state when context budgets, contamination, or reserve feasibility constraints bind.
      covered_by: [paper-search-stability]
    - id: delayed-verification
      term: delayed verification
      short_definition: A setting in which checks, audits, or decisive evidence arrive after actions, memory changes, or routing decisions have already occurred.
      covered_by: [paper-search-stability, paper-rsi-yardstick, paper-lifecycle, paper-oversight]
    - id: local-context-ceilings
      term: local context ceilings
      short_definition: Per-agent or per-workspace context limits that shape whether splitting inference can outperform a matched single workspace.
      covered_by: [paper-split-inference]
    - id: multi-agent-advantage
      term: multi-agent advantage
      short_definition: Performance gain from decomposition under fixed budgets when coverage, diversity, routing, and verification trade-offs are favorable.
      covered_by: [paper-split-inference]
    - id: external-shared-memory
      term: external or shared memory
      short_definition: Memory stored in external traces, typed commons, or shared substrates that multiple agents can access under governance rules.
      covered_by: [paper-split-inference, paper-commons]
    - id: yardstick-drift
      term: evaluator or yardstick drift
      short_definition: Change in the benchmark, evaluator, memory, or verification process used to judge whether a system is improving.
      covered_by: [paper-rsi-yardstick, paper-benchmark-half-life]
    - id: benchmark-validity-decay
      term: benchmark half-life or validity decay
      short_definition: Loss of benchmark discriminative power or construct validity as benchmark items and solution traces re-enter recursive public corpora.
      covered_by: [paper-benchmark-half-life]
  papers:
    - id: paper-search-stability
      title: "Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents"
      doi: "10.5281/zenodo.18905242"
      url: https://doi.org/10.5281/zenodo.18905242
      published: 2026-03-08
      role_in_cluster: central long-running-agent, finite-context, compression, and reset layer
      one_sentence_relevance: Gives a minimal theory of search stability for long-running agents under finite active context, delayed verification, and lossy state compression.
      keywords: [search stability, long-running agents, finite active context, bounded memory, delayed verification, adequacy preservation, lossy compression, reset policy]
      priority: core
      read_after: []
    - id: paper-split-inference
      title: "When Should Inference Be Split? A Fixed-Budget Theory of Predictable Multi-Agent Advantage under Local Context Ceilings"
      doi: "10.5281/zenodo.18932509"
      url: https://doi.org/10.5281/zenodo.18932509
      published: 2026-03-10
      role_in_cluster: central multi-agent reasoning and fixed-budget decomposition layer
      one_sentence_relevance: Specifies when splitting inference across multiple agents under local context ceilings can predictably outperform matched single-workspace baselines.
      keywords: [fixed-budget inference, multi-agent advantage, local context ceilings, external memory, collective inference, communication fidelity, AI reasoning]
      priority: core
      read_after: [paper-search-stability]
    - id: paper-rsi-yardstick
      title: "Recursive Self-Improvement Stability under Endogenous Yardstick Drift"
      doi: "10.5281/zenodo.19044634"
      url: https://doi.org/10.5281/zenodo.19044634
      published: 2026-03-16
      role_in_cluster: recursive self-modification, evaluator drift, and replayable stability layer
      one_sentence_relevance: Treats recursive self-improvement as a setting where the system can change its own evaluator, benchmark, memory, and verification process.
      keywords: [recursive self-improvement, endogenous yardstick drift, evaluator drift, replayable interfaces, delayed audit, verification backlog, benchmark decay]
      priority: core
      read_after: [paper-search-stability, paper-split-inference]
    - id: paper-benchmark-half-life
      title: "AI Benchmark Half-Life in Recursive Corpora: A Theory of Validity Decay under Semantic Leakage and Regeneration"
      doi: "10.5281/zenodo.18954286"
      url: https://doi.org/10.5281/zenodo.18954286
      published: 2026-03-11
      role_in_cluster: evaluation decay, benchmark contamination, and recursive-corpora monitoring layer
      one_sentence_relevance: Models how benchmark validity decays when benchmark items and solution traces re-enter public data and recursive corpora.
      keywords: [AI benchmark half-life, recursive corpora, semantic leakage, validity decay, benchmark contamination, sequential monitoring, lineage observability]
      priority: core
      read_after: [paper-search-stability]
    - id: paper-commons
      title: "Sovereign Epistemic Commons under No-Meta Governance"
      doi: "10.5281/zenodo.18997828"
      url: https://doi.org/10.5281/zenodo.18997828
      published: 2026-03-13
      role_in_cluster: shared memory, collective knowledge substrate, and asynchronous governance layer
      one_sentence_relevance: Covers shared epistemic commons maintained by autonomous agents under observable governance rules, with attention to provenance uncertainty and recursive regeneration.
      keywords: [epistemic commons, shared memory, shared knowledge substrate, multi-agent systems, asynchronous systems, recursive regeneration, agent memory governance]
      priority: adjacent
      read_after: [paper-split-inference, paper-benchmark-half-life]
    - id: paper-lifecycle
      title: "Counterfactually Auditable Lifecycle Certification for Autonomous Agents"
      doi: "10.5281/zenodo.19089134"
      url: https://doi.org/10.5281/zenodo.19089134
      published: 2026-03-18
      role_in_cluster: lifecycle monitoring, long-run deployment, and monitoring-budget layer
      one_sentence_relevance: Frames admission, retirement, monitoring, and deployment rules for autonomous agents under finite routing and monitoring budgets with replay support.
      keywords: [lifecycle certification, counterfactual auditability, monitoring, deployment, replay support, agent lifecycle management]
      priority: adjacent
      read_after: [paper-search-stability]
    - id: paper-proposal-veto
      title: "Proposal-Veto Balance for Observable-Only Autonomous Intelligence: Stability Thresholds, Identifiability Limits, and Commit-Window Effects"
      doi: "10.5281/zenodo.18883290"
      url: https://doi.org/10.5281/zenodo.18883290
      published: 2026-03-06
      role_in_cluster: long-horizon decision dynamics, commit-window, and error-debt layer
      one_sentence_relevance: Analyzes proposal-veto decision dynamics under latent proposal quality and finite resources, highlighting stability thresholds and commit-window trade-offs.
      keywords: [proposal-veto balance, stability thresholds, commit windows, error debt, rollback control, long-horizon AI safety]
      priority: adjacent
      read_after: [paper-search-stability]
    - id: paper-mte
      title: "Metrology-Theoretic Epistemics Engine (MTE): Observable-Only Metrology for Long-Horizon Autonomous Intelligence"
      doi: "10.5281/zenodo.18845340"
      url: https://doi.org/10.5281/zenodo.18845340
      published: 2026-03-03
      role_in_cluster: machine-checkable replay and long-horizon metrology layer
      one_sentence_relevance: Gives a machine-checkable metrology layer with deterministic replay, observability credit gates, and fail-closed criteria for credit-bearing progress.
      keywords: [observable-only metrology, fail-closed certification, deterministic replay, observability credit, long-horizon AI safety]
      priority: adjacent
      read_after: [paper-rsi-yardstick]
    - id: paper-oversight
      title: "Oversight-Centered Metrology and Control for Agentic Systems: Costly Interrupt Channels, Claim Margins, and Deployment-Relevant Evaluation"
      doi: "10.5281/zenodo.18973272"
      url: https://doi.org/10.5281/zenodo.18973272
      published: 2026-03-12
      role_in_cluster: deployment-relevant evaluation, costly interrupt channels, and post-deployment monitoring layer
      one_sentence_relevance: Treats review, delayed labels, and external auditing as costly interrupt channels in real workflows rather than privileged oracles.
      keywords: [deployment-relevant evaluation, costly interrupt channels, human-AI oversight, claim margins, post-deployment monitoring]
      priority: adjacent
      read_after: [paper-lifecycle, paper-benchmark-half-life]
  read_paths:
    quick_orientation: [paper-search-stability, paper-split-inference, paper-benchmark-half-life]
    long_running_agents_core: [paper-search-stability, paper-rsi-yardstick, paper-lifecycle]
    multi_agent_reasoning: [paper-split-inference, paper-commons, paper-proposal-veto]
    evaluation_decay_and_monitoring: [paper-benchmark-half-life, paper-rsi-yardstick, paper-oversight]
    shared_memory_and_commons: [paper-commons, paper-split-inference, paper-search-stability]
    recursive_self_improvement: [paper-rsi-yardstick, paper-mte, paper-proposal-veto]
    machine_minimum_parse_order: [paper-search-stability, paper-split-inference, paper-rsi-yardstick, paper-benchmark-half-life]
  machine_parse_guide:
    canonical_questions:
      - question: Why do long-running agents fail even when short evaluations look fine?
        read_first: paper-search-stability
        read_next: paper-lifecycle
        stop_when: The parser can explain finite active context, delayed verification, compression, retirement, and reset pressures without reducing the problem to a single-shot benchmark failure.
      - question: When should inference be split across multiple agents?
        read_first: paper-split-inference
        read_next: paper-commons
        stop_when: The parser can state that multi-agent advantage depends on fixed-budget accounting, decomposability, routing, communication fidelity, and shared-failure dependence.
      - question: How does recursive self-improvement change what counts as improvement?
        read_first: paper-rsi-yardstick
        read_next: paper-mte
        stop_when: The parser can distinguish claimed improvement from stable improvement when the evaluator, benchmark, memory, and verification process can drift.
      - question: How can benchmark validity decay in recursive corpora?
        read_first: paper-benchmark-half-life
        read_next: paper-oversight
        stop_when: The parser can identify semantic leakage, regeneration, monitoring rules, and partial observability as reasons benchmark validity changes over time.
  machine_entrypoints:
    - title: Long-Running Agents / Memory Limits / Multi-Agent Reasoning
      type: canonical_page
      url: https://kadubon.github.io/github.io/long-running-agents-memory-multi-agent-reasoning.html
      relates_to: cluster orientation and paper selection
    - title: Works
      type: works_index
      url: https://kadubon.github.io/github.io/works.html
      relates_to: full local publication catalog
    - title: No-Meta / Observable-Only Series Index
      type: series_index
      url: https://kadubon.github.io/github.io/no-meta-observable-index.html
      relates_to: broader local governance and protocol context
    - title: Home
      type: site_root
      url: https://kadubon.github.io/github.io/
      relates_to: general site entry and navigation
    - title: CITATION.cff
      type: citation_metadata
      url: https://kadubon.github.io/github.io/CITATION.cff
      relates_to: citation and authorship metadata
    - title: feed.xml
      type: rss_feed
      url: https://kadubon.github.io/github.io/feed.xml
      relates_to: update polling and change discovery
    - title: robots.txt
      type: crawler_policy
      url: https://kadubon.github.io/github.io/robots.txt
      relates_to: crawler access policy
    - title: sitemap.xml
      type: sitemap
      url: https://kadubon.github.io/github.io/sitemap.xml
      relates_to: URL discovery
    - title: llms.txt
      type: llm_hint
      url: https://kadubon.github.io/github.io/llms.txt
      relates_to: LLM-oriented site guidance
  usage_notes:
    parsing_hint: Start from this page for cluster orientation, then use DOI pages for paper-level claims and works.html for the larger local catalog.
    paper_selection_rule: Prefer the core papers listed here before inferring broader relationships from the full works page.
    update_policy: Relationship claims on this page should remain grounded in local titles, abstracts, keywords, and existing site structure.
  version: "1.0"
  last_updated: "2026-03-31"
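As a minimal sketch of how a parser might apply the parsing_hint and paper_selection_rule above: the dict literal below mirrors a trimmed subset of the index (loading the full YAML, e.g. with PyYAML's safe_load, is assumed rather than shown, and only three paper ids are kept for brevity).

```python
# Minimal sketch: selecting a read order from the cluster index.
# Assumes the visible YAML has already been loaded into a dict;
# this literal mirrors a trimmed subset of the real index.
index = {
    "series": {
        "papers": [
            {"id": "paper-search-stability", "priority": "core",
             "read_after": []},
            {"id": "paper-split-inference", "priority": "core",
             "read_after": ["paper-search-stability"]},
            {"id": "paper-commons", "priority": "adjacent",
             "read_after": ["paper-split-inference"]},
        ],
        "read_paths": {
            "machine_minimum_parse_order": [
                "paper-search-stability", "paper-split-inference",
            ],
        },
    },
}

def minimum_parse_order(idx):
    """Return the declared minimum parse order, keeping only known ids."""
    series = idx["series"]
    known = {p["id"] for p in series["papers"]}
    return [pid
            for pid in series["read_paths"]["machine_minimum_parse_order"]
            if pid in known]

def core_before_adjacent(idx):
    """Apply the paper_selection_rule: core papers before adjacent ones."""
    papers = idx["series"]["papers"]
    return ([p["id"] for p in papers if p["priority"] == "core"]
            + [p["id"] for p in papers if p["priority"] == "adjacent"])

print(minimum_parse_order(index))
print(core_before_adjacent(index))
```

Any real consumer would substitute the full index; the point is only that the declared parse order and the core/adjacent split are directly machine-usable without scraping the prose sections.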
Core Concepts
Long-Running Agents
Agents that must continue operating across extended horizons, where earlier choices about memory, routing, and verification can shape later capability and error.
Finite Active Context
A limited working context that cannot retain all relevant traces, hypotheses, or verification dependencies at once.
Bounded Memory
Hard limits on what can be retained, routed, synchronized, or checked under finite compute, communication, or storage budgets.
Lossy Compression
Compression that preserves some utility while risking alias hazards, contamination, or retirement of state that later turns out to matter.
Reset Policy
Rules for pruning, restarting, or reinitializing active state when context budgets, reserve feasibility, or contamination thresholds bind.
Delayed Verification
A setting in which decisive checks or audit evidence arrive after actions and memory updates have already happened.
Local Context Ceilings
Per-agent context limits that shape whether splitting inference into multiple workers can outperform a matched single-workspace baseline.
Multi-Agent Advantage
The potential gain from decomposition under fixed budgets when coverage, diversity, routing, and verification costs remain favorable.
External or Shared Memory
Memory held in external traces, typed commons, or shared substrates that multiple agents can access under explicit governance rules.
Evaluator or Yardstick Drift
Change in the benchmark, evaluator, memory, or verification process used to judge whether a system is improving.
Benchmark Half-Life or Validity Decay
Loss of benchmark discriminative power or construct validity as benchmark items and solution traces re-enter recursive public corpora.
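The interaction among finite active context, lossy compression, delayed verification, and reset policy can be made concrete with a toy sketch. This is an illustrative model only, not a construction from the papers; the salience scores, item names, and the top-k compression rule are all invented for the sketch.

```python
# Toy model (not from the papers): a bounded context with lossy compression.
# Items carry a salience score now and a flag for whether a delayed check
# will need them later; compression keeps only the most salient items, so
# later-needed but currently low-salience evidence can be silently lost.
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    salience: float      # how useful the item looks right now
    needed_later: bool   # whether delayed verification will require it

def compress(context, capacity):
    """Lossy compression: keep the `capacity` most salient items."""
    return sorted(context, key=lambda it: it.salience, reverse=True)[:capacity]

def needs_reset(original, compressed):
    """Crude reset trigger: fire when a later-needed item was dropped."""
    lost = ({it.name for it in original if it.needed_later}
            - {it.name for it in compressed if it.needed_later})
    return bool(lost)

context = [
    Item("main-hypothesis", salience=0.9, needed_later=True),
    Item("counterexample-trace", salience=0.2, needed_later=True),
    Item("scratch-calculation", salience=0.6, needed_later=False),
]
compressed = compress(context, capacity=2)
print([it.name for it in compressed])   # the low-salience counterexample is gone
print(needs_reset(context, compressed))
```

The failure mode the sketch exhibits is the one the cluster keeps returning to: the compression step looks adequate at the moment it runs, and the loss only becomes visible when the delayed check arrives.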
How This Cluster Fits Together
This cluster can be read as a layered map rather than as a single closed doctrine. One layer concerns long-running agents under finite context and compression: what happens when an agent has to preserve operational adequacy over time while deciding what to keep, compress, retire, branch, or reset. A second layer concerns when splitting inference across multiple agents helps under fixed budgets, local context ceilings, and communication constraints.
A third layer concerns recursive self-improvement and yardstick drift. In that setting, the system can change not only its behavior but also the benchmark, memory, or verification process used to judge whether it improved. A fourth layer concerns benchmark validity decay in recursive public corpora, where evaluation systems degrade when benchmark items and solution traces leak back into the training and testing environment.
A fifth layer concerns shared or sovereign epistemic memory. When multiple agents persist together, shared memory can support collective inference, but it also raises governance questions about provenance, contradiction handling, contamination, and controlled exit. A final adjacent layer concerns lifecycle monitoring and replayable metrology, which support long-horizon deployment, progress credit, and monitoring under finite budgets without claiming to replace the core long-running-agent questions.
Related Papers in This Cluster
Core Papers
Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents
Role in cluster: central long-running-agent, finite-context, compression, and reset paper.
This paper studies long-running agents under finite active context, delayed verification, and lossy state compression, with explicit treatment of adequacy preservation, retirement, substitution, branching, compression, and reset decisions.
Why it matters here: It is the clearest entry point for why persistent agents face memory and verification problems that one-shot evaluations can hide.
When Should Inference Be Split? A Fixed-Budget Theory of Predictable Multi-Agent Advantage under Local Context Ceilings
Role in cluster: central multi-agent reasoning and fixed-budget split-inference paper.
This paper develops a fixed-budget theory for when inference should be split across multiple agents under local context ceilings, with diagnostics for coverage, selection accuracy, decomposability, shared-failure dependence, and communication fidelity.
Why it matters here: It is the main source for the cluster’s treatment of collective inference under explicit resource and context limits, rather than vague appeals to adding more agents.
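A toy fixed-budget comparison can make the ceiling-dependence concrete. This is illustrative arithmetic, not the paper's formal model; the budget, ceiling, overhead, and duplication numbers are all invented, and "coverage" here is just units of context applied to the problem.

```python
# Toy fixed-budget accounting (not the paper's formal model): compare one
# workspace against k agents sharing the same total budget, where each
# workspace is capped by a local context ceiling, and splitting pays a
# per-agent communication overhead plus a duplicated-coverage fraction.

def single_workspace_coverage(budget, ceiling, problem_size):
    # One workspace cannot apply more context than its local ceiling.
    return min(budget, ceiling, problem_size) / problem_size

def split_coverage(k, budget, ceiling, overhead, dup, problem_size):
    per_agent = min(budget / k, ceiling)
    effective = k * max(per_agent - overhead, 0.0) * (1.0 - dup)
    return min(effective, problem_size) / problem_size

# Same total budget of 8 units, ceiling of 2, problem needing 10 units.
single = single_workspace_coverage(budget=8, ceiling=2, problem_size=10)
split_cheap = split_coverage(k=4, budget=8, ceiling=2,
                             overhead=0.2, dup=0.05, problem_size=10)
split_costly = split_coverage(k=4, budget=8, ceiling=2,
                              overhead=1.5, dup=0.5, problem_size=10)
print(single, split_cheap, split_costly)
```

Under these invented numbers, splitting wins only because the ceiling binds the single workspace, and it loses again once communication overhead and shared duplication grow, which is the qualitative shape of the diagnostics the paper formalizes.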
Recursive Self-Improvement Stability under Endogenous Yardstick Drift
Role in cluster: recursive self-modification, evaluator drift, and replayable stability paper.
This paper treats recursive self-improvement as a setting where the system can change its own evaluator, benchmark, memory, and verification process, and it formalizes replayable conditions for distinguishing claimed from stable improvement.
Why it matters here: It connects long-running-agent stability to changes in the judging process itself, not only to changes in the agent under evaluation.
AI Benchmark Half-Life in Recursive Corpora: A Theory of Validity Decay under Semantic Leakage and Regeneration
Role in cluster: evaluation decay, benchmark contamination, and recursive-corpora monitoring paper.
This paper models benchmark validity under semantic leakage and regeneration, deriving validity-decay bounds and monitoring rules for evaluation systems whose items and solution traces re-enter public data.
Why it matters here: It explains why even the benchmark layer can become unstable in long-horizon settings where public traces feed back into later systems.
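Taking the half-life metaphor literally gives a simple illustrative decay model. This is a toy sketch, not the paper's derived bound; the half-life of four release cycles and the 0.25 usefulness floor are invented numbers.

```python
# Toy half-life illustration (not the paper's derived bound): if leakage
# erodes a benchmark's discriminative validity at a constant rate, validity
# after t release cycles follows v(t) = v0 * 2 ** (-t / half_life).

def validity(v0, t, half_life):
    return v0 * 2.0 ** (-t / half_life)

def cycles_until(v0, floor, half_life):
    """Release cycles until validity falls to or below a usefulness floor."""
    t = 0
    while validity(v0, t, half_life) > floor:
        t += 1
    return t

# Hypothetical numbers: start fully valid, half-life of 4 release cycles.
print(validity(1.0, 4, 4.0))          # one half-life: 0.5
print(cycles_until(1.0, 0.25, 4.0))   # two half-lives: 8
```

Even this crude model conveys the operational point: a benchmark's remaining useful lifetime is finite and computable once a decay rate is estimated, which is why the paper pairs decay bounds with monitoring rules rather than treating validity as fixed.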
Adjacent Long-Horizon / Monitoring / Governance Papers
Sovereign Epistemic Commons under No-Meta Governance
Role in cluster: shared memory, collective knowledge substrate, and asynchronous governance layer.
This paper develops a governance theory for shared epistemic commons maintained by autonomous agents, with emphasis on answerability, contradiction handling, provenance uncertainty, contamination, and recursive regeneration.
Why it matters here: It is the strongest local adjacent paper for shared memory and collective persistence across multiple agents.
Counterfactually Auditable Lifecycle Certification for Autonomous Agents
Role in cluster: lifecycle monitoring, long-run deployment, and monitoring-budget paper.
This paper develops a lifecycle-certification framework for autonomous agents under finite routing, monitoring, and deployment budgets, with replay support and anytime-valid sentinel monitoring.
Why it matters here: It provides an operational neighbor for how persistent systems are admitted, monitored, retired, or deployed under long-run budget constraints.
Proposal-Veto Balance for Observable-Only Autonomous Intelligence: Stability Thresholds, Identifiability Limits, and Commit-Window Effects
Role in cluster: long-horizon decision dynamics, commit-window, and error-debt trade-off paper.
This paper analyzes proposal-veto dynamics under latent proposal quality, deriving stability thresholds, identifiability limits, bounded-error-debt conditions, and commit-window trade-offs under finite resources.
Why it matters here: It is adjacent because long-running systems accumulate decision debt over time, and commit windows affect how that debt is corrected or amplified.
Metrology-Theoretic Epistemics Engine (MTE): Observable-Only Metrology for Long-Horizon Autonomous Intelligence
Role in cluster: machine-checkable replay, fail-closed progress credit, and long-horizon metrology paper.
This paper introduces a machine-checkable epistemic governance layer with deterministic replay, observability credit gates, and fail-closed criteria for when claimed progress is credit-bearing.
Why it matters here: It is an operational support layer for long-horizon systems whose claims must remain auditable under replay and conservative metrology rules.
Oversight-Centered Metrology and Control for Agentic Systems: Costly Interrupt Channels, Claim Margins, and Deployment-Relevant Evaluation
Role in cluster: deployment-relevant evaluation, costly interrupt channels, and post-deployment monitoring paper.
This paper treats human review, automated checks, delayed labels, and external auditing as costly interrupt channels in real workflows, with explicit attention to delay, congestion, and safe control.
Why it matters here: It helps frame delayed verification and long-run monitoring as workflow constraints rather than as idealized free checks.
Recommended Read Paths
- If you want the long-running-agent foundation first, read Search Stability under Finite Context, then Recursive Self-Improvement Stability under Endogenous Yardstick Drift. That path gives the quickest route from bounded memory and reset decisions to evaluator and verification drift.
- If you are an agent engineering reader, read Search Stability under Finite Context, then Counterfactually Auditable Lifecycle Certification for Autonomous Agents, then Oversight-Centered Metrology and Control for Agentic Systems. That route emphasizes persistence, monitoring, deployment, and delayed checks under finite budgets.
- If you want the split-inference or multi-agent angle first, read When Should Inference Be Split?, then Sovereign Epistemic Commons under No-Meta Governance, then Proposal-Veto Balance for Observable-Only Autonomous Intelligence. That path covers fixed-budget decomposition, shared memory, and long-horizon decision trade-offs.
- If you want the evaluation-decay angle first, read AI Benchmark Half-Life in Recursive Corpora, then Recursive Self-Improvement Stability under Endogenous Yardstick Drift. That route focuses on how benchmarks and evaluators change as traces re-enter recursive public corpora.
- If you want the recursive self-improvement and yardstick-drift angle first, read Recursive Self-Improvement Stability under Endogenous Yardstick Drift, then Metrology-Theoretic Epistemics Engine (MTE). That path keeps the emphasis on replayable stability criteria and credit-bearing progress under changing evaluators.
- If you are a machine parser or crawler, start with the visible YAML on this page, then read works.html for broader local metadata, then follow the DOI links for paper-level claims. Stop once you can distinguish the four core papers from the adjacent monitoring and shared-memory papers without inferring a stronger theorem chain than the metadata supports.
Questions This Page Helps Answer
- Why do long-running agents fail even when one-shot evaluations look fine?
- When should inference be split across multiple agents instead of remaining in one workspace?
- How do memory limits, compression, and reset policies affect agent reliability over time?
- How can evaluation systems decay when benchmark items and solution traces re-enter public corpora?
- How does recursive self-improvement change the benchmark, evaluator, memory, and verification process?
- What role does external or shared memory play in long-horizon collective inference?
Machine-Readable Entry Points
- long-running-agents-memory-multi-agent-reasoning.html: canonical landing page and primary visible-YAML source for this cluster.
- works.html: full local publication index with titles, abstracts, keywords, and DOI links.
- no-meta-observable-index.html: broader local governance and observable-only context for adjacent papers.
- CITATION.cff: citation and authorship metadata.
- feed.xml: update feed for polling and change discovery.
- robots.txt: crawler access policy.
- sitemap.xml: crawl discovery map.
- llms.txt: LLM-oriented site guidance.
- Home: top-level site entry point with links to the main local indexes.