The moment the term Agent Harness was coined, it meant the concept had entered the realm of the “named”. Once a practice is named—once people realize that existing language cannot precisely describe it and thus borrow a new term—it begins its transition from a “niche” practice to a “best practice”.
I still remember the days when people would scramble to pack source code into long-form text, frantically pasting it into the ChatGPT chat box. Even now, amidst the dread of Claude Code launching ten rounds of an Explore Agent, I sometimes copy and paste simple code problems into Google AI Studio ↗ just to get a blissfully innocent response.
Regardless, the debate over whether reasoning models require Prompt Engineering has ended. A set of well-crafted prompts and tool schemas can grant a model extraordinary capabilities, and this—in a quiet, unobtrusive way—is becoming the consensus.
Wait. So what is a Stateless Agent?#
My perspective originates from a strange sensation regarding “Memory”.
We first saw memory features in ChatGPT and Google Gemini; later, Claude Code added the so-called MEMORY.md. Some LLM clients, like alma ↗, provided impressive memory functions. On April 16, the new version of OpenAI Codex ↗ also introduced memory capabilities.
Actually, the definition of memory is quite broad. Most approaches are relatively “primitive”, essentially revolving around one or more text files that the LLM summarizes and updates in the background. There are also “academic” efforts, such as Nowledge Mem ↗, which constructs a vast array of concepts around the aggregation and distillation of memory.
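Mechanically, most of these "primitive" memory features reduce to a background summarize-and-rewrite step over a text file. A minimal sketch, where the hypothetical `summarize` function stands in for the LLM's background distillation call:

```python
from pathlib import Path

def summarize(old_memory: str, transcript: str) -> str:
    """Stand-in for an LLM call that distills a session into durable notes.
    Here we just keep the last few lines of each, so the sketch is runnable."""
    keep = old_memory.strip().splitlines()[-5:] + transcript.strip().splitlines()[-5:]
    return "\n".join(line for line in keep if line)

def update_memory(memory_file: Path, transcript: str) -> None:
    """The whole trick: read the memory file, fold in the new session, write it back."""
    old = memory_file.read_text() if memory_file.exists() else ""
    memory_file.write_text(summarize(old, transcript))

memory = Path("MEMORY.md")
update_memory(memory, "User prefers tabs over spaces.\nProject uses pnpm.")
print(memory.read_text())
```

However sophisticated the product around it, the state lives in that file, and every future session inherits whatever the summarizer chose to keep.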
But I started to ponder a question: What is the role of memory in the LLM workflow?
When we ask an LLM to work on a codebase, the process generally looks like this:
- The human provides instructions or a goal, perhaps along with constraints and a process. At this point, the model possesses a portion of the context.
- The Agent gathers additional context through various means until the context reaches a certain threshold.
- The Agent begins to think about how to proceed, and the context expands further to the next threshold.
- Finally, the Agent builds the solution.
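The steps above can be compressed into a single loop: accumulate context until it crosses a sufficiency threshold, then think and build. A toy sketch, where the sources and the threshold are invented purely for illustration:

```python
def run_task(goal: str, sources, is_sufficient) -> str:
    """Generic agent workflow: context grows until a threshold, then think and build."""
    context = [goal]                       # step 1: human-provided intent
    for source in sources:                 # step 2: gather additional context
        if is_sufficient(context):
            break
        context.append(source())
    plan = f"plan based on {len(context)} context items"  # step 3: think
    return f"solution via {plan}"                          # step 4: build

result = run_task(
    "rename config field",
    sources=[lambda: "repo layout", lambda: "existing code", lambda: "user answer"],
    is_sufficient=lambda ctx: len(ctx) >= 3,  # invented threshold
)
print(result)
```

The interesting question, pursued below, is what counts as legitimate material for `context` and who decides when it is sufficient.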
Viewed this way, Memory is just a form of context. Context also includes other code in the repository, the Agent’s inquiries to the user, and so on.
In communication, people are mainly bothered by two things: the other party failing to comprehend, and the other party lacking "tacit understanding" (the ability to fill in missing context). Because LLMs aren't yet powerful enough, redundant context is often used to compensate for the first. This has led to "spec-driven vibe coding": a method that relies on a large volume of description (whether handwritten or AI-generated) to keep the output of vibe coding aligned with the intent. For the second, people want that tacit understanding, whether to maintain consistency or because they believe diachronic accumulation leads to progress; thus, Memory was introduced.
One could say both approaches are quite correct; practice has proven they yield results. But note what they are for: these practices exist to make an otherwise potentially incorrect result (the product of vibe coding) correct. That means there is more than one way to ensure correctness, and the real task is to find the optimal way of turning potentially incorrect results into correct ones.
This is easy to understand: if we only care about whether the LLM outputs correct code, we could just provide increasingly detailed Plans and Specifications, eventually telling it exactly how to write the code. The LLM then merely becomes a tool-caller, responsible for applying that code to the repository.
I know this “limit” (in the mathematical sense) might seem nonsensical, but it proves a point: we should seek effective methods, and simply “making the LLM’s output correct” does not necessarily mean effective.
Robert Englander wrote in Engineering Alignment in Probabilistic Generation ↗ that the goal of software engineering has always been correctness. LLMs have changed how errors occur. In a probabilistic generation process, correctness degrades at the boundaries between different levels (from Intent to Spec, from Spec to CoT, from CoT to code generation). This degradation manifests as subtle semantic shifts, the filling of implicit assumptions, and the compression of ambiguity. Unlike explicit errors in traditional systems, an LLM-generated system might compile successfully, pass all tests, and run smoothly, yet still have semantically drifted from the original intent.
In that article, the author proposes five planes and emphasizes governing the boundaries between them. But I want to look at this from another perspective. The ontology of an LLM lies in predictive generation, which dictates that n-shot is the most powerful way to influence it. In other words, the content generated by an LLM largely cannot be measured by “correctness.” Correctness does not exist within the LLM; it exists within the context. A certain context implies a certain probability distribution of outputs. Relying solely on the “extreme pressure” of context cannot make an LLM 100% generate a specific output, nor does it guarantee that the low-probability “hallucinations” in the distribution will be more “correct.”
Correctness doesn’t stem from “engineering discipline” either. Specifications don’t actually manufacture correctness; they merely shift one error into another, trending toward so-called correctness through a spiral of development. Every state judged “correct” under one harness turns out to be an “error” under the next, and this series does not converge.
If we analyze further, human intent itself carries uncertainty, and there is always a “remnant of the Real” that cannot be described by language. This view might not be popular in software engineering, or even in the natural sciences. But if it is true—and I believe it is—then even without LLM assistance, code written by humans is bound to be not entirely correct.
Therefore, my conclusion is: Correctness is located neither in the LLM nor in the Specification. It resides in the collective effect of all contexts and may also rely on the psyche of the Reviewer.
If this is true, what does it mean? First, we can no longer pin our hopes for correctness on providing a complete, non-deviating intent. This is why I propose the distinction between Stateful and Stateless context. When we write a Specification or introduce memory into the context, we are actually performing a derivation of our intent. This is why I call it “stateful.” Memory is derived from historical records; the Spec is derived from the developer’s intent. The Agent no longer thinks in terms of the source code, but instead obeys the state represented by the Spec. In other words, developers are still almost programming manually—only the language has shifted from a programming language to natural language—and they must rely on various constraints to compensate for the limitations of natural language. Moving from Intent to Spec creates a semantic drift at the boundary. When using Memory, one similarly sacrifices the “Markovian” nature of the “here-and-now” intent, hoping the model gains an understanding of the intent’s perimeter through historical data.
This is what I find strange.
I believe the context should strictly consist only of my description of intent, and the LLM should be trained to possess this capability: it should be able to identify whether the current context allows it to write sufficiently correct (rather than perfectly correct) code. If it is insufficient, it should use tools like Explore, Ask User, or something like context7 ↗ to further acquire context until it is confident.
Why is this the right way? Clearly, people have placed too much focus on “state.” The LLM is a black box, but people unconsciously assume it is a black box like a function: if the input is precise enough, the output will be too. However, the LLM is essentially a Schrödinger-style black box. When this nature surfaces, people start adjusting the input, increasing precision, or introducing diachronicity: adding twelve steps where each step is “precisely” defined, or maintaining a memory file that constantly rolls forward and updates.
This is actually a neurotic behavior. I say this without malice or intent to attack: according to psychoanalytic theory, everyone is structured as neurotic, psychotic, or perverse (some add autism as a fourth structure; the categories are still debated). The point of highlighting this is to show that such behavior serves to satisfy one’s own psychological energy; it aligns with the unconscious desire to be “a bit more precise,” “a bit more comfortable,” or to “question, overrule, and rebuild.” This doesn’t mean my proposed method is free of such factors, but when “satisfaction” becomes the substantive drive, I find myself compelled to step back and reflect.
The Stateless Methodology#
After the discussion above, my point becomes clear: supplementing context based on Specifications or Memory belongs to a “stateful” derivative practice in the vibe coding realm. A stateless approach, however, relies on the LLM’s decision-making ability, keeping the “be a bit more precise” or “question-and-restart” cycle within the Agent Loop itself.
A typical Stateless Agent Loop looks like this:
- The user describes the intent, possibly including constraints and processes.
- The Agent performs an Explore to obtain facts from the code source.
- The Agent may use tools to gain more context, such as reading documentation.
- The Agent asks the user for clarification on key issues.
- Steps 2-4 are repeated.
- The Agent plans, then builds.
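Under stated assumptions, this loop can be sketched as follows. The stub callables stand in for the Explore, docs, and Ask User tools, and the confidence heuristic is invented; a real agent would have the model itself judge whether the context suffices:

```python
from dataclasses import dataclass, field

@dataclass
class StatelessAgent:
    """Each run starts from the intent alone: no memory file, no prior spec."""
    intent: str
    context: list = field(default_factory=list)

    def confident(self) -> bool:
        # Invented stand-in for the model judging whether the context lets it
        # write "sufficiently correct" (rather than perfectly correct) code.
        return len(self.context) >= 3

    def run(self, explore, read_docs, ask_user, max_rounds: int = 5) -> str:
        self.context.append(self.intent)        # 1. user describes the intent
        for _ in range(max_rounds):             # 5. repeat steps 2-4
            if self.confident():
                break
            self.context.append(explore())      # 2. facts from the code source
            self.context.append(read_docs())    # 3. extra context via tools
            self.context.append(ask_user())     # 4. clarify key issues
        return f"built from {len(self.context)} context items"  # 6. plan and build

agent = StatelessAgent(intent="add a retry to the HTTP client")
print(agent.run(
    explore=lambda: "client.py uses requests",
    read_docs=lambda: "requests supports urllib3 Retry",
    ask_user=lambda: "retry idempotent requests only",
))
```

Note that nothing persists between runs: every invocation rebuilds its context from the here-and-now intent, which is exactly the "stateless" property argued for above.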
This idea coincides with the philosophy of Claude Code or the Claude Agent SDK. If you have experienced a standard Claude Code session, you know the flow is remarkably smooth: you only need basic language skills to describe the task clearly (and concisely), and Claude will gather the information it needs. Sometimes, it stops to ask questions—questions that are often brilliant because they cover exactly what your initial description overlooked, sometimes even suggesting better designs within the question. Then, it generates a Plan and finishes building within minutes.
Of course, the stateless approach has its limitations:
- The quality of the codebase must be ensured. For example, if the frontend code does not misuse Tailwind CSS to override component styles, the Agent will notice this while exploring the component’s use cases, learn from these few-shot examples, and follow the same practice in subsequent code. Otherwise, one would have to devote extensive sections of the Spec to explicitly instructing the agent not to do so.
- Sufficient context acquisition methods must be provided. If there are no Agent Skills or tools like context7, knowledge won’t appear out of thin air, which limits the ceiling of the LLM’s capability.
- The LLM must cooperate with us. When trying some models, like OpenAI’s Codex series, I sensed they weren’t trained to work this way; a Spec-driven approach suits them better. I suspect this is why the Codex App recently added memory features. Meanwhile, models like Claude and GLM ↗ excel at the stateless loop.
- Excessive consumption of the context window. When the model runs an “explore subagent”, the scope is inevitably broad, which consumes more tokens and time and requires additional inference—though I don’t think the spec-driven mode saves tokens either.
Now, only one question remains: in my argument, I specifically pointed out the “vibe coding” scenario. What about other scenarios?
- Generalized Vibe Coding: For instance, project management or note organization. The user always needs to provide a Notion MCP to read/write Notion databases, or an Obsidian CLI to access note content. In these cases, the source of truth is external. The model’s task is to operate on these external states. This is essentially no different from vibe coding and fits our reasoning above.
- Non-generalized Vibe Coding: For example, topics related to personal life. Unless you are willing to meticulously list every trivial matter of your life in a memo and, after every decision or bit of growth, update the memo first and tell yourself “Bingo, Exp+1,” this is clearly a scenario where Memory shines.
Finally#
Thus, we can reach a moderate conclusion. “Stateful” is one approach; “Stateless” is the missing other. For a long time, it has rarely been named or praised, which is one of the reasons I wrote this article.
I have mentioned that some stateful practices are neurotic in nature, but the whole history of software engineering, and even human history, is structurally neurotic, so “neurotic” here is a purely neutral word: type checking, unit testing, formal proofs, and CI/CD are all great practices.
I do not believe there is a hierarchy between the two methods; the scenarios where Stateless doesn’t apply and its limitations are very clear. I promote the Stateless mode out of personal preference, just as others promote Stateful practices. But I have never believed that Stateful should be discarded or that Stateless is the only “correct” way. The purpose of proposing Stateless is to find more paths in the practice of ensuring that the LLM’s potentially incorrect outputs become correct—because that is the only way to get one step closer to the truth.