Building an internal agent: Context window compaction @ Irrational Exuberance
Hi folks,
This is the weekly digest for my blog, Irrational Exuberance. Reach out with thoughts on Twitter at @lethain, or reply to this email.
Posts from this week:
- Building an internal agent: Context window compaction
- Building an internal agent: Progressive disclosure and handling large files
- Building an internal agent: Adding support for Agent Skills
Building an internal agent: Context window compaction
Although my model of choice for most internal workflows remains GPT-4.1 for its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space. When that happens, your agent either needs to give up, or it needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.
This is part of the Building an internal agent series.
Why compaction matters
Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Although context window exhaustion isn't a concern in most internal agent use-cases, it's ultimately not possible to implement a robust, reliable agent without solving this problem, and compaction is a straightforward solution.
How we implemented it
Initially, in the beautiful moment where we assumed compaction wouldn’t be a relevant concern to our internal workflows, we implemented an extremely naive solution to compaction: if we ever ran out of tokens, we discarded older tool responses until we had more space, then continued. Because we rarely ran into compaction, the fact that this worked poorly wasn’t a major issue, but eventually the inelegance began to weigh on me as we started dealing with more workflows with large files.
When brainstorming our second iteration of compaction, I got anchored on the beautiful idea that compaction should be sequenced after implementing support for sub-agents, but I was never able to ground that intuition in a concrete reason why it was necessary, and we implemented compaction without sub-agent support.
The gist of our approach to compaction is:
- After every user message (including tool responses), add a system message with the consumed and available tokens in the context window. In that system message, we also include the updated list of available `files` that can be read from
- User messages and tool responses greater than 10,000 tokens are exposed as a new "virtual file", with only their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read more than those first 1,000 tokens (both 1k and 10k are configurable values)
- Add a set of "base tools" that are always available to agents, specifically including the virtual file manipulation tools, as we'd finally reached a point where most agents simply could not operate without a large number of mostly invisible internal tools. These tools were `file_read`, which can read entire files, line ranges within a file, or byte ranges within a file, and `file_regex`, which is similar but performs a regex scan against a file up to a certain number of matches. Every use of a file is recorded in the `files` data, so the agent knows what has and hasn't been read into the context window (particularly relevant for preloaded files), along the lines of:

  ```
  <files>
    <file id='a' name='image.png' size='32kb'>
      <file_read />
    </file>
    <file id='a' name='image.png' size='32kb'>
      <file_read start_line=10 end_line=20 />
    </file>
  </files>
  ```

  This was surprisingly annoying to implement cleanly, mostly because I came onto this idea after iteratively building the agent as a part-time project for several months. If I could start over, I would start with files as a core internal construct, rather than adding them on later.
- If a message pushes us over 80% (configurable value) of the model's available context window, use the compaction prompt that Reddit claims Claude Code uses. The prompt isn't particularly special, it just already exists and seems pretty good
- After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost (a minimal sketch of the whole flow follows this list)
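To make that flow concrete, here is a minimal sketch of the two thresholds in Python. Everything in it (the whitespace-based token counting, the `Agent` shape, the placeholder summary) is a hypothetical stand-in, not our actual implementation:

```python
from dataclasses import dataclass, field

VIRTUAL_FILE_THRESHOLD = 10_000  # tokens before a message becomes a virtual file
PREVIEW_TOKENS = 1_000           # tokens of a virtual file kept inline
COMPACTION_TRIGGER = 0.8         # fraction of the window that triggers compaction

def count_tokens(text: str) -> int:
    # Crude stand-in; a real implementation would use the model's tokenizer.
    return len(text.split())

@dataclass
class Agent:
    context_window: int
    history: list[str] = field(default_factory=list)
    files: list[str] = field(default_factory=list)  # virtual file contents

    def ingest(self, message: str) -> None:
        if count_tokens(message) > VIRTUAL_FILE_THRESHOLD:
            # Oversized messages become virtual files: only a preview stays
            # inline, and the file tools can read the rest on demand.
            self.files.append(message)
            message = " ".join(message.split()[:PREVIEW_TOKENS]) + " [see virtual file]"
        self.history.append(message)

        used = sum(count_tokens(m) for m in self.history)
        # Status message with consumed/available tokens and readable files.
        self.history.append(
            f"[system] {used}/{self.context_window} tokens used, "
            f"{len(self.files)} files available"
        )

        if used > COMPACTION_TRIGGER * self.context_window:
            # Preserve the old context window as a virtual file, then replace
            # the history with a summary (placeholder for the output of the
            # compaction prompt).
            self.files.append("\n".join(self.history))
            self.history = ["[summary of compacted conversation]"]
```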
Each of these steps is quite simple, but in combination they really do provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tools that can be called in a workflow (to avoid agents spinning out), but this means that agents dealing with large or complex data are much more likely to succeed usefully.
How is it working? / What’s next?
Whereas for most of our new internal agent features, there are obvious problems or iterations, this one feels like it’s good enough to forget for a long, long time. There are two reasons for this: first, most of our workflows don’t require large context windows, and, second, honestly this seems to work quite well.
If context windows get significantly larger in the future, which I don't see much evidence of at the moment, then we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.
Building an internal agent: Progressive disclosure and handling large files
One of the most useful initial extensions I made to our workflows was injecting associated images into the context window automatically, to improve the quality of responses to tickets and messages that relied heavily on screenshots. This was quick and made the workflows significantly more powerful.
More recently, a number of workflows have attempted to operate on large, complex files like PDFs or DOCXs, and the naive approach of shoving them into the context window hasn't worked particularly well. This post explains how we've adapted the principle of progressive disclosure to allow our internal agents to work with large files.
This is part of the Building an internal agent series.
Large files and progressive disclosure
Progressive disclosure is the practice of limiting what is added to the context window to the minimum necessary amount, and adding more detail over time as necessary.
A good example of progressive disclosure is how agent skills are implemented:
- Initially, you only add the description of each available skill into the context window
- You then load the `SKILL.md` on demand
- The `SKILL.md` can specify other files to be further loaded as helpful
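As a rough Python illustration of that three-step flow (the `Skill` shape and helper names here are hypothetical, not from any particular framework):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Skill:
    name: str
    description: str  # the only part injected into the prompt up front
    path: Path        # directory containing SKILL.md and any extra files

def skills_summary(skills: list[Skill]) -> str:
    # Step 1: only name and description enter the context window.
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)

def load_skill(skill: Skill) -> str:
    # Step 2: the agent loads the full SKILL.md on demand.
    return (skill.path / "SKILL.md").read_text()

def load_extra(skill: Skill, relative: str) -> str:
    # Step 3: SKILL.md can reference further files to load as helpful.
    return (skill.path / relative).read_text()
```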
In our internal use-case, we have skills for JIRA formatting, Slack formatting, and Notion formatting. Some workflows require all three, but the vast majority of workflows require at most one of these skills, and it’s straightforward for the agent to determine which are relevant to a given task.
File management is a particularly interesting progressive disclosure problem, because files are so helpful in many scenarios, but are also so very large. For example, requests for help in Slack are often along the lines of "I need help with this login issue" plus an attached screenshot, which is impossible to solve without including that image in the context window. In other workflows, you might want to analyze a daily data export that is 5-10MB as a PDF, but only 10-20kb of tables and text once extracted. This gets even messier when the goal is to compare across multiple PDFs, each of which is quite large.
Our approach
Our high-level approach to the large-file problem is as follows:
- Always include metadata about available files in the prompt, similar to the list of available skills. This will look something like:

  ```
  Files:
  - id: f_a1
    name: my_image.png
    size: 500,000
    preloaded: false
  - id: f_b3
    name: ...
  ```

  The key thing is that each `id` is a reference that the agent is able to pass to tools. This allows it to operate on files without loading their contents into the context window.
- Automatically preload the first N kb of files into the context window, as long as they are appropriate mimetypes for loading (png, pdf, etc). This is per-workflow configurable, and could be set as low as `0` if a given workflow didn't want to preload any files. I'm still of mixed minds about whether preloading is worth doing, as it takes some control away from the agent.
- Provide three tools for operating on files (sketched after this list):
  - `load_file(id)` loads an entire file into the context window
  - `peek_file(id, start, stop)` loads a section of a file into the context window
  - `extract_file(id)` transforms PDFs, PPTs, DOCX, and so on into simplified textual versions
- Provide a `large_files` skill which explains how and when to use the above tools to work with large files. Generally, it encourages using `extract_file` on any PDF, DOCX, or PPT file the agent wants to work with, and otherwise loading or peeking depending on the available space in the context window
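Here is a rough sketch of what that registry and those three tools might look like in Python; the `File` shape and the placeholder `extract_text` converter are illustrative assumptions rather than our real implementation:

```python
from dataclasses import dataclass

@dataclass
class File:
    id: str            # reference the agent passes to tools, e.g. "f_a1"
    name: str
    data: bytes
    preloaded: bool = False

REGISTRY: dict[str, File] = {}

def load_file(id: str) -> bytes:
    """Load an entire file into the context window."""
    return REGISTRY[id].data

def peek_file(id: str, start: int, stop: int) -> bytes:
    """Load only a byte range of a file into the context window."""
    return REGISTRY[id].data[start:stop]

def extract_file(id: str) -> str:
    """Convert PDFs, PPTs, DOCX, and so on into simplified text."""
    # Stand-in: a real implementation would dispatch to a per-format parser.
    return extract_text(REGISTRY[id].data)

def extract_text(data: bytes) -> str:
    # Placeholder converter; swap in pypdf, python-docx, etc. as appropriate.
    return data.decode("utf-8", errors="replace")

def files_metadata() -> str:
    """Render the metadata block injected into every prompt."""
    return "Files:\n" + "\n".join(
        f"- id: {f.id}\n  name: {f.name}\n  size: {len(f.data):,}\n"
        f"  preloaded: {str(f.preloaded).lower()}"
        for f in REGISTRY.values()
    )
```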
This approach was quick to implement, and provides significantly more control to the agent to navigate a wide variety of scenarios involving large files. It’s also a good example of how the “glue layer” between LLMs and tools is actually a complex, sophisticated application layer rather than merely glue.
How is this working?
This has worked well. In particular, one of our internal workflows is oriented around giving feedback on documents attached to a ticket, in comparison to other similar, existing documents. That workflow simply did not work at all prior to this approach, and now works fairly well without workflow-specific support for handling these sorts of large files, because the `large_files` skill handles that in a reusable fashion without workflow authors being aware of it.
What's next?
Generally, this feels like a stand-alone set of functionality that doesn’t require significant future investment, but there are three places where we will need to continue building:
- Until we add sub-agent support, our capabilities are constrained. In many cases, the ideal scenario of dealing with a large file is opening it in a sub-agent with a large context window, asking that sub-agent to summarize its contents, and then taking that summary into the primary agent’s context window.
- It seems likely that `extract_file` should be modified to return a referencable, virtual `file_id` that is used with `peek_file` and `load_file` rather than returning contents directly. This would make for a more robust tool even when extracting from very large files. In practice, extracted content has always been quite compact.
- Finally, operating within an AWS Lambda requires pure Python packages, and pure Python is not very fast at parsing complex XML-derived document formats like DOCX. We could solve this by adding a layer to our Lambda with the `lxml` dependencies in it, and at some point we might.
Altogether, a very helpful extension for our internal workflows.
Building an internal agent: Adding support for Agent Skills
When Anthropic introduced Agent Skills, I was initially a bit skeptical of the problem they solved (can't we just use prompts and tools?), but I've subsequently come to appreciate them, and have explicitly implemented skills in our internal agent framework. This post talks about the problems skills solve, how the engineering team at Imprint implemented them, how well they've worked for us, and where we might take them next.
This is part of the Building an internal agent series.
What problem do Agent Skills solve?
Agent Skills are a series of techniques that solve three important workflow problems:
- use progressive disclosure to more effectively utilize the constrained context windows
- minimize conflicting or unnecessary context in the context window
- provide reusable snippets for recurring problems, so that individual workflow creators don't each have to solve things like Slack formatting or dealing with large files
All three of these problems initially seemed insignificant when we started building out our internal workflows, but once the number of internal workflows reached into the dozens, all three became difficult to manage. Without reusable snippets, I lost the leverage to improve all workflows at once, and without progressive disclosure the agents would get a vast amount of irrelevant content that could confuse them, particularly when it came to things like inconsistencies between Markdown and Slack's mrkdwn formatting language, both of which are important to different tools used by our workflows.
How we implemented Agent Skills
As a disclaimer, I recognize that it's not necessary to implement Agent Skills yourself, as you can integrate with e.g. Claude's API support for Agent Skills. However, one of our design decisions is to remain largely platform agnostic, such that we can switch across model providers, and consequently we decided to implement skills within our framework.
With that out of the way, we started implementing by reviewing the Agent Skills documentation at agentskills.io, and cloning their Python reference implementation skills-ref into our repository to make it accessible to Claude Code.
The resulting implementation has these core features:
- Skills are in the `skills/` directory of our repository, with each skill consisting of its own sub-directory with a `SKILL.md`
- Each skill is a Markdown file with metadata along these lines:

  ```
  ---
  name: pdf-processing
  description: Extract text and tables...
  metadata:
    author: example-org
    version: "1.0"
  ---
  ```

- The list of available skills, including their description from metadata, is injected into the system prompt at the beginning of each workflow, and the `load_skills` tool is available to the agent to load the entire file into the context window.
- Updated workflow configuration to optionally specify required, allowed, and prohibited skills to modify the list of exposed skills injected into the system prompt (see the sketch after this list). My guess is that requiring specific skills for a given workflow is a bit of an anti-pattern, "just let the agent decide!", but it was trivial to implement and the sort of thing that I could imagine is useful in the future.
- Used the Notion MCP to retrieve all the existing prompts in our prompt repository, identify existing implicit skills in the prompts we had created, write those initial skills, and identify which Notion prompts to edit to eliminate the now-redundant sections of their prompts.
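As a sketch of how the frontmatter parsing and per-workflow filtering might look in Python: this assumes PyYAML for the frontmatter, and the `required`/`allowed`/`prohibited` field names mirror the description above rather than our exact schema:

```python
from pathlib import Path
import yaml  # assumes PyYAML is available

def parse_skill(skill_dir: Path) -> dict:
    text = (skill_dir / "SKILL.md").read_text()
    # Frontmatter sits between the first two '---' markers.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)
    return {"name": meta["name"], "description": meta["description"], "body": body}

def exposed_skills(all_skills: list[dict], config: dict) -> list[dict]:
    # required/allowed/prohibited are the per-workflow knobs described above.
    allowed = set(config.get("allowed") or [s["name"] for s in all_skills])
    required = set(config.get("required") or [])
    prohibited = set(config.get("prohibited") or [])
    names = (allowed | required) - prohibited
    return [s for s in all_skills if s["name"] in names]
```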
Then we shipped it into production.
How they’ve worked
Humans make mistakes all the time. For example, I’ve seen many dozens of JIRA tickets from humans that don’t explain the actual problem they are having. People are used to that, and when a human makes a mistake, they blame the human. However, when agents make a mistake, a surprising percentage of people view it as a fundamental limitation of agents as a category, rather than thinking that, “Oh, I should go update that prompt.”
Skills have been extremely helpful as the tool to continue refining down these edge cases
where we’ve relied on implicit behavior because specifying the exact behavior was simply overwhelming.
As one example, we ask that every Slack message end with a link to the prompt that drove the
response. That always worked, but the details of the formatting would vary in an annoying, distracting
way: sometimes it would be the equivalent of `[title](link)`, sometimes a bare `link`, sometimes `[link](link)`.
With skills, it is now (almost always) consistent, without anyone thinking to include those instructions
in their workflow prompts.
Similarly, handling large files requires a series of different tools that benefit from In-Context Learning (aka ICL, which is a fancy term for including a handful of examples of correct and incorrect usage), which absolutely no one is going to add to their workflow prompt but is extremely effective at improving how the workflow uses those tools.
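For example, the ICL section of such a skill might look something like this hypothetical excerpt (not our actual `large_files` skill):

```
## Examples

Correct: extract a PDF before reading it
  extract_file("f_a1")   # returns compact text and tables

Incorrect: load a 5MB PDF directly
  load_file("f_a1")      # floods the context window with raw content
```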
For something that I was initially deeply skeptical about, I now wish I had implemented skills much earlier.
Where we might go next
While our skills implementation is working well today, there are a few opportunities I’d like to take advantage of in the future:
- Add a `load_subskill` tool to support files in `skills/{skill}/*` beyond the `SKILL.md`. So far, this hasn't been a major blocker, but as some skills get more sophisticated, the ability to split varied use-cases into distinct files would improve our ability to use skills for progressive disclosure
- One significant advantage that Anthropic has over us is their sandboxed Python interpreter, which allows skills to include entire Python scripts to be specified and run by tools. For example, a script for parsing PDFs might be included in a skill, which is extremely handy. We don't currently have a sandboxed interpreter handy for our agents, but this could, in theory anyway, significantly cut down on the number of custom skills we need to implement. At a minimum, it would do a much better job at operations that require reliable math, versus relying on the LLM to do its best at math-y operations.
I think both of these are actually pretty straightforward to implement. The first is just a simple feature that Claude could implement in a few minutes. The latter feels annoying to implement, but could also be done in less than an hour by running a second Lambda running Node.js with Pyodide, and exposing access to that Lambda as a tool. It's just so inelegant for a Python process to call a Node.js process to run sandboxed Python that I haven't done it quite yet.
That's all for now! Hope to hear your thoughts on Twitter at @lethain!