What I learned from building Wattle Coding

Wattle is one of my recent side projects. It is a coding agent, or harness, that supports multiple LLM providers. There are already many great coding agents like Claude Code, Codex, and Pi, so why do I still want to build Wattle? My reasons are:

I am a fan of Python, and I want to see if I can build a good coding agent with it
I want to see what I will encounter while building a coding agent from scratch
I want a hackable agent that I can tweak to see the effect of each change
And finally, it is fun

Wattle currently only has YOLO mode, which is the only way I use coding agents. At the start it was built with Codex, and later Wattle became able to develop itself, which is very exciting. I hope you like it. Give it a try and let me know what you think.

I spent most of my time on the TUI

At the start I used Textual to build the terminal UI (TUI) for Wattle. But it had a lot of issues. For example, when I scrolled up, the earlier history of the original terminal could not be shown, and the TUI looked really separated from the terminal. I wanted the whole experience to be seamless: when the user scrolls up and down, it should feel just like scrolling the original terminal. So I decided to let Codex build a native TUI instead of using a third-party library.

Since Codex can write a lot of code, I actually didn’t write a single line of code myself, but I spent a lot of time testing the TUI by using Wattle to write code, read the files in a repo, and so on. So my role was more like a product manager than a developer.

This is actually very interesting. A few months ago, I first had the idea of writing a coding agent, and I spent a lot of time reading the Anthropic API documentation, trying to understand, for example, how to keep the thinking block. Then I gave up because the progress was too slow. This time I changed my approach completely: I would not read the API documentation or write any code, and would let Codex do both. I would only read when I really needed to understand something. The development process has been much faster. Most of the time, all I need to do is test the final user experience. And that is it.

Prompt caching can be surprising

When I tested Wattle on one task in Terminal-Bench, I thought it was a good idea to add some visibility into the caching performance. The result was very surprising. Wattle with the Codex provider only had a 20% cache hit rate. Compared to the ~80% cache hit rate of the OpenAI completion providers (DeepSeek, Minimax, etc.), this was shockingly low. After some investigation, the cause turned out to be that I didn’t send the session id in the API request.
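That kind of visibility can be as simple as dividing cached prompt tokens by total prompt tokens over a session. Here is a minimal sketch, assuming each API response reports both counts. The `Usage` class, its field names, and the sample numbers are made up for illustration; they are not Wattle's actual code or any provider's response format.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one API response (hypothetical field names)."""
    input_tokens: int   # all prompt tokens sent in the request
    cached_tokens: int  # the part of the prompt served from the provider's cache


def cache_hit_rate(usages: list[Usage]) -> float:
    """Return cached prompt tokens divided by total prompt tokens."""
    total = sum(u.input_tokens for u in usages)
    cached = sum(u.cached_tokens for u in usages)
    return cached / total if total else 0.0


# Made-up numbers for three turns of one session.
session = [Usage(1000, 0), Usage(1500, 1000), Usage(2500, 1500)]
print(f"cache hit rate: {cache_hit_rate(session):.0%}")
```

Tracking this per provider is what made the gap between 20% and ~80% visible in the first place.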

Frontier models are really good now, and the harness is there to steer the flow

At the start the system prompt was very simple, like “You are wattle, a coding agent that can write code, read files, etc.” With this system prompt, Wattle could already do many things well. But that was not good enough. For example, when I was building subagents, the primary agent would jump in and do tasks that had already been assigned to a subagent instead of waiting. That was the time to add an instruction to the system prompt to stop this from happening. The system prompt grows this way: when something in the experience with Wattle looks weird, I check whether the system prompt needs to change to make it right.

It is important to have a feedback loop to enable more automation

For example, when I built the TUI at the beginning, there were only unit tests, which GPT-5.5 has been trained to write. But that is NOT enough since a TUI is about UI. Often Codex said it had finished the task very well, but when I tested it in the TUI, it was not working. So I let Codex write some TUI regression tests that exercise each change through a PTY. This has reduced that kind of issue a lot. I think that as developers, we now need to think about how to design better feedback loops to guide the model to do the right thing. This will help make our lives much easier.
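To give a feel for what a PTY-based regression test looks like, here is a minimal sketch using only the Python standard library on a Unix-like system. The `run_in_pty` helper is a name I made up for illustration, and the child process is a stand-in one-liner rather than Wattle itself; the real tests Codex wrote are more involved.

```python
import os
import pty
import select
import subprocess
import sys


def run_in_pty(argv: list[str], timeout: float = 5.0) -> str:
    """Run argv attached to a pseudo-terminal and return what it printed.

    timeout is the number of seconds to wait for each chunk of output.
    """
    master_fd, slave_fd = pty.openpty()
    proc = subprocess.Popen(argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd)
    os.close(slave_fd)  # the child keeps its own copy of the slave end
    chunks = []
    try:
        while True:
            ready, _, _ = select.select([master_fd], [], [], timeout)
            if not ready:
                break  # no output within the timeout
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                break  # on Linux, reading after the child exits raises EIO
            if not data:
                break  # other platforms signal end of output with an empty read
            chunks.append(data)
    finally:
        os.close(master_fd)
        proc.kill()  # no-op if the child has already exited
        proc.wait()
    return b"".join(chunks).decode(errors="replace")


# The child sees a real terminal, which a plain pipe-based unit test cannot provide.
child = [sys.executable, "-c", "import sys; print('tty' if sys.stdout.isatty() else 'pipe')"]
assert "tty" in run_in_pty(child)
```

A real regression test would start the TUI the same way, write keystrokes to the master side, and assert on the screen output that comes back.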

Coding agents can write code, how about developers?

The frontier models have improved a lot at coding from the end of last year to now. And coding agents like Codex and Claude Code have been the main driver of revenue for those labs. As many people have said, building something is much easier than before, so what matters more is thinking about what to build. I think what to build includes both the functionality and the user experience. Developers can work on many projects at the same time, depending on how well they can context switch.

You may think that coding agents can help a lot, so you might have more free time. But that may not be true because you may want to do more things.

I think at the end of the day, having a good life is more important than doing more things.

Benchmark Evaluation

I used Terminal-Bench 2.0 for evaluation. I would say this is a good benchmark, but it also has its own limitations. For example, real use of a coding agent is often multi-turn instead of just sending a prompt and waiting for the answer. Coding agents can make mistakes, and a human can point them out.

I tested 4 models: GPT-5.5, DeepSeek V4 Pro, Minimax 2.7 and Kimi 2.6. I report only the first 2 because I have a subscription for Codex and DeepSeek is crazily cheap. For Minimax and Kimi, I added $10 for their APIs, but my balance soon dropped to 0.

The metrics are below. For each task, I ran 5 trials. To reproduce, please check wattle-tbench-harness.

Provider	Model	Trials	Accuracy
codex	gpt-5.5	445	68.3% ± 7.8
deepseek	deepseek-v4-pro	445	37.5% ± 8.6
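An accuracy figure with a ± margin like the ones above can be computed in several ways. Here is a minimal sketch of one common approach, assuming a pass rate per task and a normal-approximation 95% interval across tasks. This is an illustration only and may differ from what wattle-tbench-harness actually does; the function name and the sample data are made up.

```python
import math
import statistics


def accuracy_with_interval(per_task_passes: list[list[bool]]) -> tuple[float, float]:
    """Return (mean accuracy, 95% half-width) in percent, computed across tasks."""
    # Pass rate of each task over its repeated trials.
    rates = [sum(trials) / len(trials) for trials in per_task_passes]
    mean = statistics.mean(rates)
    # Standard error of the mean across tasks (sample standard deviation).
    sem = statistics.stdev(rates) / math.sqrt(len(rates))
    return 100 * mean, 100 * 1.96 * sem


# Made-up results: 4 tasks, 5 trials each.
results = [
    [True, True, True, True, True],
    [True, True, True, False, False],
    [False, False, False, False, False],
    [True, True, True, True, False],
]
mean, half_width = accuracy_with_interval(results)
print(f"{mean:.1f}% ± {half_width:.1f}")
```

With only a handful of tasks the interval is very wide, which is one reason to run every task several times and report the margin next to the accuracy.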

Final Thoughts

This has been a great journey. Without the help of coding agents, this project would not have been possible since, for example, I knew nothing about building a TUI.

On user-model interaction: advanced models can do many things, and we will probably see better ways for humans to use models in the future.