- update 2025-07-18: add context engineering
LLM Agent
LLM agents are a really interesting topic, and I want to talk about them in this post. It is a consolidation of my recent reading, listening, thinking, and hands-on experience. I hope you will find it useful. I have tried my best to list the sources, and I highly recommend reading them.
Agent, workflow and in-context learning
Speaking of LLM applications, in my mind there are three main categories:
- in-context learning
- workflow
- agent
In-context learning is the most common way to use an LLM, whether zero-shot or few-shot. Examples include Chain-of-Thought prompting[1] and content moderation from Anthropic.
For workflows and agents, I mainly referred to Building effective agents[2] from Anthropic. Workflows and agents are both agentic systems, but they differ:
Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Workflows offer predictability and consistency for well-defined tasks.
Agents, on the other hand, are systems where LLMs dynamically decide which tools to use and take control of how to complete the task. Agents are the better option when flexibility and model-driven decision-making are needed at scale.
So in my words, an LLM-based agent is a new way to solve certain types of problems. LLMs are good at natural language understanding, reasoning, and generation, so natural language can serve as the interaction interface between user and agent. For example, when you want to trade stocks in a traditional app like Robinhood, you need to click through many buttons; with an agent, you can just say “sell all my Nvidia stocks”. And equipped with external tools, the LLM's capabilities can be greatly enhanced. But it is not a silver bullet: agentic systems often trade latency and cost for better task performance[2]. For facial recognition, you should use a well-trained specialized model instead of an agent.
What problems are agents good at solving? Ones where the user's demand is dynamic or vague, and may not be well defined:
- dynamic: the user can ask anything, e.g. deep research
- vague: the user may not define the problem well, e.g. “I want to know about the history of the company”; the agent needs to ask follow-up questions to get the details
What breakthroughs in LLMs enable agents?
- reasoning ability
- OpenAI's Deep Research model is a fine-tuned o3
- tool use ability
- missing-information handling: Claude Opus is better than Claude Sonnet here, see example
- longer context length
- Claude's latest models support 200K tokens, Gemini 1 million, GPT-4.1 1 million
- improved instruction-following ability
- The prompt-tuning techniques referenced in the Claude tool use course are no longer necessary and can be skipped now; I verified this with Claude 3.5 Haiku and 3.7 Sonnet.
Interaction
LLMs are really good at natural language understanding, reasoning, and generation, which makes them a perfect unit for natural-language interaction between the agent system and the user (a minimal loop sketch follows the list below):
- natural language as the interface between the agent system and the user
- the LLM handles information processing, reasoning, planning, and tool use
- the LLM synthesizes the results from tools or sub-agents and responds to the user in natural language
- for simple tasks, the LLM can respond to the user directly from its own knowledge
- for complex tasks, the LLM synthesizes the results from tools or sub-agents, then responds to the user in natural language
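To make this loop concrete, here is a minimal sketch. Both `llm` and `run_tool` are hypothetical stand-ins: `llm(messages)` should wrap a real model API call and return either a final answer or a tool request, and `run_tool` is a tool dispatcher.

```python
# Minimal agent interaction loop (a sketch, not a production design).
# `llm` and `run_tool` are hypothetical: `llm(messages)` returns either
# {"type": "answer", "text": ...} or {"type": "tool_call", "name": ..., "arguments": ...}.
def agent_loop(user_message: str, llm, run_tool, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm(messages)
        if reply["type"] == "answer":
            # Simple task: answer directly from the model's own knowledge,
            # or synthesize the tool results already accumulated in context.
            return reply["text"]
        # Complex task: run the requested tool and feed the result back
        # so the next LLM call can reason over it.
        result = run_tool(reply["name"], reply["arguments"])
        messages.append({"role": "assistant", "content": f"calling {reply['name']}"})
        messages.append({"role": "user", "content": f"tool result: {result}"})
    return "Stopped: step limit reached."
```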
Tools
For how to use tools with an LLM, I highly recommend reading the tool use course from Anthropic.
Tools can be roughly categorized into four types:
- Read:
- keyword search (wikipedia package), web search
- file parsing (e.g. PDF, DOCX, etc.)
- up-to-date information, real-time data
- Agent browser: a general, powerful tool for agents, e.g. BrowserBase, grasp
- semantic search, retrieval (vector database, RAG)
- Memory: it is important for personalization and knowledge consolidation
- Mem0: memory solution for user interaction history
- Agent Workflow Memory[6] and SkillWeaver[7]: knowledge consolidation for web navigation
- Computation:
- Calculator
- Python code execution
- Write:
- Communication: send email, send message, etc
- Web automation (e.g. Playwright, Selenium, etc.): click, fill, etc.
MCP: Remote tools
MCP is short for Model Context Protocol. The official documentation says it is designed to provide context to the LLM, but I currently think of it as a way to provide remote tools to the LLM. The tools mentioned above are mainly local tools; for a web search tool, for example, we usually need to write our own code. If someone has already developed a tool for LLMs and serves it as an MCP server, we can use it as a remote tool. MCP is a protocol, and it standardizes many things.
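As a concrete taste of the protocol, here is a minimal MCP server sketch using the official Python SDK's FastMCP helper; the `word_count` tool is a made-up placeholder.

```python
# Minimal MCP server sketch using the official Python SDK (pip install mcp).
# Any MCP-capable client (e.g. Claude Desktop) can then call this tool remotely.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```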
System prompt for tool use
For Claude, when the user specifies tools, a special system prompt is constructed to guide the LLM in using them:
In this environment you have access to a set of tools you can use to answer the user's question.
String and scalar parameters should be specified as is, while lists and objects should use JSON format. Note that spaces for string values are not stripped. The output is not expected to be valid XML and is parsed with regular expressions.
Here are the functions available in JSONSchema format:
Best practices for tool descriptions
- Provide extremely detailed descriptions
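For instance, a tool definition in the format the Anthropic Messages API expects might look like the sketch below. The `get_stock_price` tool is hypothetical; the point is that the description covers what the tool does, when to use it, and what each parameter means.

```python
# A hypothetical tool definition with a deliberately detailed description,
# in the Anthropic input_schema format.
get_stock_price_tool = {
    "name": "get_stock_price",
    "description": (
        "Retrieves the current stock price for a given ticker symbol. "
        "The ticker must be a valid symbol for a publicly traded company "
        "on a major US exchange like NYSE or NASDAQ. Use this tool whenever "
        "the user asks about a current price; do not guess prices from "
        "memory. Returns the latest trade price in USD."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {
                "type": "string",
                "description": "Stock ticker symbol, e.g. 'NVDA' for Nvidia.",
            }
        },
        "required": ["ticker"],
    },
}
```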
Context engineering
I added this section after reading Context Engineering for AI Agents: Lessons from Building Manus, which is full of insights. As an agent works on a task, its context grows significantly, and the post discusses why the context needs to be managed, from several perspectives.
Cost saving
Prompt caching is a very good way to save inference cost, and it relies on keeping the prefix of the prompt stable. The Manus team mention this as the most important technique in context engineering.
Following on from that, they recommend masking tools instead of removing them. Tool descriptions usually sit near the beginning of the prompt, in or right after the system prompt, so removing or updating tool descriptions makes the prompt prefix unstable and invalidates the cache.
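A minimal sketch of the caching side, assuming the Anthropic Messages API: the `cache_control` marker tells the API to cache everything up to that point, so the system prompt (and any tool definitions) must stay byte-identical across calls.

```python
# Prompt caching sketch with the Anthropic Messages API.
# The system prompt forms the stable prefix; only the conversation changes.
import anthropic

client = anthropic.Anthropic()

# Stable prefix: must stay byte-identical across calls for the cache to hit.
SYSTEM_PROMPT = "You are a research agent. <tool definitions would go here>"

def call(messages):
    return client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }],
        messages=messages,  # append-only conversation; only this part grows
    )
```

Constraining which tool the model may pick for a given step can then happen per call, for example via the API's tool_choice parameter, or via logit masking if you control decoding as Manus does, leaving the tool definitions, and thus the cached prefix, untouched.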
Reliably compress the context
Using the file system to hold context is a good way to compress it: large observations go into files, only pointers to them stay in the context, and the content is easy to fetch back when needed.
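A sketch of the idea: dump a large observation to a file and keep only a short preview plus the path in the message history.

```python
# Sketch: offload a large observation to the file system and keep a pointer.
from pathlib import Path

WORKSPACE = Path("agent_workspace")
WORKSPACE.mkdir(exist_ok=True)

def offload(name: str, content: str, preview_chars: int = 200) -> str:
    """Write a large observation to disk; return a short stand-in for the context."""
    path = WORKSPACE / name
    path.write_text(content)
    # The context keeps only a preview plus the path; the agent can re-read
    # the file with a read tool whenever it actually needs the details.
    return f"[saved to {path} ({len(content)} chars)] {content[:preview_chars]}..."
```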
Add the goal every once in a while
The idea is to re-state the goal within the model's recent attention span, so the agent keeps the goal in mind and does not forget it.
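A minimal sketch, assuming a plain message-list agent loop: every few steps, re-append the goal so it sits in the model's recent attention span.

```python
# Sketch: re-inject the goal every few steps so it stays in recent attention.
def maybe_recite_goal(messages: list, goal: str, step: int, every: int = 5) -> None:
    if step > 0 and step % every == 0:
        messages.append({
            "role": "user",
            "content": f"Reminder of the overall goal: {goal}",
        })
```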
Keep the mistake in the context
Agents make mistakes and correct themselves. It is better to keep the mistake traces in the context, so the agent can learn from them and avoid making the same mistake again.
Introduce controlled randomness
For repeated tasks, the observations and actions can become very similar and settle into a fixed pattern. Introducing controlled randomness helps the agent break the pattern and explore the space.
Evaluation
Evaluation is very important. Some benchmarks:
- BrowseComp is a benchmark for search agents. The core idea behind its construction is questions whose answers are hard to find but easy to verify.
- SWE-bench is a benchmark for testing an AI system's ability to solve GitHub issues
- TheAgentCompany is a benchmark covering multiple kinds of tasks
Personalization
It is great if users can customize the LLM application based on their own preferences and history data.
System prompt perspective
As mentioned in AI Horseless Carriages, the AI-drafted email feature is badly designed because users cannot easily adjust the writing to their own style. One way to improve this is to let users customize the system prompt.
Personal history data perspective
This is very straightforward. If the LLM can memorize the user's interaction history, it can provide a more personalized service instead of making users repeat themselves again and again.
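A minimal sketch combining both perspectives: a user-editable style block plus recalled interaction history, assembled into the system prompt. The memory strings would come from something like Mem0; the function names here are hypothetical.

```python
# Sketch: personalize via a user-editable system prompt plus stored history.
def build_system_prompt(base: str, user_style: str, memories: list[str]) -> str:
    parts = [base]
    if user_style:
        # User-customized instructions, e.g. "Write emails tersely, no greetings."
        parts.append(f"User style preferences:\n{user_style}")
    if memories:
        # Relevant facts recalled from past interactions (e.g. via Mem0).
        parts.append("Known about this user:\n" + "\n".join(f"- {m}" for m in memories))
    return "\n\n".join(parts)
```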
Performance boost: Multi-agent
I mainly referred to How we built our multi-agent research system from Anthropic.
- Different agents acting in different roles, e.g. nanoDeepResearch; these agents share the same context
- Different agents solving different aspects of the same problem simultaneously; each has its own context, which sidesteps the context-length limitation of a single LLM (a sketch of this pattern follows the list)
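Here is a sketch of that second pattern: a lead agent fans out subtasks to workers that each start from a fresh context, then synthesizes their findings. The `llm(messages)` call is a hypothetical stand-in for a real model API.

```python
# Sketch: parallel sub-agents with separate contexts; the lead agent
# synthesizes their results. `llm(messages)` is a hypothetical model call.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(llm, subtask: str) -> str:
    # Each sub-agent starts from a fresh context, sidestepping the
    # context-length limit of a single conversation.
    messages = [{"role": "user", "content": subtask}]
    return llm(messages)

def research(llm, question: str, subtasks: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(lambda t: run_subagent(llm, t), subtasks))
    synthesis_prompt = (
        f"Question: {question}\n\nFindings from sub-agents:\n"
        + "\n\n".join(findings)
        + "\n\nSynthesize a final answer."
    )
    return llm([{"role": "user", "content": synthesis_prompt}])
```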
Productionization
I mainly referred to How we built our multi-agent research system from Anthropic.
- Handle failure modes
- Regular checkpoints, so a system failure does not force a restart from the beginning (a minimal sketch follows this list)
- Let the LLM know the tool is not working and let it adapt
- Adding full production tracing for debugging
- Rainbow deployments to avoid disrupting running agents
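Checkpointing can be as simple as persisting the agent state after every step so a crash resumes mid-task rather than from scratch. This is a minimal sketch of the idea, not Anthropic's actual implementation.

```python
# Sketch: checkpoint the agent state after each step so a failure
# resumes from the last step instead of restarting the whole task.
import json
from pathlib import Path

CHECKPOINT = Path("agent_checkpoint.json")

def save_checkpoint(messages: list, step: int) -> None:
    CHECKPOINT.write_text(json.dumps({"step": step, "messages": messages}))

def load_checkpoint() -> tuple[list, int]:
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
        return state["messages"], state["step"]
    return [], 0  # no checkpoint yet: start fresh
```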
Limitation
- Context length: although some SOTA models have 1-million-token context windows, this is still not enough for some cases.
- Number of tools: how many tools can be offered is limited by the context length.
- The LLM's raw ability, e.g. directly generating SQL to query/update a database instead of relying on a specific tool for each specific task
- Higher cost than in-context learning or a workflow: the task's value should warrant the cost.
Miscellaneous
Building Effective Agents[2]
- Agentic system: workflow and agent
- workflow: systems where LLMs and tools are orchestrated through predefined code paths
- agent: systems where LLMs dynamically decide what tools to use and take control of how to complete the task
- Considerations on agentic system
- Agentic systems often trade latency and cost for better task performance
- workflows offer predictability and consistency for well-defined tasks
- agents are the better option when flexibility and model-driven decision-making are needed at scale
- For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough
- Building block: the augmented LLM, i.e. an LLM augmented with tools, retrieval, and memory
- Workflow
- Prompt chaining: when the task can be broken down into multiple steps and each LLM call takes the previous step's output as input, e.g. writing a marketing plan and then translating it into different languages (a sketch of this pattern follows the list)
- Routing: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm. e.g. customer support ticketing system.
- Parallelization: when the task can be broken down into parallel subtasks, or you want to run the same task multiple times to get diverse outputs
- Orchestrator-workers workflow: similar to parallelization, but the subtasks are not predefined; the orchestrator dynamically decides which subtasks to run and which workers to use.
- Evaluator-optimizer workflow: an evaluator LLM evaluates the output of a generator LLM, and the generator produces the next version of the output based on the evaluator's feedback.
- Agent: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path.
- Key to success of any LLM feature: measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.
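As an illustration of the simplest pattern above, prompt chaining, here is a minimal sketch using the marketing-plan example; `llm(prompt)` is a hypothetical single-shot model call returning text.

```python
# Sketch of the prompt-chaining workflow: each step's output feeds the
# next step's input along a predefined code path.
def marketing_plan_pipeline(llm, product_brief: str, languages: list[str]) -> dict:
    plan = llm(f"Write a one-page marketing plan for:\n{product_brief}")
    # Optional gate between steps: a predefined programmatic check.
    if "target audience" not in plan.lower():
        plan = llm(f"Revise this plan to explicitly name the target audience:\n{plan}")
    # Final step: translate the approved plan into each requested language.
    return {
        lang: llm(f"Translate this marketing plan into {lang}:\n{plan}")
        for lang in languages
    }
```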
Why can an LLM application be a business?
- Platform companies (e.g. Google, ByteDance, Meta): connect people -> ads
- LLM application companies: solve the problem -> users pay
Some real world products
References
[1]: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903, 2023. https://arxiv.org/abs/2201.11903
[2]: Anthropic. Building effective agents. https://www.anthropic.com/engineering/building-effective-agents
[3]: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580, 2023. https://arxiv.org/abs/2303.17580
[4]: Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv:2304.09842, 2023. https://arxiv.org/abs/2304.09842
[5]: Zhiruo Wang, Daniel Fried, and Graham Neubig. TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks. arXiv:2401.12869, 2024. https://arxiv.org/abs/2401.12869
[6]: Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent Workflow Memory. arXiv:2409.07429, 2024. https://arxiv.org/abs/2409.07429
[7]: Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills. arXiv:2504.07079, 2025. https://arxiv.org/abs/2504.07079