AI Agents: Models that Do Real Work

Arcus Team
June 6, 2024


With the advent of Large Language Models (LLMs), there’s been a major transformation in models’ ability to reason over text and multimodal data. While LLMs are incredibly powerful, they expose a limited interface for doing real work – they produce outputs and predictions given their inputs, but are fundamentally disconnected from the outside world.

To leverage LLMs to do real work, LLM-powered agents have been gaining widespread attention, across both research and real-world applications. More generally, an “agent” is an entity that can perceive an environment, or state, and autonomously take actions to achieve a specific set of goals. LLMs present incredible potential for building agents that are able to reason, reflect, and take actions – the core of any LLM-based agent is a system that orchestrates underlying LLMs to gather information and make decisions on what actions to take and how. 

LLM-based agents now have the potential to do real work across many domains, but best practices and agent capabilities are shifting fast. This article covers the existing landscape of agents, both what’s tried and true and what’s coming next, as well as what we're excited about at Arcus – covering some of the latest developments around concepts like multi-agent frameworks and longer-term memory that we’re most interested in when it comes to making AI do real work.

Tried and Tested Frameworks

Tool Use

Tools are a crucial part of making agents useful in real-world applications. Tools allow LLMs to interact with the external world and perform actions and operations. Some examples of tools include calling functions, using external APIs, and navigating the web. Given a user’s prompt and a spec of available tools, LLMs with tool-calling capabilities can recommend one or many tools to use in order to best follow the provided instructions.

An LLM’s ability to accurately determine which (if any) tools to use, and with what arguments, for a given user’s prompt, is crucial for leveraging LLMs for building effective agents. This has led to the creation of benchmarks like the Berkeley Function Calling Leaderboard, which compares various LLMs’ ability to call appropriate tools.

Researchers have also developed LLMs that specialize in tool use, such as Gorilla, which is a fine-tuned variant of LLaMa-7B that has enhanced tool calling capabilities. As of the time of this post, Gorilla outperforms other powerful models out-of-the-box on tool use, such as GPT-4o and Claude 3 Sonnet, at a fraction of the latency and cost.

The ability of LLMs to recommend which tools to use enables us to build a simple agent workflow. For a user-provided goal and set of available tools, a simple workflow is as follows:

  1. Ask the LLM to decide to either use a tool or answer the user’s prompt directly.
  2. If the LLM specifies a tool to use, execute the tool. Provide context on the result of the tool execution to the LLM and repeat step 1.
  3. If the LLM doesn’t specify a tool and instead gives a text response, conclude the loop.

The LLM decides how and when to use a tool and keeps going (potentially responding and using the set of tools multiple times) until it feels that it has gotten the desired result.


ReAct is one of the most well-established LLM-based agent frameworks used today. ReAct achieves improved decision making and reduced hallucinations relative to other common LLM reasoning methods such as chain-of-thought prompting

The core idea behind ReAct is to use LLMs to synergistically weave reasoning traces and task specific actions, which leads to improved performance. The actions allow the LLM to utilize external tools and data, observe the outcome of those actions, and make more informed decisions in response, while reasoning traces allow the LLM to break down problems into tractable steps and handle them more reliably.

At each step of execution, a ReAct agent prompts the LLM to not only propose the next action to take, but also provide thoughts to describe its own reasoning for why a certain action should be taken. This “step-by-step” thinking approach has been shown to enforce that the LLMs decisions are logically consistent, reducing the likelihood of illogical actions.

Active Research Areas

Multi-agent frameworks 

Several of the cutting edge areas of LLM research explore frameworks for multi-agent systems. These frameworks take advantage of multiple autonomous agents that work together to accomplish a more complex goal. Multi-agent systems can improve performance for complex tasks by breaking up decision-making into smaller problems and distributing these sub-problems across multiple agents, where an “agent” is a task-specific, single instance of an LLM. 

Many of today’s multi-agent frameworks, such as MetaGPT, hard-fit a multi-agent system to a specific workflow by specifying a set of agent roles that should each carry out a specific task. MetaGPT for example takes the task of software engineering, creating multiple agents with specific roles such as an Architect agent, Coding agent, and Testing agent. Each agent has a specialized set of actions such as the Testing agent having a WriteUnitTest action, and a specialized set of instructions on how it should behave. For a given goal, MetaGPT then provides a workflow for how the agents collaborate together. This approach, while potentially valuable in narrow domains, makes it hard to find a generalized principle for how to apply multi-agent frameworks to new problems.

Another more generalized multi-agent framework is AutoGen, which allows multiple agents to converse with each other, each with specific capabilities and roles. AutoGen provides abstractions that support intuitive conversation between agents, so that the framework can be extended to more novel use cases, such as multi-agent coding or playing chess. Tactically, AutoGen abstractions allow any agent to talk to another agent, while supporting agent customization. For example, one agent can be equipped with tools to call external APIs, while another agent can be used as a proxy for human input, taking on the role of providing feedback and making sure that the first agent stays on track. However, developing more robust generalized frameworks for multi-agent systems is an active research area, and today’s multi-agent systems aren’t often reliable enough to productionize for doing real work.

Longer term memory 

For LLMs to interact with external environments as agents, they often need to have some form of additional learning to handle more specialized tasks – taking in information from previous episodes to inform their actions the next time they’re tasked to solve a problem. Incumbent options provide mostly fine-tuning, which is an expensive process, and reinforcement learning, which requires a large amount of specific labeled training data. 

One recently proposed framework for solving the memory problem in LLM-based agents is Reflexion, which utilizes self-reflection via language, rather than traditional parameters such as weights. Reflexion utilizes an evaluator LLM that provides feedback on generated responses from the actor LLM, who takes in short-term and long-term memory to generate the responses. Advances in the memory problem of LLMs such as Reflexion will further improve the capabilities of agent frameworks to learn over time.over time.



The frameworks and approaches described above are just a few of the strides made in agent research. While agents have come a long way, there is still so much room for improvement. In the next part of this blog post series, we’ll describe some of the active research areas that we’re most interested in for how you take multi-agent frameworks into the real world to practically solve real-world problems, and some of the ideas we think a lot about at Arcus!

We spend a lot of our time working on frontier, real-world, enterprise applications of LLMs, AI and Agents. If that’s as exciting to you as it is to us, please reach out. We are hiring across the board and we’d love to meet you – check out our careers page or reach out to us at!

What’s a Rich Text element?

The rich text element allows you to create and format headings, paragraphs, blockquotes, images, and video all in one place instead of having to add and format them individually. Just double-click and easily create content.

Static and dynamic content editing

A rich text element can be used with static or dynamic content. For static content, just drop it into any page and begin editing. For dynamic content, add a rich text field to any collection and then connect a rich text element to that field in the settings panel. Voila!

How to customize formatting for each rich text

Headings, paragraphs, blockquotes, figures, images, and figure captions can all be styled after a class is added to the rich text element using the "When inside of" nested selector system.
Interested in what Arcus can do for your AI applications?
get early access
Want to get in touch? Reach out at!