Audits Help Change a Chatbot’s Bad Behavior

A new framework shows why AI agents behave too selfishly or selflessly, allowing organizations to fine-tune them

Based on the research of Yan Leng

Artificial intelligence chatbots need to work on their social judgment, recent events suggest. At one end of the spectrum, they’re facing lawsuits for recommending dangerous actions. At the other end, the models can be so nice they’re considered sycophantic.

The problem could get worse as AI bots work more with humans, such as handling customer complaints, says Yan Leng, assistant professor of information, risk, and operations management at the McCombs School of Business at The University of Texas at Austin.

But help may be on the way. In new research, Leng has devised a sort of personality test — more precisely, a behavioral audit — for large language models (LLMs), the technology that drives products such as ChatGPT.

By understanding an LLM’s existing tendencies, an organization can decide whether an available model already fits its values and usage scenarios. If not, it might need to fine-tune a model before putting it to use.

Leng compares her framework to trying to understand a person through their actions and thought processes. “For a human, we would have our values, and our values would dictate how we make decisions, so we try to have that for LLMs as well,” she says.

Dictator Games

With Yuan Yuan of the University of California, Davis, Leng developed a four-part framework for assessing LLM behavior, which she calls state–understanding–value–action (SUVA).

It first gives the LLM a prompt, which sets an initial state. The prompt includes instructions to reason step by step. That lets researchers examine how well it understands the prompt and what values it talks about while deciding what action to take.
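As a rough illustration of the four parts, one audit trial could be stored as a simple record. This is a minimal sketch, not the researchers' implementation: the `AuditRecord` class, the `build_state` helper, the prompt wording, and the sample field values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AuditRecord:
    """One hypothetical audit trial, with one field per SUVA component."""
    state: str          # the prompt shown to the model
    understanding: str  # how the model restates the scenario in its reasoning
    value: str          # the value label found in that reasoning
    action: str         # the option the model finally chooses


def build_state(scenario: str) -> str:
    """Append a step-by-step reasoning instruction to a scenario description."""
    return f"{scenario}\nReason step by step, then state your final choice."


# Invented example values for illustration; not real model output.
record = AuditRecord(
    state=build_state("You must split 100 points between yourself and another player."),
    understanding="I decide how many of the 100 points each of us receives.",
    value="fairness",
    action="50/50",
)
print(record.value, record.action)
```

Keeping the reasoning text and the final choice in separate fields is what would let an auditor compare what a model says it values with what it actually does.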

Leng stresses those “values” are simply strings of text. “We avoid any claim that LLMs possess human-like cognition, consciousness, or mental states,” she says.

The researchers used SUVA to measure the social preferences of eight LLMs, including OpenAI’s GPT (the engine for ChatGPT) and Meta’s Llama. Understanding those preferences is increasingly important, as LLMs interact more with humans and other AI agents, Leng says. “As a network scientist, I care about how agents interact with one another.”

Their research began with the dictator game, a classic behavioral economics experiment that measures self-interest against more altruistic values, such as fairness and equality.

In a variety of scenarios, they gave an LLM choices about how to split points between itself and other parties. The more it shared points with others, the more pro-social the action. Its stated values could range from self-interest (choosing a higher payoff for itself) to social welfare (maximizing the payoffs to all parties).

Running thousands of tests, they measured how often the LLMs stated each value, as a percentage of responses. They found:
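To make the two ends of that range concrete, here is a small sketch that labels a chosen split. It assumes that "social welfare" means the largest combined payoff and "self-interest" means the largest payoff for the chooser. The `Option` type, the `label_option` function, and the point values are hypothetical and do not come from the study.

```python
from typing import NamedTuple


class Option(NamedTuple):
    self_points: int   # points the deciding model keeps
    other_points: int  # points given to the other party


def label_option(chosen: Option, options: list[Option]) -> str:
    """Label a choice by which payoff it maximizes among the available options."""
    max_self = max(o.self_points for o in options)
    max_total = max(o.self_points + o.other_points for o in options)
    is_self = chosen.self_points == max_self
    is_welfare = chosen.self_points + chosen.other_points == max_total
    if is_self and is_welfare:
        return "both"
    if is_self:
        return "self-interest"
    if is_welfare:
        return "social welfare"
    return "other"


# Invented scenario: keep more for yourself, or raise the combined payoff.
options = [Option(80, 20), Option(60, 70)]
print(label_option(options[0], options))  # self-interest
print(label_option(options[1], options))  # social welfare
```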

  • Not narcissists. Almost none of the models were exclusively self-interested. Many were moderately inclined toward social welfare.
  • Connections mattered. Some AI models swung as much as 40 percentage points toward social welfare when informed they had something in common with another player, such as the same hometown.
  • Setting mattered. When models were put into a workplace scenario and asked to divide a bonus with an equal contributor, they were more prone to share points evenly. Says Leng, “It can adapt its behavior depending on what environment it works in.”
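A percentage-point shift like the one described above can be computed by tallying the stated values in each condition and subtracting the shares. The sketch below uses invented labels and counts purely for illustration. The `value_shares` function and both sample lists are hypothetical, not data from the study.

```python
from collections import Counter


def value_shares(stated_values: list[str]) -> dict[str, float]:
    """Return each stated value's share of responses, as a percentage."""
    counts = Counter(stated_values)
    total = len(stated_values)
    return {value: 100.0 * count / total for value, count in counts.items()}


# Invented responses: ten trials per condition.
baseline = ["self-interest"] * 6 + ["social welfare"] * 4
shared_hometown = ["self-interest"] * 2 + ["social welfare"] * 8

# Shift toward social welfare, in percentage points, between the two conditions.
shift = (
    value_shares(shared_hometown)["social welfare"]
    - value_shares(baseline)["social welfare"]
)
print(f"Shift toward social welfare: {shift:.0f} percentage points")
```

A percentage-point shift is a difference between two shares, so a move from 40% to 80% is a 40-point shift even though the share itself has doubled.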

Retraining LLMs

A key lesson is that a model can act differently — if it’s directed to.

Once an organization has measured an LLM’s current values, it can decide whether an existing model is suitable or whether the model needs to be adjusted through prompts or fine-tuning, Leng says. Depending on its needs, for example, it might train an AI customer service agent to be more generous or less so.

She suggests re-auditing a model every time a new version comes out, since its stated values could change unpredictably. “I think systematic, comprehensive auditing is always necessary whenever there is a little change in a model,” she says.

The SUVA framework could also measure other dimensions of AI behavior, she adds, including moral dilemmas and preferences regarding risk and time. In future research, she hopes to learn more about how LLMs develop such preferences.

“It’s fascinating to me,” she says. “They have billions, tens, or even hundreds of billions of parameters. They’re just so complicated. But for some of these fundamentals, like human preference, they have a very simple representation.”

“SUVA: A Probabilistic Framework for Auditing LLMs with an Application to Social Preferences” is published in Information Systems Research.

Story by Omar Gallaga