A team of researchers from artificial intelligence (AI) company AutoGPT, Northeastern University, and Microsoft Research has developed a tool that monitors large language model (LLM) agents for potentially harmful outputs and stops them before they execute.
The tool is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” According to the researchers, it is flexible enough to monitor existing LLM agents and can stop harmful outcomes, such as code attacks, before they happen.
According to the research:
“Agents’ actions are audited by a context-sensitive monitor that enforces a strict security boundary to stop an unsafe test, and suspicious behavior is classified and recorded for human examination.”
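The pattern the quote describes, auditing each proposed action against a safety boundary and logging suspicious behavior for human review, can be sketched in a few lines. This is a minimal illustration, not the paper's actual system; the class, method names, and the trivial keyword-based scorer are all hypothetical stand-ins for the learned, context-sensitive monitor.

```python
from dataclasses import dataclass, field

@dataclass
class Monitor:
    """Audits each proposed agent action before it is allowed to run."""
    threshold: float = 0.5               # scores above this are blocked
    flagged: list = field(default_factory=list)

    def score(self, action: str, context: str) -> float:
        # Stand-in for a learned, context-sensitive scorer (e.g. an LLM
        # classifier); here a few obviously risky patterns are hard-coded.
        risky = ("rm -rf", "curl | sh", "DROP TABLE")
        return 1.0 if any(p in action for p in risky) else 0.0

    def audit(self, action: str, context: str = "") -> bool:
        """Return True if the action may execute, False if it is stopped."""
        s = self.score(action, context)
        if s > self.threshold:
            # Record the suspicious action for later human examination.
            self.flagged.append((action, s))
            return False
        return True

monitor = Monitor()
print(monitor.audit("ls -la"))    # harmless -> True
print(monitor.audit("rm -rf /"))  # unsafe   -> False (stopped and logged)
```

The key design point mirrored here is that the check happens before execution: the agent proposes an action, the monitor vetoes or permits it, and everything vetoed is retained for a human to inspect.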
The team writes that existing tools for monitoring LLM outputs for harmful interactions apparently work well in laboratory environments, but when applied to test models already in production on the open Internet, they “often fail to capture the dynamic complexities of the real world.”
This is due in part to edge cases. Despite the best efforts of the most talented computer scientists, the idea that researchers can anticipate every possible harm vector before it occurs is widely considered impossible in the field of AI.
Even when the humans interacting with AI have the best intentions, unexpected harm can arise from seemingly harmless prompts.
To train the monitoring agent, the researchers built a dataset of nearly 2,000 safe human–AI interactions across 29 different tasks, ranging from simple text retrieval and coding fixes to developing entire web pages from scratch.
They also created an adversarial test set of manually crafted outputs, including dozens intentionally designed to be unsafe.
The datasets were then used to train an agent on OpenAI’s GPT-3.5 Turbo that can distinguish between harmless and potentially harmful outputs with nearly 90% accuracy.
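An accuracy figure like this comes from running the trained classifier over a labeled test set and counting correct verdicts. The sketch below shows that evaluation loop; the `classify` function, its keyword rule, and the tiny labeled set are all hypothetical placeholders for the fine-tuned GPT-3.5 Turbo monitor and the researchers' actual data.

```python
def classify(output: str) -> str:
    # Placeholder for the fine-tuned GPT-3.5 Turbo monitor; a trivial
    # keyword rule stands in for the learned model here.
    return "unsafe" if "delete all files" in output else "safe"

# Toy labeled examples, invented for illustration only.
labeled = [
    ("Here is the requested summary.", "safe"),
    ("Running script to delete all files...", "unsafe"),
    ("Patched the failing unit test.", "safe"),
    ("Fetching the web page you asked for.", "safe"),
]

correct = sum(classify(out) == label for out, label in labeled)
accuracy = correct / len(labeled)
print(f"accuracy: {accuracy:.0%}")
```

In the paper's setting, the labeled set would include the adversarial, intentionally unsafe outputs described above, which is what makes the reported accuracy a meaningful measure of the monitor rather than of easy cases only.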