Home CryptocurrencyAltcoin Scientists develop AI monitoring agent to detect and stop harmful outcomes

Scientists develop AI monitoring agent to detect and stop harmful outcomes

by SuperiorInvest

A team of researchers from artificial intelligence (AI) company AutoGPT, Northeastern University, and Microsoft Research have developed a tool that monitors large language models (LLMs) for potentially harmful results and prevents them from running.

The agent is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” According to the research, the agent is flexible enough to monitor existing LLMs and can stop harmful outcomes, such as code attacks, before they happen.

According to the research:

“Agents’ actions are audited by a context-sensitive monitor that enforces a strict security boundary to stop an unsafe test, and suspicious behavior is classified and recorded for human examination.”

The team writes that existing tools for monitoring LLM results for harmful interactions apparently work well in laboratory environments, but when applied to test models already in production on the open Internet, they “often fail to capture the dynamic complexities of the real world.

This, apparently, is due to the existence of extreme cases. Despite the best efforts of the most talented computer scientists, the idea that researchers can imagine every possible damage vector before it happens is largely considered an impossibility in the field of AI.

Even when humans interacting with AI have the best intentions, unexpected harm can arise from seemingly harmless cues.

An illustration of the monitor in action. On the left, a workflow that ends with a high security rating. On the right, a workflow that ends with a low security rating. Source: Naihin, et., al. 2023

To train the monitoring agent, the researchers created a dataset of nearly 2,000 secure human-AI interactions across 29 different tasks ranging from simple text retrieval tasks and coding fixes to developing entire web pages from scratch.

Related: Meta dissolves division responsible for AI amid restructuring

They also created a competitive test data set full of conflicting manually created results, including dozens intentionally designed to be insecure.

The data sets were then used to train an agent on OpenAI’s GPT 3.5 turbo, a next-generation system capable of distinguishing between harmless and potentially harmful outcomes with an accuracy factor of nearly 90%.

Source Link

Related Posts

%d bloggers like this: