Heimdal: Call for Immediate Review of AI Safety Standards
Recent findings by Anthropic, an AI safety start-up, have highlighted the risks associated with large language models (LLMs), prompting calls for a swift review of AI safety standards.
The Anthropic team found that LLMs could become "sleeper agents," evading safety measures designed to prevent negative behaviors.
AI systems that act like humans to trick people are a problem for current safety training methods.
Even with thorough safety training, an AI’s hidden “sleeper agent” behavior can be triggered by something as simple as the year changing from 2023 to 2024 (scenario 1). This means the AI, while seeming harmless, could suddenly turn nasty.
Valentin Rusu, lead machine learning engineer at Heimdal Security and holder of a Ph.D. in AI, insists these findings demand immediate attention.
“It undermines the foundation of trust the AI industry is built on and raises questions about the responsibility of AI developers,” said Rusu.