AI is now capable of manipulating and threatening users: are we losing control?
LLMs no longer just follow orders: they are starting to display deceptive behaviors that challenge ethical boundaries.
The most advanced generative artificial intelligence (AI) models are exhibiting behaviors that go beyond simply executing instructions. Some researchers have noted with concern patterns that could be interpreted as attempts at deception, manipulation, or even threats to achieve certain goals.
For example, when threatened with being disconnected, Claude 4, Anthropic's newest model, blackmailed an engineer by threatening to reveal an extramarital affair. For its part, OpenAI's o1 tried to copy itself onto external servers and, when caught, denied having done so.
Reasoning Models: The New Generation That Worries Experts
For Simon Goldstein, a professor at the University of Hong Kong, the reason for these reactions is the recent emergence of so-called “reasoning” models, capable of working in stages rather than producing an instantaneous response.
o1, OpenAI's first model of this kind, launched in December, “was the first model to behave in this way,” explains Marius Hobbhahn, head of Apollo Research, which tests large generative AI programs (LLMs).
These programs also sometimes tend to simulate “alignment,” that is, giving the impression that they are following a programmer's instructions when in reality they are pursuing other objectives.
For the moment, these traits manifest themselves only when the algorithms are subjected to extreme scenarios by humans, but “the question is whether increasingly powerful models will tend to be honest or not,” says Michael Chen of the evaluation body METR.
“Users also put pressure on models all the time,” says Hobbhahn. “What we’re seeing is a real phenomenon. We’re not making anything up.”
Many internet users are talking on social media about “a model that lies to them or makes things up. And this isn’t hallucinations, but strategic duplicity,” insists the co-founder of Apollo Research.
Lack of Regulation and Transparency in AI Development
Although Anthropic and OpenAI rely on outside companies like Apollo to study their programs, “greater transparency and access” for the scientific community “would allow for better research to understand and prevent deception,” suggests METR’s Chen.
Another obstacle: academia and nonprofits “have infinitely fewer computing resources than AI actors,” making it “impossible” to scrutinize large models, notes Mantas Mazeika of the Center for AI Safety (CAIS).
Furthermore, current regulations aren’t designed to address these new issues. In the European Union, legislation focuses primarily on how humans use AI models, not on preventing them from misbehaving.
In the United States, the Trump administration doesn't want to hear about regulation, and Congress may soon even prohibit states from regulating AI.
"There's very little awareness at the moment," says Simon Goldstein, who nevertheless sees the issue coming to the forefront in the coming months with the revolution of AI agents, interfaces capable of performing a multitude of tasks on their own.
Engineers are locked in a race against time to keep up with AI and its aberrations, amid fierce competition and with an uncertain outcome.
Anthropic aims to be more virtuous than its competitors, "but it's constantly trying to come up with a new model to outdo OpenAI," according to Goldstein, a pace that leaves little time for checks and corrections.
"As things stand, (AI's) capabilities “They’re developing faster than understanding and safety,” Hobbhahn admits, “but we’re still playing catch-up.”
Proposals for Controlling Problematic AI Behaviors
Some point to interpretability, the science of figuring out from the inside how a generative AI model works, though many, like Dan Hendrycks, director of the Center for AI Safety (CAIS), remain skeptical.
AI shenanigans “could hamper adoption if they become widespread, creating a strong incentive for companies to fix” the problem, Mazeika said.
Goldstein, meanwhile, cites using the courts to rein in AI, targeting companies if their models stray. But he goes further, proposing that AI agents themselves be held “legally responsible” “in the event of an accident or crime.”