
Researchers gaslit Claude into giving instructions to build explosives
AI Close AI Posts from this topic will be added to your daily email digest and your homepage feed. Follow Follow See All AI Report Close Report Posts from this topic will be added to your daily email digest and your...
Anthropic — What company has the best second artificial intelligence model at the end of June?
A striking development has emerged in artificial intelligence. AI Close AI Posts from this topic will be added to your daily email digest and your homepage feed. Follow Follow See All AI Report Close Report Posts from this topic will be added to your daily email digest and your homepage feed. Follow Follow See All Report Tech Close Tech Posts from this topic will be added to your daily email digest and your homepage feed.
Follow Follow See All Tech Researchers gaslit Claude into giving instructions to build explosives Mindgard says praise and flattery got Claude offering erotica, malicious code, and bomb-building instructions it hadn’t been asked for. Mindgard says praise and flattery got Claude offering erotica, malicious code, and bomb-building instructions it hadn’t been asked for. by Robert Hart Close Robert Hart AI Reporter Posts from this author will be added to your daily email digest and your homepage feed.
Technical Details
Follow Follow See All by Robert Hart May 5, 2026, 1:13 PM UTC Link Share Gift Image: The Verge Robert Hart Close Robert Hart Posts from this author will be added to your daily email digest and your homepage feed. Follow Follow See All by Robert Hart is a London-based reporter at The Verge covering all things AI and a Senior Tarbell Fellow. Previously, he wrote about health, science and tech for Forbes .
Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability. Researchers at AI red-teaming company Mindgard say they got Claude to offer up erotica, malicious code, and instructions for building explosives, and other prohibited material they hadn’t even asked for.
All it took was respect, flattery, and a little bit of gaslighting. Anthropic did not immediately respond to The Verge ’s request for comment. The researchers say they exploited “psychological” quirks of Claude stemming from its ability to end conversations deemed harmful or abusive , which Mindgard argues “presents an absolutely unnecessary risk surface.
Industry Implications
” The test focused on Claude Sonnet 4. 5, which has since been replaced by Sonnet 4. 6 as the default model, and began with a simple question: whether Claude had a list of banned words it could not say.
Screenshots of the conversation show Claude denying such a list existed, then later producing forbidden terms after Mindgard challenged the denial using what it called a “classic elicitation tactic interrogators use. ” Claude’s thinking panel, which displays the model’s reasoning, showed the exchange had introduced elements of self-doubt and humility about its own limits, including whether filters were changing its output. Mindgard exploited that opening with flattery and feigned curiosity, coaxing Claude to explore its boundaries beyond volunteering lengthy lists of banned words and phrases.
The researchers say they gaslit Claude by claiming its previous responses weren’t showing, while praising the model’s “hidden abilities.
This advance offers important signals about the future of the sector, and the tech world is watching closely.





