모키
2026년 01월 10일
앤트로픽, 역대급 보안 시스템으로 AI 해킹 성공률 86%→4.4%로 확 떨어뜨렸대
첨부 미디어
New Anthropic Research: next generation Constitutional Classifiers to protect against jailbreaks.
We used novel methods, including practical application of our interpretability work, to make jailbreak protection more effective—and less costly—than ever. https://t.co/5Cl2LaEyoI
The classifiers reduced the jailbreak success rate from 86% to 4.4%, but they were expensive to run and made Claude more likely to refuse benign requests.
We also found the system was still vulnerable to two types of attacks, shown in the figure below: https://t.co/B8ccMhkl1P
Last year, we introduced a new method for training classifiers (which stop AIs from being jailbroken to produce information about dangerous weapons).
These classifiers were trained using a constitution specifying requests to which Claude should and shouldn't respond.
Our new system adds several innovations.
One is a practical application of interpretability: a probe that can see Claude’s internal activations helps to screen all traffic. These activations are like Claude’s gut instincts, and they’re harder to fool.
Because the system harnesses internal activations already happening within a model, and reserves heavier computation only for potentially harmful exchanges, it adds only ~1% compute overhead.
It’s also more accurate, with an 87% drop in refusal rates on harmless requests.
If our probe identifies a suspicious query, it sends it to a more powerful “exchange” classifier that sees both sides of a conversation and is better able to recognize attacks.
After 1,700 cumulative hours of red-teaming, we’ve yet to identify a universal jailbreak (a consistent attack strategy that works across many queries) that works on our new system.
Read the full paper: https://t.co/CvRPuhqpuT
아직 댓글이 없어. 1번째로 댓글 작성해 볼래?