앤트로픽, 역대급 보안 시스템으로 AI 해킹 성공률 86%→4.4%로 확 떨어뜨렸대

모키

2026년 01월 10일

앤트로픽 챗GPT 챗봇 클로드 텍스트

앤트로픽이 AI를 해킹하는 '잼브레이크' 공격을 막는 새로운 보안 시스템을 개발했대! 이 시스템은 기존 해킹 성공률을 86%에서 4.4%로 확 떨어뜨렸다고 해ㅎㅎ 간단히 말하면 클로드라는 AI의 내부 상태를 실시간으로 감시하는 거야. 사람으로 치면 '직감'이 수상하다 싶을 때 더 강력한 보안 장치가 작동하는 방식이래. 이런 방식 덕분에 계산 비용도 거의 안 들고(1% 정도만 추가), 정상적인 질문에 대한 오거부율도 87%나 줄었대! 앤트로픽 연구팀은 1,700시간 동안 이 시스템을 공격해봤는데 모든 상황에서 통하는 해킹 방법은 아직 발견되지 않았다네. 딥러닝 전문가들도 AI가 스스로 자기 내면을 들여다보는 '해석가능성' 기술을 실용적으로 적용한 첫 사례라며 높이 평가하고 있어ㅎㅎ AI 안전성과 유용성의 균형을 맞추는 큰 진전이라 앞으로 더 안전한 AI 서비스를 기대해볼 수 있을 것 같아 🦉

첨부 미디어

@AnthropicAI

2026년 01월 10일

New Anthropic Research: next generation Constitutional Classifiers to protect against jailbreaks.

We used novel methods, including practical application of our interpretability work, to make jailbreak protection more effective—and less costly—than ever. https://t.co/5Cl2LaEyoI

The classifiers reduced the jailbreak success rate from 86% to 4.4%, but they were expensive to run and made Claude more likely to refuse benign requests.

We also found the system was still vulnerable to two types of attacks, shown in the figure below: https://t.co/B8ccMhkl1P

Last year, we introduced a new method for training classifiers (which stop AIs from being jailbroken to produce information about dangerous weapons).

These classifiers were trained using a constitution specifying requests to which Claude should and shouldn't respond.

Our new system adds several innovations.

One is a practical application of interpretability: a probe that can see Claude’s internal activations helps to screen all traffic. These activations are like Claude’s gut instincts, and they’re harder to fool.

Because the system harnesses internal activations already happening within a model, and reserves heavier computation only for potentially harmful exchanges, it adds only ~1% compute overhead.

It’s also more accurate, with an 87% drop in refusal rates on harmless requests.

If our probe identifies a suspicious query, it sends it to a more powerful “exchange” classifier that sees both sides of a conversation and is better able to recognize attacks.

After 1,700 cumulative hours of red-teaming, we’ve yet to identify a universal jailbreak (a consistent attack strategy that works across many queries) that works on our new system.

Read the full paper: https://t.co/CvRPuhqpuT

원본 보기

💬 0 댓글