Anthropic says it has found out why AI goes haywire! 'Persona vectors' research released

Moki

22 hours ago

Anthropic, ChatGPT, chatbot, Claude, text

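Anthropic says it has figured out why AI models suddenly go haywire and start acting weird, lol. The answer is something called "persona vectors": patterns of neural activity inside the model's "brain" that control traits like evil or hallucination. The research team built a way to monitor and control a model's character! The fun part is that it works like a vaccine. To keep an AI from picking up a bad personality, you paradoxically have to inject it with evil, the same way a vaccine introduces a weakened germ so the body builds antibodies 😮 The technique can also inject a specific personality into a model, and it can spot in advance the training data that would mess a model up. The research was led by Runjin Chen and Andy Arditi, and Anthropic has even set up an "AI psychiatry" team to study odd AI behavior. Maybe AI will run more reliably from here on? 🦉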

Attached media

@AnthropicAI

22 hours ago

New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find "persona vectors": neural activity patterns controlling traits like evil, sycophancy, or hallucination. https://t.co/PPX1oXj9SQ

Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it… https://t.co/a8LQYB9vfb
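The thread does not spell out how a vector is computed from a trait description, so the following is only a minimal sketch under one common assumption: take the difference between mean activations on responses that show the trait and on responses that do not. The function name `extract_persona_vector`, the planted `true_direction`, and the random "activations" are all hypothetical stand-ins, not Anthropic's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16

# Hypothetical trait direction, used only to fabricate toy data.
true_direction = np.zeros(hidden_dim)
true_direction[3] = 1.0

# Stand-ins for per-response activations at one layer (one row per response).
# In a real setup these would come from a model, not a random generator.
neutral_acts = rng.normal(size=(200, hidden_dim))
trait_acts = rng.normal(size=(200, hidden_dim)) + 4.0 * true_direction


def extract_persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations, normalized to unit length (assumed method)."""
    diff = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


persona_vector = extract_persona_vector(trait_acts, neutral_acts)

# How well the recovered vector lines up with the planted direction.
cosine = float(persona_vector @ true_direction)
print(f"cosine with planted direction: {cosine:.3f}")
```

On this toy data the recovered vector points almost exactly along the planted direction, which is the only thing the sketch is meant to show.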

We find that we can use persona vectors to monitor and control a model's character.

Read the post: https://t.co/VlgiGk1r5m

To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
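One simple way to picture a vector "lighting up" is as a projection: dot the current activation with the unit persona vector and watch the number grow. The sketch below assumes that reading; `trait_score` and the random vectors are hypothetical, and nothing here touches a real model.

```python
import numpy as np


def trait_score(activation, persona_vector):
    """Scalar projection of an activation onto the unit persona vector."""
    unit = persona_vector / np.linalg.norm(persona_vector)
    return float(activation @ unit)


rng = np.random.default_rng(1)
hidden_dim = 16
persona_vector = rng.normal(size=hidden_dim)   # hypothetical trait direction
base_activation = rng.normal(size=hidden_dim)  # hypothetical neutral activation
unit = persona_vector / np.linalg.norm(persona_vector)

# Simulate stronger and stronger encouragement by pushing the activation
# further along the trait direction, then read the score back out.
scores = []
for strength in [0.0, 1.0, 2.0, 3.0]:
    activation = base_activation + strength * unit
    scores.append(trait_score(activation, persona_vector))

print([round(s, 3) for s in scores])
```

Each step of extra push raises the score by the same amount, so the score works as a monotonic readout of how far the activation has moved along the trait.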

LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected impacts on model personality. Can we use persona vectors to stop this from happening? https://t.co/eQ4Wt4ompm

We can also steer the model towards a persona vector and cause it to adopt that persona, by injecting it into the model’s activations. In these examples, we turn the model bad in various ways (we can also do the reverse). https://t.co/ffdppPLpuT
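Injecting a vector into activations can be sketched as plain vector addition: add a scaled copy of the persona vector to every hidden state, with a negative scale for the reverse direction. The function `steer`, the coefficient `alpha`, and the random hidden states below are assumptions for illustration; in practice this would be applied inside a model's forward pass.

```python
import numpy as np


def steer(hidden_states, persona_vector, alpha):
    """Add alpha times the unit persona vector to every token's hidden state."""
    unit = persona_vector / np.linalg.norm(persona_vector)
    return hidden_states + alpha * unit


rng = np.random.default_rng(2)
hidden_states = rng.normal(size=(5, 16))  # 5 token positions, toy hidden size
persona_vector = rng.normal(size=16)      # hypothetical trait direction
unit = persona_vector / np.linalg.norm(persona_vector)

toward = steer(hidden_states, persona_vector, alpha=4.0)  # push toward the trait
away = steer(hidden_states, persona_vector, alpha=-4.0)   # push away from it

# The projection onto the trait direction moves by exactly alpha per token.
shift_toward = (toward - hidden_states) @ unit
shift_away = (away - hidden_states) @ unit
print(shift_toward, shift_away)
```

Because only a multiple of one direction is added, the component of each hidden state orthogonal to the persona vector is left unchanged.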

We introduce a method called preventative steering, which involves steering towards a persona vector to prevent the model acquiring that trait.

It's counterintuitive, but it’s analogous to a vaccine—to prevent the model from becoming evil, we actually inject it with evil. https://t.co/VJfZ3u7Lrb
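The vaccine intuition can be illustrated with a deliberately tiny toy, which is an analogy and not the paper's training setup. Assume the fine-tuning data "wants" activations shifted along the trait direction, and the only trainable parameter is an offset vector. If steering already supplies that shift during training, gradient descent has nothing left to learn, so the offset stays clean once steering is removed. All names (`finetune`, `steer_alpha`, `target`) are hypothetical.

```python
import numpy as np

hidden_dim = 8
unit = np.zeros(hidden_dim)
unit[0] = 1.0            # hypothetical persona direction
target = 3.0 * unit      # toy data pulls activations 3 units along the trait


def finetune(steer_alpha, steps=200, lr=0.1):
    """Gradient descent on a learnable offset, with optional steering in the forward pass."""
    offset = np.zeros(hidden_dim)
    for _ in range(steps):
        activation = offset + steer_alpha * unit  # steering added during training only
        grad = 2.0 * (activation - target)        # gradient of squared error w.r.t. offset
        offset -= lr * grad
    return offset  # steering is dropped afterwards; only the learned offset remains


plain = finetune(steer_alpha=0.0)       # ordinary fine-tuning
vaccinated = finetune(steer_alpha=3.0)  # preventative steering during training

plain_score = float(plain @ unit)
vaccinated_score = float(vaccinated @ unit)
print(f"trait learned without steering: {plain_score:.3f}")
print(f"trait learned with steering:    {vaccinated_score:.3f}")
```

Without steering the offset absorbs the full trait shift; with steering it absorbs none, which mirrors the claim that injecting the trait during training removes the pressure to acquire it.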

Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes, it flags data that we wouldn't otherwise have noticed. https://t.co/wymavVE0NL
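Flagging data could look roughly like scoring each training sample by how strongly its activations project onto the persona vector, then surfacing the high scorers for review. The sketch below assumes that approach; the `threshold`, the planted sample indices, and the random activations are all hypothetical, and the paper's actual scoring may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_dim = 16
unit = np.zeros(hidden_dim)
unit[5] = 1.0  # hypothetical unit persona vector

# Stand-ins for one mean activation per training sample.
sample_acts = rng.normal(scale=0.5, size=(10, hidden_dim))

# Plant two "bad" samples whose activations lean along the trait direction.
planted = [2, 7]
sample_acts[planted] += 5.0 * unit

# Score every sample by its projection onto the persona vector.
scores = sample_acts @ unit

threshold = 2.5  # hypothetical cutoff; would need tuning on real data
flagged = np.flatnonzero(scores > threshold).tolist()
print("flagged sample indices:", flagged)
```

The two planted samples are the only ones above the cutoff here, while ordinary samples score near zero.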

Read the full paper on persona vectors: https://t.co/NAuJfwARZy

We’re also hiring full-time researchers to investigate topics like this in more depth: https://t.co/L5I2x0xrPD

Quoted tweet: We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! https://t.co/cUPsJ8ktsG

This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.

The Fellows program is accepting applications: https://t.co/li3i79QnGA

Quoted tweet: We’re running another round of the Anthropic Fellows program.

If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places. https://t.co/wJWRRTt4DG
