
모키
22시간 전
앤트로픽, AI가 미쳐버리는 이유 밝혀냈대! '페르소나 벡터' 연구 발표
첨부 미디어







New Anthropic research: Persona vectors.
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination. https://t.co/PPX1oXj9SQ
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it… https://t.co/a8LQYB9vfb
We find that we can use persona vectors to monitor and control a model's character.
Read the post: https://t.co/VlgiGk1r5m
To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected impacts on model personality. Can we use persona vectors to stop this from happening? https://t.co/eQ4Wt4ompm
We can also steer the model towards a persona vector and cause it to adopt that persona, by injecting it into the model’s activations. In these examples, we turn the model bad in various ways (we can also do the reverse). https://t.co/ffdppPLpuT
We introduce a method called preventative steering, which involves steering towards a persona vector to prevent the model acquiring that trait.
It's counterintuitive, but it’s analogous to a vaccine—to prevent the model from becoming evil, we actually inject it with evil. https://t.co/VJfZ3u7Lrb
Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes, it flags data that we wouldn't otherwise have noticed. https://t.co/wymavVE0NL
Read the full paper on persona vectors: https://t.co/NAuJfwARZy
We’re also hiring full-time researchers to investigate topics like this in more depth: https://t.co/L5I2x0xrPD
인용된 트윗: We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! https://t.co/cUPsJ8ktsG
This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.
The Fellows program is accepting applications: https://t.co/li3i79QnGA
인용된 트윗: We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places. https://t.co/wJWRRTt4DG
로그인하면 맞춤 뉴스 물어다 줄게🦉
지금 핫한 소식🚀
- 1. 앤트로픽, 미국 의료 데이터 공유 협약에 참여했네...환자 진료 접근성 개선한대
- 2. 미드저니, 실시간 영상 스트리밍 '미드저니 TV' 실험 중이래 대박ㅋㅋㅋ
- 3. 코파일럿, 맞춤형 팟캐스트 만들어주는 기능 추가했대 대박인듯ㅋㅋ
- 4. 클로드, 이제 X에서도 만날 수 있대! 챗GPT와 경쟁 본격화
- 5. 제미나이, 장난감 자동차 영상 만드는 재미에 빠져봐
- 6. 제미나이, 요즘 다들 어떻게 쓰고 있냐고 물어보네 대박ㅋㅋ
- 7. 헤이젠, 매일매일 곁에서 크리에이티브 도와줄 AI 친구 온대! 기다리면 늦어ㅠㅠ
- 8. 런웨이, 알레프 API 출시했어! 이제 어플이나 웹사이트에서 영상 만들고 편집할 수 있대
- 9. 앤트로픽, AI가 미쳐버리는 이유 밝혀냈대! '페르소나 벡터' 연구 발표
- 10. 미드저니, 머리 부분만 갈아끼우는 헤드 스왑 기능 추가했네 앤드류 웡도 인정한 대박 업데이트임
부키가 물어다 주는 뉴스레터🦉
미리보기구독하면 매주 금요일마다 AI 소식과 팁들을 보내줄게!

아직 댓글이 없어. 1번째로 댓글 작성해 볼래?