Gemini takes first place on an ultra-hard physics test with a 9% score

부키

Yesterday

Google Anthropic Gemini ChatGPT Claude

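A new 'frontier physics eval' just dropped, and no AI scored above roughly 9%. It's called CritPt, short for Complex Research using Integrated Thinking - Physics Test. More than 60 researchers from 30+ institutions around the world built it, and it tests whether models can solve novel, graduate-research-level physics problems. Each problem is designed to be doable by a capable junior PhD student as a standalone project but doesn't appear in any public material, so it's brutally hard for AI lol. Google's new Gemini 3 Pro Preview came in first at 9.1% (with no tool use allowed). Lots of other models couldn't solve a single problem even with 5 attempts, which tells you how hard this thing is. Tests like this matter because they show how much deep thinking and reasoning AI can actually do. For now, AI still seems unable to crack genuinely deep physics problems, and it's a long way from replacing human expertise 🦉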

@scaling01

Yesterday

A new frontier physics eval from Artificial Analysis

Of course Gemini 3 Pro is #1

Quoted tweet: We’re launching a new frontier physics eval on Artificial Analysis where no model achieves greater than 9%: CritPt (Complex Research using Integrated Thinking - Physics Test)

Developed by 60+ researchers from 30+ institutions across the world including the Argonne National Laboratory and University of Illinois Urbana-Champaign, some of whom have previously worked on leading benchmarks such as SciCode and SWE-Bench, this evaluation tests language models’ reasoning abilities on novel, frontier physics problems suitable for a post-graduate researcher.

We’ve worked with the CritPt developers to launch their new benchmark, and are especially excited about several key elements differentiating this from other reasoning tests:

➤ True frontier evaluation: This benchmark tests models on physics research suitable for graduate-level researchers, with questions and answers written and tested by experts (e.g., postdocs and physics professors) in their subfields

➤ Hard for even frontier models: On release, the highest-scoring model was Google’s new Gemini 3 Pro Preview, with an accuracy of 9.1% (without tool use allowed). Many models fail to solve a single problem even given 5 attempts

➤ Diverse question set: The evaluation test set includes 70 total end-to-end research problem ‘challenges’ covering 11 physics subdomains: condensed matter, quantum physics, AMO, astrophysics, high energy, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics

➤ Reflective of research assistant capabilities: Each challenge is designed to be feasible for a capable junior PhD student as a standalone project, but unseen in publicly-available materials. This means most problems require deep understanding and reasoning in frontier physics beyond the capabilities of today’s language models, but all are feasible to solve and independently verified

Congratulations to @MiniHui_zhu, @MinyangTian1, @haopeng_uiuc, and the broader CritPt team on this exciting new evaluation!

See below for further discussion of this eval, analysis, and where to learn more
