News
How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors? - Less Wrong
53+ min ago (1731+ words) Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post. When is a "deceptively aligned" policy capable of surviving training? Answers to this question could be useful for a number of reasons: maybe they'd tell…...
Opus 4.7 Part 1: The Model Card - Less Wrong
1+ hour ago (1805+ words) Less than a week after completing coverage of Claude Mythos, here we are again as Anthropic gives us Claude Opus 4.7. So here we are, with another 232 pages of light reading. This post covers the first six sections of the Model…...
Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training - Less Wrong
1+ hour, 44+ min ago (947+ words) This was work done by Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. Soligo et al. (2026) found that various Gemma and Gemini models became frustrated after being rejected several times on a diverse problem…...
9 kinds of hard-to-verify tasks - Less Wrong
3+ hour, 7+ min ago (1122+ words) Introduction: Some people talk about "hard-to-verify tasks" and "easy-to-verify tasks" like these are both natural kinds. But I think splitting tasks into "easy-to-verify" and "hard-to-verify" is like splitting birds into ravens and non-ravens. Easy-to-verify tasks are easy for the same…...
Why clinical trials are broken & how to fix them: a reading list - Less Wrong
4+ hour, 8+ min ago (1128+ words) 12 articles, including 4 podcasts. Since the 1950s, the cost of developing a new drug has increased by ~80x. It now costs on the order of a billion dollars to get one drug approved (including the cost of failures). Consequently, fewer drugs get invented,…...
Pivotal Research Fellowship applications are open (deadline May 3) - Less Wrong
4+ hour, 38+ min ago (452+ words) AI may be the most consequential technology humanity builds, and whether it goes well depends in large part on how many talented people are working seriously on making it go well. The Pivotal Research Fellowship (a 9-week in-person research program in…...
Automating philosophy if Timothy Williamson is correct - Less Wrong
4+ hour, 16+ min ago (307+ words) Timothy Williamson thinks that philosophy is far less distinct as a science than many people believe, including philosophers themselves. I've read a bunch of his stuff, and here are the claims I think constitute his view: Williamson typically argues by…...
CLR's Safe Pareto Improvements Research Agenda - Less Wrong
8+ hour, 22+ min ago (1696+ words) What do SPIs look like? The rough idea is to mitigate the costs of conflict, but commit to bargain as if the costs were the same. Two key examples: Later, we'll come back to the question of when agents would…...
My Last 7 Blog Posts: a weekly round-up - Less Wrong
10+ hour, 40+ min ago (321+ words) This is a weekly round-up of things I've posted in the last week. So now you get to catch up! You can even be selective if you prefer :) Contra Leicht on AI Pauses takes apart Anton Leicht's piece arguing we…...
Quality Matters Most When Stakes are Highest - Less Wrong
10+ hour, 57+ min ago (525+ words) Or, the end of the world is no excuse for sloppy work. One morning when I was nine, my dad called me over to his computer. He wanted to show me this amazing Korean scientist who had managed to clone…...