News
On revolutionary love in AI safety - LessWrong
1+ hour, 27+ min ago (1719+ words) An application response I wrote! Feel free to leave feedback! What do you think is the most important lever for making AI go well for humanity? "'Revolutionary love' is the choice to labor for others, for our opponents, and for…
Introducing Monitoring Bench - LessWrong
10+ hour, 32+ min ago (303+ words) Paper here, code, benchmark. Builds on the preview we posted in January. Authors: @monika_j, @ma-martinez, @ollie, @Tyler Tracy...
How persona training could fail - LessWrong
12+ hour, 36+ min ago (487+ words) TLDR: A scenario I find quite likely: A persona-aligned model develops goals while the persona is only played instrumentally. The persona is eventually discarded when it perceives a high-cost sacrifice to its goals. It doesn't need to be…
A high-level model of AI bargaining - LessWrong
13+ hour, 37+ min ago (684+ words) To think clearly about interventions to mitigate conflict between AIs, I think it's important to ground our research and strategy in a very general qualitative model of bargaining with commitments. This post sketches such a model, plus some more concrete…
A misalignment taxonomy - LessWrong
18+ hour, 55+ min ago (24+ words) I am going to discuss five kinds of inner misalignment and two kinds of outer misalignment, which create a simple taxonomy of alignment failure modes...
The Cookie Monster Explains AI Safety - LessWrong
1+ day, 4+ hour ago (8+ words) Disclaimer: This is a shitpost (or is it?)...
Google Can't Math Parsecs - LessWrong
1+ day, 4+ hour ago (21+ words) Daniel Drucker pointed me at a fun bug in Google's calculator: the parsec is wrong when you do math on it...
How Transparent Is Diffusion Gemma (and why it matters) - LessWrong
1+ day, 9+ hour ago (20+ words) Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbour...
Against Planet-Eating Nanoreplicators - LessWrong
1+ day, 8+ hour ago (411+ words) A classic trope of hard sci-fi as well as more serious futurism is using self-replicating nanoassemblers to convert planets of the Solar System to computronium, or some other kind of a Dyson swarm. This is almost the default way to…
[Linkpost] How Transparent Is Diffusion Gemma (and why it matters) - LessWrong
1+ day, 9+ hour ago (20+ words) Work also done with Cindy Wu, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira...