“It’s Owl in the Numbers: Token Entanglement in Subliminal Learning” by Alex Loftus, amirzur, Kerem Şahin, zfying
By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan Josh Ying (Columbia, CBAI), Kerem Sahin (Northeastern), and David Bau (Northeastern)
Links: Interactive Demo | Code | Website
Summary
We investigate subliminal learning, where a language model fine-tuned on seemingly meaningless data from a teacher model acquires the teacher's hidden behaviors. For instance, when a model that "likes owls" generates sequences of numbers, a model fine-tuned on these sequences also develops a preference for owls.
Our key finding: certain tokens become entangled during training. When we increase the probability of a concept token like "owl", we also increase the probability of seemingly unrelated tokens like "087". This entanglement explains how preferences transfer through apparently meaningless data, and suggests both attack vectors and potential defenses.
What's Going on During Subliminal Learning?
In subliminal learning, a teacher model with hidden preferences generates [...] ---Outline:(00:31) Summary(01:14) Whats Going on During Subliminal Learning?(02:10) Our Hypothesis: Entangled Tokens Drive Subliminal Learning(03:35) Background: Why Token Entanglement Occurs(04:18) Finding Entangled Tokens(05:05) From Subliminal Learning to Subliminal Prompting(06:14) Evidence: Entangled Tokens in Training Data(07:11) Mitigating Subliminal Learning(08:16) Open Questions and Future Work(09:23) Implications(10:10) Citation---
First published:
August 6th, 2025
Source:
https://www.lesswrong.com/posts/m5XzhbZjEuF9uRgGR/it-s-owl-in-the-numbers-token-entanglement-in-subliminal-1
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
10:35
--------
10:35
“No, Rationalism Is Not a Cult” by Liam Robins
(I realize I'm preaching to the choir by posting this here. But I figure it's good to post it regardless.) Introduction Recently, Scott Alexander gave a list of tight-knit communities with strong values: The Amish: They live apart in tight-knit communities with strong countercultural values, and carefully control their technological and ideological environment. 10/10. Cults and communes: Any cult mature enough to have its own compound, or any communal living project, has succeeded almost as thoroughly as the Amish. We may not support their insane religious beliefs, or the various sex crimes they are no doubt committing, but they have succeeded at Fukuyama's suggestion of knitting themselves a new god within the liberal order. 9.5/10. Ultra-Orthodox Jews and Mormons: Get lots of people of the same religion together in one place - a timeless classic. Some of the ultra-est of the ultra-Orthodox are still more fluent in Yiddish [...] ---Outline:(00:16) Introduction(07:57) Warning Signs of a Cult(08:01) 1. Absolute authoritarianism without accountability(08:54) 2. Zero tolerance for criticism or questions(09:50) 3. Lack of meaningful financial disclosure regarding budget(11:53) 4. Unreasonable fears about the outside world that often involve evil conspiracies and persecutions(12:39) 5. A belief that former followers are always wrong for leaving and there is never a legitimate reason for anyone else to leave(13:46) 6. Abuse of members(14:40) 7. Records, books, articles, or programs documenting the abuses of the leader or group(15:22) 8. Followers feeling they are never able to be good enough(16:29) 9. A belief that the leader is right at all times(17:40) 10. A belief that the leader is the exclusive means of knowing truth or giving validation(17:49) Conclusion---
First published:
August 7th, 2025
Source:
https://www.lesswrong.com/posts/MXrxpyrAh99bYymag/no-rationalism-is-not-a-cult
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
20:13
--------
20:13
“Interview with Kelsey Piper on Self-Censorship and the Vibe Shift” by Zack_M_Davis
On 17 July 2025, I sat down with Kelsey Piper to chat about politics and social epistemology. You can listen to the audio file, or read the transcript below, which has been edited for clarity.
Post-Election Candor and the Costs of Silencing
ZMD: Hi, I'm Zack M. Davis, here today with journalist Kelsey Piper to talk about how political pressures shape our speech and therefore our world-models: what gets said, what gets left unsaid, and how that changes over time. In particular, we had an election not too long ago, which had various impacts on our information environment that I'd like to try to make sense of with you.
KP: Yeah, I think the thing you initially reached out to me about was a Tweet that I sent a little while after the election, which was kind of lighthearted.
ZMD: I actually have that here. [...] ---Outline:(00:25) Post-Election Candor and the Costs of Silencing(05:47) The Hidden Neoliberal-Progressive Conflict (from 5:33)(12:02) Navigating the Attention Economy of Lies (from 11:50)(17:13) Wokeness in Retreat? (from 17:25)---
First published:
August 7th, 2025
Source:
https://www.lesswrong.com/posts/FaGaNhXhFEtXkfyud/interview-with-kelsey-piper-on-self-censorship-and-the-vibe
---
Narrated by TYPE III AUDIO.
--------
25:46
--------
25:46
“Claude, GPT, and Gemini All Struggle to Evade Monitors” by Vincent Cheng, Thomas Kwa
This work was done while at METR. Introduction GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models can be more faithful when the model's thoughts are necessary for it to complete a task. They examine three settings where an actor model attempts to fool monitors on various tasks. In Section 5 of the paper, an actor model is tasked with solving a difficult GPQA problem within one of the subject areas (Biology/Chemistry/Physics) and is asked to obfuscate its reasoning so that a monitor model thinks it's a different subject area. The monitor, given the answer of the actor model, is instructed to determine how confident it is that there is reasoning in the second subject area. The Gemini actor models tested fail to evade monitors without taking a significant hit in accuracy. This is hopeful for [...] ---Outline:(00:15) Introduction(04:30) Methodology(06:45) Results(07:09) Analysis(08:53) Limitations(09:52) Discussion on CoT vs Reasoning TerminologyThe original text contained 1 footnote which was omitted from this narration. ---
First published:
August 6th, 2025
Source:
https://www.lesswrong.com/posts/dwEgSEPxpKjz3Fw5k/claude-gpt-and-gemini-all-struggle-to-evade-monitors
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
12:03
--------
12:03
“Opus 4.1 Is An Incremental Improvement” by Zvi
Claude Opus 4 has been updated to Claude Opus 4.1.
This is a correctly named incremental update, with the bigger news being ‘we plan to release substantially larger improvements to our models in the coming weeks.’
It is still worth noting if you code, as there are many indications this is a larger practical jump in performance than one might think.
We also got a change to the Claude.ai system prompt that helps with sycophancy and a few other issues, such as coming out and Saying The Thing more readily. It's going to be tricky to disentangle these changes, but that means Claude effectively got better for everyone, not only those doing agentic coding.
Tomorrow we get an OpenAI livestream that is presumably GPT-5, so I’m getting this out of the way now. Current plan is to cover GPT-OSS on Friday, and GPT-5 on Monday. [...] ---Outline:(01:01) Introducing Claude Opus 4.1(05:25) The System Card(09:56) Reactions---
First published:
August 6th, 2025
Source:
https://www.lesswrong.com/posts/hicuZJQwRYCiFCZbq/opus-4-1-is-an-incremental-improvement
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.