How Large Language Models De-Anonymize Online Posts at Scale
We like to think a username is a mask. A Reddit handle. A Hacker News alias. Just words on a screen.
But large language models are very good at spotting patterns we don’t see. They don’t need a real name. They just need fragments — writing style, movie preferences, small biographical hints — and they start connecting dots.
Researchers from ETH Zurich and the Berkeley-associated MATS research fellowship demonstrated how this works in practice. They trained a system to analyze posts from anonymous platforms like Reddit and cross-reference them with leaked or public datasets tied to real identities.
The result? Anonymous posts weren’t so anonymous anymore.
Not because someone hacked an account.
Because the text itself gave people away.
Reddit Movie Posts Matched to Real Netflix Accounts
Cross-Platform Data Linking Using AI
One experiment focused on movie-related Reddit posts. Researchers collected user activity across several film-focused subreddits. Then they compared that data to information from a Netflix data leak.
The AI model looked for overlapping signals — recommendations, preferences, phrasing patterns.
Here’s what happened:
- With a single movie recommendation, 3.1% of anonymous users could be matched to a specific named Netflix account with 90% accuracy.
- With five to nine recommendations, the match rate jumped to 23.2%.
- With more than ten recommendations, 48.1% of users could be identified.
- And 17% of users were identified with near-total confidence.
That’s almost half.
And these weren’t detailed confessions. Just movie suggestions. Casual posts. The kind you’d write in 30 seconds.
It shows how even small data points, when combined, create a fingerprint.
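The paper doesn't publish its matching pipeline, but the core idea, overlapping signals linking an anonymous account to a named one, can be sketched in a few lines. Everything below (the function names, the leaked-account dictionary, the 0.5 threshold) is hypothetical and for illustration only:

```python
# Toy sketch of cross-dataset linking, NOT the researchers' actual
# system: match an anonymous poster to a leaked account by overlap
# between the movie titles each one mentions.

def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of movie titles, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_match(reddit_movies: set, leaked_accounts: dict, threshold: float = 0.5):
    """Return the named account whose movie set overlaps most with the
    anonymous posts, or None if no account clears the threshold."""
    scored = [(jaccard(reddit_movies, movies), name)
              for name, movies in leaked_accounts.items()]
    score, name = max(scored)
    return name if score >= threshold else None

# Hypothetical data for illustration only.
leak = {
    "alice@example.com": {"Heat", "Ronin", "Collateral", "Thief"},
    "bob@example.com":   {"Amélie", "Delicatessen", "Ratatouille"},
}
anon_posts = {"Heat", "Collateral", "Thief", "Blade Runner"}

print(best_match(anon_posts, leak))  # → alice@example.com
```

Real linkage attacks weight rare titles far more heavily than popular ones: an obscure film mentioned in both datasets is much stronger evidence than a blockbuster everyone has seen.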
Hacker News Posts Connected to LinkedIn Profiles
Identifying Real Identities Through Professional Clues
Another experiment connected anonymous Hacker News accounts with confirmed LinkedIn profiles.
Users shared general information over time — things like:
- Their field of work
- Their approximate age
- Their city
- Tools they use
Nothing that screams, “Here’s my full identity.”
But piece those clues together, and the model could identify real individuals with a high degree of certainty.
This isn’t new in theory. A determined private investigator could do something similar manually. What changes everything is automation.
The AI system can scale. It can scan thousands of accounts quickly. That’s the part that shifts this from inconvenience to risk.
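To see why vague clues add up, here is a toy sketch (hypothetical profiles, invented attributes, not the paper's method) of how each additional filter shrinks the candidate pool:

```python
# Each clue alone matches many LinkedIn-style profiles; intersecting
# the filters narrows the pool to one person. Illustrative data only.

from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    field: str
    city: str
    tools: frozenset

# Hypothetical profile database.
profiles = [
    Profile("A. Smith", "security", "Berlin", frozenset({"Rust", "Ghidra"})),
    Profile("B. Jones", "security", "Berlin", frozenset({"Python"})),
    Profile("C. Lee",   "frontend", "Berlin", frozenset({"TypeScript"})),
]

# Clues gleaned from anonymous posts over time.
candidates = [p for p in profiles if p.field == "security"]  # field of work: 2 left
candidates = [p for p in candidates if p.city == "Berlin"]   # city: still 2
candidates = [p for p in candidates if "Rust" in p.tools]    # tools: 1 left

print([p.name for p in candidates])  # → ['A. Smith']
```

No single filter identifies anyone. The combination does, and an automated system can run this intersection across thousands of accounts at once.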
Anonymous Quiz Responses That Revealed Real-World Identities
Stylometry, Language Patterns, and Behavioral Signals
In one especially revealing test, participants completed a 10-minute anonymous quiz. Just text responses. No names attached.
Seven percent of the 125 participants were individually identified based on their answers alone.
How?
- Job descriptions (“I work in biology, on research”)
- Education history references
- Specific tools mentioned
- Language patterns (like UK spelling of “analysing”)
Even spelling conventions became clues.
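A spelling convention is a weak signal on its own, but it is trivially machine-readable. A minimal, purely illustrative detector might look like this (the word list and labels are assumptions, not taken from the paper):

```python
# Toy stylometric signal: British vs. American spelling preference.
# Real systems combine hundreds of features like this one.

UK_US_PAIRS = [("analysing", "analyzing"), ("colour", "color"),
               ("organise", "organize"), ("centre", "center")]

def spelling_signal(text: str) -> str:
    """Count UK vs. US spelling variants and return a coarse label."""
    lowered = text.lower()
    uk = sum(lowered.count(uk_word) for uk_word, _ in UK_US_PAIRS)
    us = sum(lowered.count(us_word) for _, us_word in UK_US_PAIRS)
    if uk > us:
        return "likely UK spelling"
    if us > uk:
        return "likely US spelling"
    return "inconclusive"

print(spelling_signal("I've been analysing the colour data."))
# → likely UK spelling
```

Combined with a stated job, a city, and a few named tools, even this crude a signal helps prune the candidate pool.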
It’s unsettling when you think about it. Because it means identity isn’t just what you explicitly state. It’s how you say things. The words you naturally choose.
The Real Risk of AI-Powered Deanonymization
Automation and Mass Surveillance Potential
The research doesn’t claim every anonymous account can be unmasked. That’s important.
But it does show this:
The more personal information you share — even if it feels vague — the more vulnerable you become.
People have been doxxing each other for decades. Law enforcement and investigators have long used digital breadcrumbs. That’s not new.
What is new is scale.
An automated system can:
- Trawl massive datasets
- Detect confident associations
- Cross-reference anonymous and non-anonymous posts
- Perform deanonymization at industrial speed
The researchers warn that large language models can empower criminals and state actors in this way.
And that matters especially for vulnerable groups who rely on anonymous communities for safety and expression.
How Platforms and AI Vendors Can Reduce Deanonymization Risk
API Restrictions and Monitoring for Mass Data Scraping
The research suggests platform-level solutions:
- Stricter limits on API access for large language models
- Monitoring suspicious usage patterns
- Detecting mass deanonymization campaigns
Platforms like Reddit could reduce how easily automated systems ingest user data at scale.
AI vendors could monitor usage to flag attempts to connect anonymous and real-world identities.
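One concrete form such a limit could take is a per-client token bucket: a burst allowance that refills at a fixed sustained rate, which makes bulk scraping slow and conspicuous. This is a generic sketch, not Reddit's or any vendor's actual policy:

```python
# Hypothetical token-bucket rate limiter a platform API could apply
# per API key. Parameters are illustrative.

import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)   # 5-request burst, 1 req/s sustained
results = [bucket.allow() for _ in range(7)]
print(results)  # first five requests pass, the rest are throttled
```

Rate limits alone don't stop a patient attacker, but they raise the cost of industrial-scale harvesting and make abusive usage patterns easier to spot.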
But these solutions depend on enforcement. And consistency. And incentives.
Which leads to the more personal side of this.
The Most Reliable Way to Protect Your Anonymous Identity
Minimize Personal Data Exposure Online
The simplest defense isn’t technical.
It’s restraint.
The most reliable way to prevent your personal data from being associated with an anonymous account is to make sure that data never appears online in the first place.
Not your job.
Not your city.
Not unique combinations of hobbies.
Not identifiable writing patterns repeated across platforms.
It sounds extreme. But the research shows how quickly small disclosures accumulate.
And once the data is public, you don’t control how it’s analyzed.

