What are Fedi Admins doing to block Meta scrapers?

The Nexus of Privacy@lemmy.blahaj.zone · edit-2 13 days ago

What are Fedi Admins doing to block Meta scrapers?

Jeena@piefed.jeena.net · edit-2 14 days ago

The only thing I’ve been doing on my blog (not on my piefed instance yet, but probably should) was user agent filtering:

if ($http_user_agent ~* (SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler|DataForSeoBot|Expanse,\ a\ Palo\ Alto\ Networks\ company|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot)) {
return 403;
}

The Nexus of Privacy@lemmy.blahaj.zone · 14 days ago

Thanks! Does it seem like that’s affective, or are you getting the feel that that the bots are just changing user agent to get aroud it?

Jeena@piefed.jeena.net · 14 days ago

I got the list from a friend who checks his logs every now and then and adds new not names there.

Rimu@piefed.social · edit-2 14 days ago

There are no PieFed instances in that list. Maybe because Meta is blocked in the default PieFed robots.txt or maybe PieFed is too obscure.

The robots.txt on Mastodon and Lemmy is basically useless.

The Mbin robots.txt is massive but does not block Meta’s crawler so presumably it is not being kept up to date.

Any fedi devs reading this: add these

User-agent: meta-externalagent  
User-agent: Meta-ExternalAgent  
User-agent: meta-externalfetcher  
User-agent: Meta-ExternalFetcher  
User-agent: TikTokSpider  
User-agent: DuckAssistBot  
User-agent: anthropic-ai  
Disallow: /

rhythmisaprancer@piefed.social · 14 days ago

@[email protected] in case you are interested

CameronDev@programming.dev · 14 days ago

Just to clarify your question, are you concerned about metas scrapers causing additional server load, or about them stealing the content?

CrocodilloBombardino@piefed.social · 14 days ago

Not OP but i’d be concerned about both

CameronDev@programming.dev · edit-2 14 days ago

The nature of federation makes the later basically impossible to prevent. All data is federated freely, so all meta has to do is spin up an instance and the data is handed directly to them.

CrocodilloBombardino@piefed.social · 14 days ago

Yeah. It’s really just making them do that kind of work. We can block those instances, though ofc it won’t truly stop them, it’ll change the cost benefit analysis.

That plus anubis or something, and whatever future tech that arises

The Nexus of Privacy@lemmy.blahaj.zone · 14 days ago

Agreed, it’s all about changing the cost-benefit analysis, great framing. And also agreed, blocking – and/or shifting to allow-list federation or something more nuanced (to deal with the point @[email protected] makes about Meta just being able to spin up a new instance – is a really important complement to preventing scraping.

Rimu@piefed.social · 14 days ago

Only from the moment they start the instance. That doesn’t give them historical data.

The Nexus of Privacy@lemmy.blahaj.zone · 14 days ago

Yeah I think most admins are concerned about both. And whether or not it’s “stealing” (in the legal sense), a lot of people want to keep their content and personal information out of these AI systems.