Advice on how to deal with AI bots/scrapers?

zoey@lemmy.librebun.com · 20 hours ago

Possibly linux@lemmy.zip · 2 hours ago

Honestly we need some sort of proof of work (PoW)
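For anyone wondering what that could look like, here is a minimal hashcash-style sketch. It assumes the server hands each visitor a random challenge string and only serves the page once the client returns a nonce that makes the SHA-256 hash small enough. The function names, the `challenge:nonce` format, and the difficulty values are all made up for illustration and are not taken from any particular project.

```python
import hashlib
import itertools


def _hash_value(challenge: str, nonce: int) -> int:
    """SHA-256 of "challenge:nonce", read as a big-endian integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        if _hash_value(challenge, nonce) < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _hash_value(challenge, nonce) < (1 << (256 - difficulty_bits))
```

The asymmetry is the point: the client needs roughly 2^difficulty_bits hashes on average, while the server checks the answer with one. A single visitor barely notices, but a scraper fetching millions of pages pays for every one.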

mel@jlai.lu · 9 hours ago

I guess sending tar bombs can be fun

slazer2au@lemmy.world · 8 hours ago

Go on.

breadsmasher@lemmy.world · 20 hours ago

I’m struggling to find it, but there’s something like an “AI tarpit” that causes scrapers to get stuck. I’m sure I saw it posted on Lemmy recently. Hopefully someone can link it.

zoey@lemmy.librebun.com · 19 hours ago

I did find this github link as the first search result, looks interesting, thanks for letting me know the term “tar pit”.

zitrone 🍋@lemmings.world · 15 hours ago

there is also https://forge.hackers.town/hackers.town/nepenthes

N0x0n@lemmy.ml · edit-2 5 hours ago

Now I just want to host a web page and expose it with nepenthes…

First, because I’m a big fan of carnivorous plants.

Second, because it lets you poison LLMs, AI and fuck with their data.

Lastly, because I can do my part and say F#CK Y0U to those privacy data hungry a$$holes !

I don’t even expose anything directly to the web (always accessible through a tunnel like wireguard) or have any important data to protect from AI or LLMs. But just having the opportunity to fuck with them while they continuously harvest data from everyone is something I was already thinking of but didn’t know how to do.

Thanks for the link !

drkt@scribe.disroot.org · 18 hours ago

Build tar pits.

mholiv@lemmy.world · 17 hours ago

They want to reduce the bandwidth usage. Not increase it!

𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social · 16 hours ago

A good tar pit will reduce your bandwidth. Tarpits aren’t about shoving useless data at bots; they’re about responding as slowly as possible to keep the bot connected for as long as possible while giving it nothing.

Endlessh accepts the connection and then… does nothing. It doesn’t even actually perform the SSH negotiation. It just very… slowly… sends… an endless banner, until the bot gives up.

As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
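The idea is small enough to sketch in a few lines of Python with asyncio. This is only an illustration of the technique, not Endlessh itself: the 10-second delay, port 2222, and the function names are assumptions. It relies on the SSH protocol allowing a server to send arbitrary lines before its version string, as long as none of them start with "SSH-".

```python
import asyncio
import random


def banner_line() -> bytes:
    """One junk banner line: lowercase hex digits, so it can never start with "SSH-"."""
    return b"%x\r\n" % random.getrandbits(32)


async def handle(reader, writer, delay: float = 10.0):
    """Drip one short line every `delay` seconds until the client gives up."""
    try:
        while True:
            await asyncio.sleep(delay)
            writer.write(banner_line())
            await writer.drain()
    except (ConnectionError, OSError):
        pass  # the client hung up, which is the only way out
    finally:
        writer.close()


async def main(host: str = "0.0.0.0", port: int = 2222):
    # Port 2222 is an arbitrary choice for this sketch.
    server = await asyncio.start_server(handle, host, port)
    async with server:
        await server.serve_forever()
```

Start it with `asyncio.run(main())`. Each trapped client costs one idle socket and a few bytes every ten seconds, which is why bandwidth use stays tiny.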

mholiv@lemmy.world · 6 hours ago

Fair. But I haven’t seen any anti-ai-scraper tarpits that do that. The ones I’ve seen mostly just pipe 10MB of /dev/urandom out there.

Also I assume that the programmers working at ai companies are not literally mentally deficient. They certainly would add .timeout(10) or whatever to their scrapers. They probably have something more dynamic than that.

𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social · 2 hours ago

Ah, that’s where tuning comes in. Look at the logs, take the average time-out, and tune the tarpit to return a minimal payload: a bare HTML page containing a single, slightly different URL back to the tar pit. Or, better yet, JavaScript that loads a single page of tarpit URLs very slowly. Bots have to be able to run JS, or else they’re missing half the content on the web. I’m sure someone has created a JS forkbomb.
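A rough sketch of that tuning, with everything hypothetical: the `/pit/` prefix, the function names, and especially the 0.9 safety margin, which is just a guess at "reply shortly before the bot would have given up".

```python
import html
import secrets
import statistics


def delay_before_reply(observed_timeouts, margin: float = 0.9) -> float:
    """Seconds to stall before answering: a bit under the average timeout seen in the logs."""
    return statistics.mean(observed_timeouts) * margin


def tarpit_page(prefix: str = "/pit/") -> str:
    """Smallest useful payload: one link to a fresh, never-repeating tarpit URL."""
    href = html.escape(prefix + secrets.token_hex(8), quote=True)
    return f'<!doctype html><html><body><a href="{href}">next</a></body></html>'
```

The web server would wait `delay_before_reply(...)` seconds and then send `tarpit_page()`, so every request costs the bot close to its full timeout and earns it exactly one more URL to follow.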

Variety is the spice of life. AI botnet blacklists are probably the better solution for web content; you can run ssh on a different port and run a tarpit on the standard port, and it will barely affect you. But for the web, if you’re running a web server you probably want visitors, and tarpits would be harder to set up to catch only bots.

mholiv@lemmy.world · 2 hours ago

I see your point but like I think you underestimate the skill of coders. You make sure your timeout is inclusive of JavaScript run times. Maybe set a memory limit too. Like imagine you wanted to scrape the internet. You could solve all these tarpits. Any capable coder could. Now imagine a team of 20 of the best coders money can buy each paid 500.000€. They can certainly do the same.

Like I see the appeal of running a tar pit. But like I don’t see how they can “trap” anyone but script kiddies.
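To make that concrete, here is roughly what such a guard could look like on the scraper side. This is a sketch with invented names, not code from any real crawler: it caps both the total time and the total bytes spent on one response.

```python
import time
from typing import Iterable


class BudgetExceeded(Exception):
    """Raised when a response costs more time or bytes than it is worth."""


def read_with_budget(chunks: Iterable[bytes], max_seconds: float, max_bytes: int) -> bytes:
    """Collect a streamed body, giving up once the time or size budget is spent."""
    deadline = time.monotonic() + max_seconds
    body = bytearray()
    for chunk in chunks:
        body += chunk
        if len(body) > max_bytes:
            raise BudgetExceeded("size cap hit")
        if time.monotonic() > deadline:
            raise BudgetExceeded("deadline hit")
    return bytes(body)
```

Note that the deadline is only checked between chunks, so a real client would also need a per-read socket timeout to escape a server that sends nothing at all.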

sem@lemmy.blahaj.zone · 2 hours ago

There’s one I saw that gave the bot a long circular form to fill out or something, I can’t exactly remember

𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social · 2 hours ago

Yeah, that’s a good one.

drkt@scribe.disroot.org · 16 hours ago

Bots will blacklist your IP if you make it hostile to bots

This will save you bandwidth

douglasg14b@lemmy.world · 17 hours ago

Cool, lots of information provided!

Greg Clarke@lemmy.ca · 18 hours ago

What are you hosting and who are your users? Do you receive any legitimate traffic from AWS or other cloud provider IP addresses? There will always be edge cases like people hosting VPN exit nodes on a VPS etc, but if it’s a tiny portion of your legitimate traffic I would consider blocking all incoming traffic from cloud providers and then whitelisting any that make sense, like search engine crawlers, if necessary.
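The block-then-allow logic can be sketched with Python’s `ipaddress` module. The ranges below are reserved documentation prefixes used as placeholders; real cloud provider ranges would have to come from the providers’ published lists, and in practice this check usually lives in the firewall or reverse proxy rather than in application code.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real cloud provider prefixes.
BLOCKED = [ipaddress.ip_network(c) for c in ("198.51.100.0/24", "2001:db8::/32")]
# Hypothetical exception carved out of a blocked range, e.g. a crawler you want to keep.
ALLOWED = [ipaddress.ip_network("198.51.100.8/29")]


def is_blocked(addr: str) -> bool:
    """True if the address is in a blocked range and not covered by an allow entry."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in ALLOWED):
        return False
    return any(ip in net for net in BLOCKED)
```

The allow list is checked first so that an exception inside a blocked range still gets through.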

CronyAkatsuki@lemmy.cronyakatsuki.xyz · edit-2 20 hours ago

Try crowdsec.

You can set it up with lists that are updated frequently and have it look at Caddy proxy logs, and then it can easily block AI/bot-like traffic.

I have it blocking over 100k IPs at this moment.

https://www.crowdsec.net/

zoey@lemmy.librebun.com · 19 hours ago

Not gonna lie, the $3900/mo at the top of the /pricing page is pretty wild.
Searched “crowdsec docker” and they have docs and all that. Thank you very much, I’ve heard of crowdsec before, but never paid much attention, absolutely will check this out!

WasPentalive@lemmy.one · 17 hours ago

Too bad you can’t post a usage notice saying that anything scraped to train an AI will be charged at $some-huge-money, then pepper the site with bogus facts, occasionally ask various AIs about a bogus fact, use the answer to prove scraping, and invoice the AI’s company.

Kairos@lemmy.today · 15 hours ago

Read access logs and 403 user agents or IPs
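A small sketch of that workflow, assuming the server writes the common "combined" log format (Caddy, for one, logs JSON by default, so the parsing would differ there). The deny list and function names are hypothetical.

```python
import re
from collections import Counter

# Matches the "combined" access log format; adjust for your server's format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical deny list; fill it with whatever shows up at the top of your logs.
DENY_SUBSTRINGS = ("examplebot",)


def top_user_agents(lines, n=10):
    """Count requests per user agent, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match["ua"]] += 1
    return counts.most_common(n)


def status_for(user_agent: str, deny=DENY_SUBSTRINGS) -> int:
    """403 for user agents on the deny list, 200 otherwise."""
    ua = user_agent.lower()
    return 403 if any(s in ua for s in deny) else 200
```

Feed it an open log file, for example `top_user_agents(open("access.log"))`, to see who is hammering the server before deciding what to deny.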

poVoq@slrpnk.net · edit-2 19 hours ago

It seems any somewhat easy to implement solution gets circumvented by them quickly. Some of the bots do respect robots.txt, though, if you explicitly add their self-reported user-agent (but they change it from time to time). This repo has a regularly updated list: https://github.com/ai-robots-txt/ai.robots.txt/
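As a sketch of what such a robots.txt amounts to, the snippet below builds one from a list of user agents and checks it with Python’s standard `urllib.robotparser`. The helper names are made up, and the agent names you pass in are up to you (GPTBot and CCBot are two well-known self-reported crawlers); a real list should come from the repo above.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents) -> str:
    """One 'Disallow: /' group per listed agent, everyone else allowed."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_disallowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Check the generated file the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)
```

This only helps against bots that honor robots.txt in the first place, which is exactly the limitation described above.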

In my experience, git forges are especially hit hard, and the only real solution I found is to put a login wall in front, which kinda sucks especially for open-source projects you want to self-host.

Oh and recently the mlmym (old reddit) frontend for Lemmy seems to have started attracting AI scraping as well. We had to turn it off on our instance because of that.

zoey@lemmy.librebun.com · edit-2 19 hours ago

In my experience, git forges are especially hit hard

Is that why my Forgejo instance has been hit twice like crazy before…
Why can’t we have nice things. Thank you!

EDIT: Hopefully Photon doesn’t get in their sights as well. Though after using the official lemmy webui for a while, I do really like it a lot.

poVoq@slrpnk.net · 19 hours ago

Yeah, Forgejo and Gitea. I think it is partially a problem of insufficient caching on the side of these git forges that makes it especially bad, but in the end that is victim blaming 🫠

Mlmym seems to be the target because it is mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends are also not well protected. Lemmy-ui doesn’t even let you easily add a custom robots.txt; you have to manually overwrite it in the reverse proxy.

solrize@lemmy.world · 19 hours ago

Might be worth patching fail2ban to recognize the scrapers and block them in iptables.