Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Pro@programming.dev · 11 days ago (edited)

IndescribablySad@threads.net@sh.itjust.works · 12 days ago

I really feel like scrapers should have been outlawed or actioned at some point.

Programmer Belch@lemmy.dbzer0.com · 12 days ago

I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the server cannot handle, resulting in slow traffic.

IndescribablySad@threads.net@sh.itjust.works · 12 days ago (edited)

Seems like an API request would be preferable for the site you’re checking. I don’t imagine they’re unhappy with the traffic if they haven’t blocked it yet.

JPAKx4@lemmy.blahaj.zone · 12 days ago

I mean, if it’s a CMS site there may not be an API; this would be the only solution in that case.

ulterno@programming.dev · 12 days ago

If the site is getting slowed at times (regardless of whether it is when you scrape), you might want to not scrape at all.

Probably not a good idea to download the whole site, but then that depends upon the site.

- If it is a static site, setting up your scraper to not download CSS/JS and images/videos should make a difference.
- For a dynamically created site, there’s nothing I can say.
- Then again, if you reduce your download to just what you are using, as much as possible, that might be good enough.
- Since sites are originally made for human consumption, you might consider keeping your link traversal rate similar to a human’s.
- The best would be if you could ask the website dev whether they have an API available.
  - Even better, ask them to provide an RSS feed.
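A rough Python sketch of the points above (skip assets, keep a human-like pace). The names `SKIPPED_EXTENSIONS`, `is_page_url`, and `fetch_politely`, as well as the 5-second default delay, are made up for illustration, and `fetch` stands for whatever download function you already use. Treat it as one possible approach, not a standard.

```python
import posixpath
import time
from urllib.parse import urlparse

# Hypothetical list of asset types a text-only scraper does not need.
SKIPPED_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                      ".webp", ".svg", ".mp4", ".webm", ".woff", ".woff2"}


def is_page_url(url):
    """Return True unless the URL path ends in a known asset extension."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch page URLs one at a time, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    pages = {}
    for url in urls:
        if not is_page_url(url):
            continue  # skip CSS/JS/images/videos entirely
        if pages:
            sleep(delay_seconds)  # pause before every request but the first
        pages[url] = fetch(url)
    return pages
```

Passing `fetch` in as a parameter keeps the pacing logic separate from the HTTP library, so it works the same with `urllib.request`, `requests`, or anything else.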

Programmer Belch@lemmy.dbzer0.com · 12 days ago

As far as I know, the website doesn’t have an API, so I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each run, one for each series I’m following.
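Roughly, a script like that could look like the sketch below, using only the standard library. The assumption that chapter links are `<a class="chapter">` elements is hypothetical, as are the names `ChapterLinkParser` and `build_rss`; a real site will need its own selector logic.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a class="chapter"> links.

    The "chapter" class is an assumption made for this example.
    """

    def __init__(self):
        super().__init__()
        self.chapters = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "chapter" in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.chapters.append(("".join(self._text).strip(), self._href))
            self._href = None


def build_rss(series_title, series_url, html):
    """Turn already-downloaded HTML into a minimal RSS 2.0 document."""
    parser = ChapterLinkParser()
    parser.feed(html)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = series_title
    ET.SubElement(channel, "link").text = series_url
    ET.SubElement(channel, "description").text = f"New chapters of {series_title}"
    for title, href in parser.chapters:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the series page URL.
        ET.SubElement(item, "link").text = urljoin(series_url, href)
    return ET.tostring(rss, encoding="unicode")
```

Because `build_rss` takes the HTML as a string, the download step stays separate and still amounts to one request per series.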

ulterno@programming.dev · 12 days ago

That might/might not be much.
Depends upon the site, I’d say.

e.g. If it’s something like Netflix, I wouldn’t think much, because they have the means to serve the requests.
But for some PeerTube instance, even a single request seems to be too heavy for them. So if that server does not respond to my request, I usually wait for an hour or so before refreshing the page.

Flax@feddit.uk · 12 days ago

The problem is that these are constant hordes coming from datacentres. You have one tool. Sending a few requests from your device wouldn’t even dent a Raspberry Pi, never mind a beefier server.

I think the intention of traffic is also important. Your tool is so you can consume the content freely provided by the website. Their tool is so they can profit off of the work on the website.

deur@feddit.nl · 8 days ago (edited)

deleted by creator

grue@lemmy.world · 12 days ago

But HTML is machine-readable and that absolutely is the point!

Never forget what they stole from us.

FizzyOrange@programming.dev · 12 days ago

So search engines shouldn’t exist? This is absurdly simplistic.

folken@lemmy.world · 11 days ago (edited)

When you realize that you live in a cyberpunk novel. The AI is cracking the ICE. https://cyberpunk.fandom.com/wiki/Black_ICE

Regrettable_incident@lemmy.world · 11 days ago

I love seeing how much influence William Gibson had on cyberpunk.

ThePyroPython@lemmy.world · 11 days ago

It’s not intentional but the chap ended up writing works that defined both the Cyberpunk (Neuromancer) and Steampunk (The Difference Engine) genres.

Can’t deny that influence.

MeThisGuy@feddit.nl · 11 days ago

Most of the ICE I’ve read about are white.

Haven’t tried it, it’s in the closed Apple store… but it’s a start…

https://apps.apple.com/us/app/iceblock/id6741939020

chicken@lemmy.dbzer0.com · 12 days ago

Seems like such a massive waste of bandwidth since it’s the same work being repeated by many different actors to piece together the same dataset bit by bit.

chuckleslord@lemmy.world · 12 days ago

Ah Capitalism! Truly the king of efficiency /s

0_o7@lemmy.dbzer0.com · 11 days ago

I blocked almost all the big players in hosting, plus China, Russia, and Vietnam, and now they’re bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.

Soon everything on the internet will be behind a wall.

irelephant [he/him]@programming.dev · 11 days ago

This isn’t sustainable for the AI companies; when the bubble pops it will stop.

aev_software@programming.dev · 11 days ago

In the meantime, sites are getting DDoS-ed by scrapers. One way to stop your site from getting scraped is having it be inaccessible… which is what the scrapers are causing.

Normally I would assume DDoS-ing is performed in order to take a site offline. But AI scrapers require the opposite. They need their targets online and willing. One would think they’d be a bit more careful about the damage they cause.

But they aren’t, because capitalism.

ExLisper · 11 days ago

There are many commercial VPNs offering residential IPs. I doubt they use malware.

ILikeTraaaains@lemmy.world · 10 days ago

Not necessarily compromised, I saw a VPN provider (don’t remember the name) that offered a free tier where the client accepts being used for this.

And I suspect that in the future some VPN companies will be exposed doing the same but with their paid customers.

Blackmist@feddit.uk · 11 days ago

Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

Kissaki@feddit.org · 11 days ago

Reminds me of the “store data inside slow network requests for the in-transit duration” idea. It was a fun article to read.

harambe69@lemmy.dbzer0.com · 11 days ago

Link, please?

nik9000@programming.dev · 11 days ago

I believe they are talking about Harder Drive: https://youtu.be/JcJSW7Rprio

harambe69@lemmy.dbzer0.com · 10 days ago

Thanks!

excral@feddit.org · 11 days ago

I like the idea but couldn’t you just go the more direct route and mine crypto?

sp3ctr4l@lemmy.dbzer0.com · 11 days ago

Do we all want the fucking Blackwall from Cyberpunk 2077?

Fucking NetWatch?

Because this is how we end up with them.

…excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.

somerandomperson@lemmy.dbzer0.com · 11 days ago

Consider nicotine+

sp3ctr4l@lemmy.dbzer0.com · 11 days ago (edited)

What was that?

I was sucking on my nicotine nipple, err, I mean my vape.

(Hey, it’s a more affordable stimulant addiction than coffee now!)

somerandomperson@lemmy.dbzer0.com · 11 days ago (edited)

No, not the drug; the app.

sp3ctr4l@lemmy.dbzer0.com · 11 days ago (edited)

Oh, well shit, I had not heard of this lol.

I am partial to I2P as … potentially, an entirely new, full internet paradigm, not just filesharing, but I will look into this too!

somerandomperson@lemmy.dbzer0.com · 10 days ago

It’s a Soulseek client, basically. You can share files, chat, put your interests in your profile, etc. It’s basically like social media, minus the posts. The only algorithm that exists is the one that shows people with similar interests. You can also view the most common interests. You can also add disinterests, which are the exact opposite.

sp3ctr4l@lemmy.dbzer0.com · 10 days ago

That does sound very interesting!

ryanvade@lemmy.world · 12 days ago

It’s being investigated at least, hopefully a solution can be found. This will probably end up in a constantly escalating battle with the AI companies. https://github.com/TecharoHQ/anubis/issues/978

rozodru@lemmy.world · 11 days ago

I run my own Gitea instance on my own server and within the past week or so I’ve noticed it just getting absolutely nailed. One repo in particular, a Wayland WM I built. Just keeps getting hammered over and over by IPs in China.

ZILtoid1991@lemmy.world · 11 days ago

> Just keeps getting hammered over and over by IPs in China.

Simple solution: Block Chinese IPs!

witten@lemmy.world · 11 days ago

Are you using Anubis?

Justas🇱🇹@sh.itjust.works · 11 days ago

Why aren’t you firewalling it to only allow your IP? Are you sharing your code with third parties?

rozodru@lemmy.world · 11 days ago

A few repos I’ve made available to the public, the Wayland WM for example, so I haven’t gotten around to blocking IPs just yet.

Timber@lemmy.blahaj.zone · 11 days ago

Are those blocklists publicly available somewhere?

Taldan@lemmy.world · 11 days ago

I would hope not. Kinda pointless if they become public

daniskarma@lemmy.dbzer0.com · 11 days ago

On the contrary. Open, community-based blocklists can be very effective. Everyone can contribute to them and asphyxiate people with malicious intent.

If you’re thinking something like “if the blocklist is available then malicious agents simply won’t use those IPs”, I don’t think that makes a lot of sense. The malicious agent will find out that any of their IPs is blocked as soon as they use it anyway.
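As a rough illustration of how a shared blocklist might be consumed, here is a minimal Python sketch using the standard `ipaddress` module. The file format (one IP or CIDR range per line, `#` for comments) and the function names `load_blocklist` and `is_blocked` are assumptions made for this example, not the format of any particular community list.

```python
import ipaddress


def load_blocklist(lines):
    """Parse lines like '203.0.113.0/24  # comment' into network objects."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if entry:
            # strict=False accepts entries whose host bits are set,
            # e.g. '203.0.113.7/24' is read as 203.0.113.0/24.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(address, networks):
    """Return True if the address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    # Membership is False when IP versions differ, so mixed lists are fine.
    return any(ip in network for network in networks)
```

In practice a check like this would more likely live in the firewall or reverse proxy than in application code, but the matching logic is the same.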

Phineaz@feddit.org · 11 days ago

I suppose it depends on whether others contribute or not.

rozodru@lemmy.world · 11 days ago

They’re getting hammered again this morning.

Harbinger01173430@lemmy.world · 11 days ago

A good solution would be to load the PCs connecting from the AI IPs with a virus that overloads the computer and makes it explode.