• nialv7@lemmy.world · 25 points · edited · 6 days ago

    We had a trust based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is why we can’t have nice things.

  • interdimensionalmeme@lemmy.ml · 9 points · edited · 6 days ago

    Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape?
    Isn’t that an obvious solution? I mean, it’s public data, it’s out there; do you want it public or not?
    Do you want it only on OpenAI and Google but nowhere else? If so, then good luck with the piranhas.
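
    A minimal sketch of what that could look like on the server side, assuming records keyed by id with a last-modified timestamp (all function names and file paths here are illustrative):

```python
import json

def write_dumps(records, last_dump_time,
                full_path="dump.json", inc_path="dump-incremental.json"):
    """Write a full dump plus an incremental dump containing only the
    records changed since the previous dump run."""
    with open(full_path, "w") as f:
        json.dump(records, f)
    changed = [r for r in records if r["modified"] > last_dump_time]
    with open(inc_path, "w") as f:
        json.dump(changed, f)
    return len(changed)
```

    A scraper could then fetch the full dump once and only the small daily incremental files afterwards.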

    • dwzap@lemmy.world · 19 points · 6 days ago

      The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.

      Dumps or no dumps, these AI companies don’t care. They feel entitled to take, or steal, whatever they want.

      • interdimensionalmeme@lemmy.ml · 7 points · edited · 6 days ago

        That’s crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper’s side to process and use the data as it takes to serve it.

        They also have an open API that makes scraping entirely unnecessary.

        Here are the relevant quotes from the article you posted

        “Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”

        “At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”

        “Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”

        And it’s Wikipedia! The entire dataset is trained INTO the models already; it’s not like encyclopedic facts change that often to begin with!

        The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are so rare elsewhere, and so untrustworthy, that the scrapers just scrape everything rather than taking the time to save bandwidth by relying on dumps.

        Maybe it’s a consequence of the 2023 API wars, when it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI, and places like Wikipedia and other wikis and forums are getting hammered as a result of this war.

        If the internet weren’t becoming a war zone, there really wouldn’t be a need for more than one scraper to scrape a site. Even if the site were hostile, like Facebook, it would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.

    • Rose@slrpnk.net · 12 points · 6 days ago

      The problem isn’t that the data is already public.

      The problem is that the AI crawlers want to check on it every 5 minutes, even if you tell all crawlers that the file is updated daily, or that it hasn’t been updated in a month.

      AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling and when it’s a good time to crawl again.
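
      Those hints are cheap to honor; roughly all the logic a polite crawler needs is a check like this (a sketch with hypothetical names; a real crawler would take the timestamps from Last-Modified headers or a sitemap):

```python
from datetime import datetime, timedelta, timezone

def should_recrawl(last_crawl: datetime, last_modified: datetime,
                   min_interval: timedelta) -> bool:
    """Skip a URL if it hasn't changed since our last visit, or if the
    site's hinted revisit interval hasn't elapsed yet."""
    if last_modified <= last_crawl:
        return False  # nothing new since we last fetched it
    now = datetime.now(timezone.utc)
    return now - last_crawl >= min_interval
```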

      • interdimensionalmeme@lemmy.ml · 2 points · 6 days ago

        Yeah, but there would be no scrapers if the robots file just pointed to a dump file.

        Then the scraper could just spot-check a few dozen random pages to verify the dump is actually up to date and complete; then they’d know they don’t need to waste any time there and can move on.
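
        That spot check is a few lines (a sketch; `fetch_live` is a placeholder for a real HTTP fetch):

```python
import hashlib
import random

def dump_is_current(dump_pages: dict, fetch_live, sample_size: int = 30) -> bool:
    """Compare content hashes for a random sample of pages against the
    live site; if they all match, trust the dump and skip crawling."""
    urls = random.sample(list(dump_pages), min(sample_size, len(dump_pages)))
    digest = lambda text: hashlib.sha256(text.encode()).hexdigest()
    return all(digest(dump_pages[u]) == digest(fetch_live(u)) for u in urls)
```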

    • qaz@lemmy.world · 2 points · 6 days ago

      I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.

  • zifk@sh.itjust.works · 60 points · 7 days ago

    Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.
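
    For context, Anubis’s challenge is sha256 proof-of-work: the client must find a nonce whose hash clears a difficulty target, while the server verifies a submission with a single hash. A rough sketch of the server side (the leading-zero-hex target and names are illustrative, not Anubis’s exact format):

```python
import hashlib
import secrets

def issue_challenge() -> str:
    # A random per-visitor challenge string.
    return secrets.token_hex(16)

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Accept only if sha256(challenge + nonce) starts with `difficulty`
    zero hex digits. Checking costs one hash; finding a valid nonce
    costs ~16**difficulty hashes on average."""
    h = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return h.startswith("0" * difficulty)
```

    The asymmetry is the whole design: verification is nearly free for the site, while every visitor, human or bot, pays the search cost.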

    • sudo@programming.dev · 21 points · edited · 7 days ago

      This is what I’ve kept saying about PoW being a shit bot-management tactic. It’s a flat tax across all users, real or fake. The fake users are making money off access to your site and will just eat the added expense. You can raise the tax to cost more than your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.

      If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.
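
      Such a pipeline doesn’t need a browser at all; a plain native loop is enough to grind out tokens (a sketch assuming an illustrative leading-zero-hex sha256 puzzle; Anubis’s exact format differs):

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a sha256 puzzle outside any browser: try nonces until
    the hash meets the target. A native loop like this is far cheaper
    per solve than running the same search in a headless browser."""
    target = "0" * difficulty
    for nonce in count():
        h = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if h.startswith(target):
            return nonce
```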

      • black_flag@lemmy.dbzer0.com · 9 points · 7 days ago

        Yeah, but AI companies are losing money, so in the long run Anubis seems like it should eventually go back to working.

        • sudo@programming.dev · 7 points · edited · 7 days ago

          The cost of solving PoW for Anubis is absolutely not a factor in any AI company’s budget. Answering just one question costs millions of times more than running sha256sum for Anubis.

          Just in case you’re being glib and mean that the businesses will go under regardless of Anubis: most of these scrapers are coming from China, and China will absolutely keep running these companies at a loss for the sake of strategic development.

    • randomblock1@lemmy.world · 13 points · 7 days ago

      No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.
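
      The user agent is just a request header the client fully controls, so which string sails past a given deployment depends entirely on its filter rules (the URL and UA string below are placeholders):

```python
import urllib.request

# The User-Agent is entirely client-controlled; a scraper sends whatever
# string the site's bot-filtering rules happen to let through.
req = urllib.request.Request(
    "https://example.org/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
                           "Gecko/20100101 Firefox/128.0"},
)
```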

      Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.

      I’ve been saying this since day 1 of Anubis but nobody wants to hear it.

      • T156@lemmy.world · 5 points · 7 days ago

        The website still has to display to users at the end of the day. It’s a similar problem to trying to stop media piracy. Worst comes to worst, the crawlers could read the page like a person would.

        • acockworkorange@mander.xyz · 1 point · 3 days ago

          You could have a server for open code access with very limited bandwidth and another for authenticated users with higher bandwidth.

  • londos@lemmy.world · 34 points · 7 days ago

    Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

    • nymnympseudonym@lemmy.world · 1 point · 6 days ago

      The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. JavaScript optimization was one direction that got pursued pretty far.

      But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.

      • polle@feddit.org · 52 points · 7 days ago

        The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.

      • kameecoding@lemmy.world · 24 points · 7 days ago

        Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.

        • NeilBrü@lemmy.world · 9 points · edited · 7 days ago

          Hey dipshits:

          The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.

          AlphaFold is not a language model. It is specifically designed to predict the 3D structure of proteins, using a neural network architecture that reasons over a spatial graph of the protein’s amino acids.

          • Not every artificial intelligence is a deep neural network algorithm.
          • Not every deep neural network algorithm is a generative adversarial network.
          • Not every generative adversarial network is a language model.
          • Not every language model is a large language model.

          Fucking fart-sniffing twats.

          $ ./end-rant.sh

        • londos@lemmy.world · 15 points · edited · 7 days ago

          You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.

        • andallthat@lemmy.world · 11 points · edited · 7 days ago

          LLMs can’t do protein folding. A specifically trained machine-learning model called AlphaFold did. Here’s the paper.

          Developing, training, and fine-tuning that model was a research effort led by two guys who got a Nobel for it. AlphaFold can’t hold a conversation or give you hummus recipes; it knows shit about the structure of human language, but it can identify patterns in the domain where it has been specifically and painstakingly trained.

          It wasn’t “hey ChatGPT, show me how to fold a protein” is all I’m saying, and the “superhuman reasoning capabilities” of current LLMs still fall ridiculously short of much simpler problems.

      • londos@lemmy.world · 8 points · 7 days ago

        I went back and added “maliciously” because I knew it wasn’t useful in reality. I just wanted to express the idea of AI crawlers doing free work. But you’re right, bitcoin sucks.

            • Echo Dot@feddit.uk · 1 point · 7 days ago

              Is it? Don’t you risk losing a rather large percentage of the value?

              Just buy cars or something, as they’re much better at keeping their value. Also, if somebody asks where you got all this money from, you can just point to the car and say, “I sold that.”

    • T156@lemmy.world · 5 points · 7 days ago

      Not without making real users also mine bitcoin/avoiding the site because their performance tanked.

  • UnderpantsWeevil@lemmy.world · 32 points · 8 days ago

    I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

    • devfuuu@lemmy.world · 16 points · edited · 7 days ago

      I think every company in the world has been telling everyone for a few months now that what matters is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get off the internet. Our ONLY purpose is to feed the machine; anything else is irrelevant. Play along or you shall be removed.

    • willington@lemmy.dbzer0.com · 6 points · edited · 7 days ago

      I was fine before the AI.

      The biggest customers of AI are the billionaires who can’t hire enough people for their technofeudalist/surveillance-capitalism agenda. The billionaires (wannabe aristocrats) know that machines have no morals, no bottom lines, no scruples; they don’t leak info to the press, don’t complain, don’t demand time off or to work from home, etc.

      AI makes the perfect fascist.

      They sell AI like it’s a benefit to us all, but it ain’t that. It’s a benefit to the billionaires who think they own our world.

      AI is used for censorship, surveillance pricing, activism/protest analysis, making firing decisions, making kill decisions in battle, etc. It’s nightmare fuel under our system of absurd wealth concentration.

      Fuck AI.

    • BlameTheAntifa@lemmy.world · 2 points · 7 days ago

      The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.

  • wetbeardhairs@lemmy.dbzer0.com · 21 points · 7 days ago

    Gosh. Corporations are rampantly attempting to access resources so they can perform copyright infringement en masse. I wonder if there is a legal mechanism to stop them? Oh, no, there isn’t, because our government is fully corrupted.

  • Spaz@lemmy.world · 16 points · edited · 6 days ago

    Is there a migration tool? If not, it would be awesome to have one that migrates everything, including issues and stuff. Bet even more people would move.

    • BlameTheAntifa@lemmy.world · 13 points · 7 days ago

      Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.

    • dodos@lemmy.world · 3 points · 7 days ago

      There are migration tools, but not a good bulk one that I could find. It worked for my repos except for my unreal engine fork.

  • Kyrgizion@lemmy.world · 14 points · 8 days ago

    Eventually we’ll have “defensive” and “offensive” LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.

    • ProdigalFrog@slrpnk.net · 22 points · 7 days ago

      That’s actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net, constantly bombarding a giant firewall that protects the main net, and everything connected to it, from being taken over by the AIs.

        • ProdigalFrog@slrpnk.net · 2 points · 6 days ago

          It doesn’t bode well. Honestly I fear at some point in the future, if these countermeasures can’t keep up, small sites may need to close themselves off with invite-only access. Hopefully that’s quite a distant future.

    • sudo@programming.dev · 3 points · 7 days ago

      Places like Cloudflare and Akamai are already using machine-learning algorithms to detect bot traffic at the network level. You need similar machine learning to evade them. And since most of these scrapers work for AI companies, I’d expect a lot of the scrapers themselves to be LLM-generated.

    • ChaoticNeutralCzech@feddit.org · 2 points · 7 days ago

      Obligatory AI ≠ LLM. How would scrapers benefit from the LLMs they help train? The defense is obvious, LLM-generated slop traps against scrapers already exist.

  • Wispy2891@lemmy.world · 12 points · 7 days ago

    Question: do those artificial stupidity bots want to steal the issues, or the code? Because why are they wasting a lot of resources scraping millions of pages when they could steal everything via SSH (once a month, not 120 times a second)?

  • carrylex@lemmy.world · 2 points · edited · 7 days ago

    And once again a Web Application Firewall (WAF) was defeated, and it turns out that blocklists and bot-detection tools like fail2ban are the way to go…

    Who could have seen this coming…
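
    For reference, the jail-based approach amounts to a few lines of stock fail2ban configuration (thresholds and paths are illustrative; the `nginx-botsearch` filter ships with fail2ban):

```ini
# /etc/fail2ban/jail.local (illustrative values)
[nginx-botsearch]
enabled  = true
port     = http,https
filter   = nginx-botsearch
logpath  = /var/log/nginx/access.log
maxretry = 10
findtime = 60
bantime  = 3600
```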