• Rose@slrpnk.net · 8 days ago

    The problem isn’t that the data is already public.

    The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.

    AI crawlers don’t care about robots.txt or any of the other hints about what’s worth crawling and when it’s a good time to crawl again.
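
    For what it’s worth, a polite crawler can already honor all of those hints with the Python standard library alone. A minimal sketch (the site URL, file path, and user agent are made up):

    ```python
    # Polite crawling sketch: respect robots.txt, Crawl-delay, and HTTP
    # conditional requests. SITE and AGENT are hypothetical.
    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser

    SITE = "https://example.org"
    AGENT = "PoliteBot"

    rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    rp.read()

    if not rp.can_fetch(AGENT, SITE + "/data.csv"):
        raise SystemExit("robots.txt says this path is off limits")

    # Honor Crawl-delay if the site sets one; default to a civil 1 s.
    delay = rp.crawl_delay(AGENT) or 1.0

    # Conditional request: an unchanged file costs one cheap 304 reply.
    req = urllib.request.Request(SITE + "/data.csv")
    req.add_header("If-Modified-Since", "Mon, 01 Jan 2024 00:00:00 GMT")
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()  # 200: the file actually changed
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise
        body = None  # 304: nothing new since the last visit

    time.sleep(delay)  # wait before touching the site again
    ```

    None of this is exotic. The complaint is that AI crawlers skip every step of it.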

    • interdimensionalmeme@lemmy.ml · 8 days ago

      Yeah, but there wouldn’t need to be scrapers if the robots file just pointed to a dump file.

      Then the scraper could just spot-check a few dozen random pages, verify the dump is actually up to date and complete, and then it’d know it doesn’t need to waste any time there and can move on.
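
      Something like this rough sketch, assuming a hypothetical site that publishes its dump as JSON mapping page paths to SHA-256 content hashes:

      ```python
      # Spot-check sketch: sample random pages and compare live content
      # hashes against a published dump. URLs and dump format are invented.
      import hashlib
      import json
      import random
      import urllib.request

      SITE = "https://example.org"

      with urllib.request.urlopen(SITE + "/dump.json") as resp:
          dump = json.load(resp)  # e.g. {"/page/1": "ab34...", ...}

      # A few dozen random pages, compared against the dump.
      sample = random.sample(sorted(dump), k=min(36, len(dump)))
      stale = sum(
          hashlib.sha256(urllib.request.urlopen(SITE + p).read()).hexdigest()
          != dump[p]
          for p in sample
      )

      if stale == 0:
          print("dump matches the live site: grab it once and move on")
      else:
          print(f"{stale}/{len(sample)} pages differ: dump may be stale")
      ```

      If the dump checks out, the scraper downloads one file instead of hammering every page.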