• sk1nnym1ke@piefed.social
    6 months ago

    Too lazy to create the meme. Insert the two astronauts looking at earth meme

    Wait, there is no decentralized internet?

    Always has been.

  • Dave@lemmy.nz
    6 months ago

    Running an instance without Cloudflare in front is hard work, because AI scrapers bring it to its knees. It’s a never-ending battle to block them even with Cloudflare, but at least Cloudflare helps reduce the load, and even the free version comes with many tools to identify and block problematic bots.

    Though if you turn on bot blocking you break federation, so you have to be a lot more refined in your security rules.

    • Of the Air (cele/celes)@lemmy.blahaj.zone
      6 months ago

      because AI scrapers bring it to its knees

      There are currently at least three pieces of web software that protect against AI scrapers, so it should be more than possible without Cloudflare.

      • Dave@lemmy.nz
        6 months ago

        Cloudflare’s bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).

        For example, Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It’s telling my instance about every post, comment, or vote. AI scrapers also send hundreds of thousands or millions of requests a day, in a near steady stream.

        For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high-traffic ASNs (autonomous system numbers, each covering a group of IPs run by the same entity, which matters because blackhat AI scrapers use many IPs) and showing a Cloudflare challenge (which will typically have a 0% pass rate). If the traffic comes from one IP then it’s probably a federated instance, but with scrapers I typically see many IPs from the same area, with requests spread evenly across them.
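
        A minimal Python sketch of that heuristic, for illustration only. It assumes you can export request logs as (ASN, IP) pairs. The function name `suspicious_asns` and all three thresholds are made up for this example and are not Cloudflare settings; the ASNs and IPs in the usage example come from ranges reserved for documentation.

```python
from collections import Counter, defaultdict


def suspicious_asns(requests, min_requests=1000, min_ips=20, max_top_ip_share=0.2):
    """Return ASNs whose traffic is high-volume and spread evenly over many IPs.

    requests: iterable of (asn, ip) tuples, one tuple per request.
    All thresholds are illustrative assumptions; tune them to your own traffic.
    """
    per_asn = defaultdict(Counter)
    for asn, ip in requests:
        per_asn[asn][ip] += 1

    flagged = []
    for asn, ip_counts in per_asn.items():
        total = sum(ip_counts.values())
        if total < min_requests:
            continue  # not enough traffic to worry about
        if len(ip_counts) < min_ips:
            continue  # one or a few IPs: more likely a federated instance
        top_share = max(ip_counts.values()) / total
        if top_share <= max_top_ip_share:
            flagged.append(asn)  # no single IP dominates: even spread
    return sorted(flagged)


# Usage example with synthetic data.
log = []
# ASN 64500: 50 IPs sending 40 requests each (scraper-like).
for i in range(50):
    log += [(64500, f"203.0.113.{i}")] * 40
# ASN 64501: a single IP sending 5000 requests (federation-like).
log += [(64501, "198.51.100.7")] * 5000
# ASN 64502: low-volume traffic.
log += [(64502, "192.0.2.1")] * 10

flagged = suspicious_asns(log)
```

        Anything this returns is only a candidate for a challenge rule. A large instance behind several IPs could trip it too, so check before blocking.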

        I also try to exclude federation/API endpoints, which can help stop false positives as scrapers are generally loading the web page.
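
        Roughly what that exclusion looks like as logic, sketched in Python. The path prefixes and the helper name `should_challenge` are assumptions for illustration, so check which routes your own software actually uses for federation and its API. The two media types are the ones ActivityPub defines, but a scraper could send them too, so this only reduces false positives.

```python
# Hypothetical prefixes; adjust to the routes your instance really serves.
EXEMPT_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo")
ACTIVITYPUB_TYPES = ("application/activity+json", "application/ld+json")


def should_challenge(path, accept="", content_type=""):
    """Return True if a request is a candidate for a bot challenge."""
    if path.startswith(EXEMPT_PREFIXES):
        return False  # API or shared federation endpoint
    if path.endswith("/inbox"):
        return False  # per-user or per-community inbox
    headers = f"{accept} {content_type}".lower()
    if any(media_type in headers for media_type in ACTIVITYPUB_TYPES):
        return False  # ActivityPub fetch or delivery
    return True  # ordinary web page load, which is what scrapers request
```

        In practice this check would be combined with the traffic-based rule, so only page loads from flagged networks get the challenge.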

        This is something Lemmy (and PieFed, Mbin) admins try to help each other with by sharing strategies, because one day a bot will find you and suddenly your instance is down because it is hammering you too hard.

        I bet if you are in China, Brazil, Singapore, Argentina, etc., you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc. will typically respect robots.txt, so US traffic is less of an issue).