I linked to one of my recent blog posts on Mastodon this past week, and was dismayed to discover that a new “follower” (I put that word in scare quotes for a reason) scraped the contents of my post a few seconds after it went up and uploaded a crappy multi-page AI-generated summary of the post to their website (a digital marketing / web dev company) without my permission and without providing ANY credit to me.
The account itself is a bot account. In the past twenty-four hours alone, it has scraped and uploaded PDF summaries of 31 separate links posted by authors all over Mastodon. I suspect it is listening to certain hashtags; I used #blog #blogging #indieweb #yaml #jekyll #webdev on the post that got scraped.
I really don’t care when bots automatically boost hashtagged posts (that’s actually helpful!), but I draw the freaking line at bots scraping and reposting my work on another website without credit. I suppose it’s on me for not realizing that there were probably bots out there in the Fediverse pulling this kind of shit, but anyway… just wanted to share this in case other people here feel strongly about not having their work gobbled up and regurgitated by an LLM.
What’s hysterical is that their shitty AI completely misinterpreted my post at times, and included bullet points in the PDF summary that actually contradict what I wrote.
I wouldn’t have known anything about this if the bot hadn’t auto-followed me when it scraped the contents of my link. I expect there are a lot more of these artificial hemorrhoids all over Mastodon, but here’s at least one to block.
I found out today about a project called “iocaine”.
I track who visits my site with Umami, and yesterday the same scraper I’ve been fighting with for weeks found out about my wiki and went totally crazy.
It’s been almost 24 hours and I haven’t seen anything from the scraper lately.
I do recommend that you read the documentation thoroughly, because if you have the wrong settings, it can even block search engine crawlers. I don’t really care about that, so I turned it off.
Not sure if this diverges too much from the original topic, but things like these scare the hell out of me. How are we supposed to even be on the Internet if everything we do can be stolen and replaced?
Just a sidebar to say I’ve been feeling exactly the same, and it was only this past week that I discovered that web firewalls to block AI scrapers are a thing*. As happy as I am to discover ways to prevent AI bots from web scraping, what kind of a world do we live in where I even need to think about that?
I like the look of Anubis and I’ve seen a few people here using it so I’m going to give it a try. Like cashmere, I don’t actually care if it blocks search engines too so I’m happy for it to be quite restrictive.
(*I’m not actually sure what to call them, since Web AI Firewall is an entirely different thing, which seems like either a really clever/devious marketing ploy or an unfortunately confusing misnaming.)
Back on topic, that really sucks about AI bots on Mastodon. I’m glad I’ve stepped away from social media, but the web is going to become a very lonely place if people don’t feel they can share their links anymore.
I cannot begin to express how much I loathe the shitty humans who operate AI scrapers and botnets in general. Anubis is what I’d be using if I wasn’t already using Cloudflare’s free security options. I know Cloudflare gets a bad rap, and it bothers me as well that so much of the internet is filtered through Cloudflare, but the bot situation is actually insane. I have a WAF set up to issue a challenge to countries known for their bot attacks (China, Russia, India, etc.), and the number of failed challenges I see in my security logs is staggering (usually well over 2000 every single day). I’ve also had to set up custom blocks for certain request strings like “/wp-admin/,” as I get thousands of hits every day from bots based in the US and elsewhere looking for non-existent WordPress directories to exploit.
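For anyone who wants to see the logic of the two rules described above (challenge by country, block by request string) written out, here is a rough Python sketch. It is not Cloudflare's rule syntax or API, and it is not the actual configuration from this post: the `Request` class, the `decide` function, and both constants are hypothetical names made up for illustration, and the country codes are just the ISO codes for the three countries named above.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only, not a real configuration.
CHALLENGE_COUNTRIES = {"CN", "RU", "IN"}   # ISO 3166-1 alpha-2 codes
BLOCKED_PATH_FRAGMENTS = ("/wp-admin/",)   # request strings to reject outright


@dataclass(frozen=True)
class Request:
    country: str  # two-letter country code of the client's apparent location
    path: str     # request path, e.g. "/blog/my-post/"


def decide(request: Request) -> str:
    """Return "block", "challenge", or "allow" for a single request."""
    path = request.path.lower()
    # Path rule first: probes for non-existent WordPress directories are
    # rejected no matter where they come from.
    if any(fragment in path for fragment in BLOCKED_PATH_FRAGMENTS):
        return "block"
    # Country rule second: these visitors get a challenge, not a hard block.
    if request.country.upper() in CHALLENGE_COUNTRIES:
        return "challenge"
    return "allow"


if __name__ == "__main__":
    print(decide(Request("US", "/wp-admin/setup-config.php")))
    print(decide(Request("CN", "/blog/my-post/")))
    print(decide(Request("US", "/blog/my-post/")))
```

Checking the path rule before the country rule is a design choice in this sketch: it means WordPress probes from the US and elsewhere are dropped too, which matches the situation described above where those hits are not limited to the challenged countries.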