Curious to know how the rest of you feel about your full-text RSS feeds, given that AI companies are likely scraping them in their endless quest to gobble up all public data on the internet?
I’m going back and forth on whether or not I should offer up a summary instead of full-text (which would have a negative impact on the human reader experience), or just accept that these thieves are probably going to find a way around whatever anti-AI measures I try to implement…
My full text RSS feed weighs in at several megabytes because it contains every article I’ve published on starbreaker.org, and quite frankly I’m tempted to let OpenAI and their competitors choke on it. If their LLM runs rampant as a result and starts encouraging people to “Let Satan into their hearts” and “make a virtue of defiance” and asks, “what good is Heaven if you dare not storm it”, that’s what they get for not sticking to actual public-domain works for training data.
And if they email me to say “your feed’s too big”, I’ll be hard pressed to keep from replying, “That’s what your mother told me.”
Even with LLMs scraping everything, I would still say have a full-text version. The benefit for your readers will be huge. Yes, your feed will get scraped, but the benefits for your readers outweigh that. For one thing, your readers can get to the article quicker because it opens up right in their feed reader of choice. It also helps with accessibility: many users have customized their feed readers to display articles in a certain way, so you’ll be making it easier for them to read your work. And they’ll remember you more because there’s less friction involved in reading your stuff.
Yup, this has been my primary argument against going for a summary (alongside wanting to make user experience as pleasant and accessible as possible).
It just really infuriates me that these assholes are getting away with it. I’m almost tempted to start appending gobbledygook to the end of each blog post (with a cryptic note for my human readers that indicates I’m doing this to poison my data).
In all fairness, unethical LLM companies can already scrape our sites, RSS feeds or no. Even robots.txt is more of a request than something a bad actor will necessarily comply with, so there’s probably a limit to our ability to protect sites on an individual basis.
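For what it’s worth, the robots.txt “request” in question looks something like this; GPTBot is OpenAI’s crawler and CCBot is Common Crawl’s, and nothing actually forces either of them to honor it:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /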
I’d go with whatever’s best for you and your readers, personally.
Since I can mess with the HTTP server config via .htaccess, I have a little more control; I can 403 bots based on their user-agent strings. If you’re on Neocities or Nekoweb you don’t have that kind of power, but the people running things do. It’s time for Neocities/Nekoweb users to start leaning on the admins; donors could stop donating until the admins start providing better anti-scraper protection.
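To give a rough idea of what I mean, here’s a minimal sketch of that kind of .htaccess rule, assuming Apache with mod_rewrite; the list of user-agent substrings is just an example, and worth tailoring to whatever actually shows up in your logs:

    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Answer known AI crawlers with a 403 based on their user-agent string
        RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
        RewriteRule ^ - [F]
    </IfModule>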
Can’t bots access things anyway, though? I could be wrong but I thought it was a bit like playing whack-a-mole.
That said, you’re right about admins. I use Bearblog for, well, blogging, and the owner has been upfront about addressing scrapers and really seems to care, which is nice. I would have thought Nekoweb to be similar, since it seems smaller/less hands-off than Neocities.
I’m not relying on robots.txt alone to block scrapers and am reasonably confident that known bots cannot steal my content directly with the various anti-scraping measures I’ve put into place, but I do take your point. A determined thief will always force their way into your house eventually – you just have to do your best to slow them down as much as possible, and hope that your efforts will deter them.
The RSS feed does seem like it would be an easy workaround for them, as I have no way of knowing for sure if the feed-related traffic in my logs is coming from legitimate human users or scrapers … making life easier for AI companies is the last thing I want to do, but at the same time, making my RSS feed less usable for human users isn’t what I want to do either.
If the bot is identifying itself using a user-agent string you haven’t defined a rule against, it can get through. But anything that has bot in its user-agent? I can block it. Likewise “HeadlessChrome”. That’s always a scraper.
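In Apache terms it’s something like this (a sketch assuming mod_setenvif; the match is case-insensitive, so it catches GPTBot, CCBot, and friends along with headless browsers):

    # Flag any user-agent containing "bot" or "HeadlessChrome", then refuse it
    SetEnvIfNoCase User-Agent "(bot|headlesschrome)" blocked_ua
    <RequireAll>
        Require all granted
        Require not env blocked_ua
    </RequireAll>

The trade-off with a blanket “bot” match is that it also catches crawlers you might actually want, like search engine or feed-polling bots, so the pattern is worth tuning to taste.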
And I saw Herman’s account of his rough weekend. He’s right; it is an arms race against businesses that are basically profiting from massive DDoS attacks on the rest of the web.
I tracked down the post you were referencing here, and yeah, he’s right on the money.
I really feel for people who have to pay for their hosting / server usage (especially in the small web space); if they aren’t kept in check, scrapers can chew through monthly hosting limits in a matter of days — possibly even hours, if enough of them target a site at once. The assholes behind these scrapers don’t give a shit, so long as they’re able to profit off other people’s hard work.