I'm not well versed on the subject of AI. I tend to think it's a tool that can be used for good or bad, yada yada. But I've seen a lot of theft lately using AI: stolen art, writing, animations, YouTube videos, even people's voices and faces. I know a tiny bit about robots.txt and the various defensive/offensive filters used to confuse or poison datasets, but I was hoping to hear your thoughts and advice. I have a huge backlog of paintings and stories that I want to share online, but I'm anxious about doing so.
This is a systemic problem not particularly amenable to individual action. While I can block known user agents at the server level instead of just using robots.txt, even that’s no guarantee.
What we need is for governments to crack down hard on corporations that train generative AI models through mass automated copyright infringement. They need to hold people like Elon Musk, Sam Altman, and Mark Zuckerberg criminally liable and, after proving them guilty at a jury trial, throw them into solitary confinement and seize their wealth under criminal asset forfeiture.
That’s not likely to happen in the US. Apparently “free enterprise” means letting billionaires and corporations ignore the law.
I used to be sort of obsessed with trying to thwart scraping and AI training, but now that AI companies have basically vacuumed up the entire Internet, I feel like any new content is a drop in the bucket and will only contribute to the rapidly flattening curve of improvement to modern models.
That being said, I do have a few suggestions if you want to protect your own work. The first one is good ol' Glaze, which you may have already heard of. I haven't really used it myself, since it didn't work well on flat, cartoony art styles like mine when I tried it (maybe they've improved it since). It has been around for a while, but I don't know how widespread its use is or how effective it has actually proven to be. There's also the related Nightshade project.
Other than those options, I came up with a potential strategy to specifically impede automated scraping a couple years back, although I got very little interest in it from others at the time and never put it into practice myself. Basically, it involved storing images or other data in a scrambled or encrypted format and then using JavaScript to unscramble or decrypt the data at page load time and render it to a canvas element or otherwise insert it into the document. It wouldn’t prevent the content from being saved manually, but I figured it would at least thwart any bot that just blindly downloaded content from web servers without actually loading the page. Unfortunately(?), I imagine it would also prevent any of the protected content from showing up in search results or being embedded outside of its original page. Unlike Glaze and Nightshade, it also wouldn’t prevent a targeted attack by a real person. However, it could be a useful strategy for protecting text content.
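To give a rough idea of the image version of this, here's a minimal sketch, assuming the image bytes were scrambled ahead of time with a trivial one-byte XOR. The file name, canvas id, and key are made-up placeholders, and a real setup would want something stronger (e.g. actual encryption via WebCrypto):

async function showScrambledImage(url, key) {
  // Fetch the scrambled bytes; a bot that never executes JavaScript only ever sees these.
  const response = await fetch(url);
  const scrambled = new Uint8Array(await response.arrayBuffer());

  // Undo the XOR scrambling. Real decryption would slot in here instead.
  const plain = scrambled.map(byte => byte ^ key);

  // Rebuild the image from the recovered bytes and paint it onto a canvas element.
  const blob = new Blob([plain], { type: 'image/png' });
  const bitmap = await createImageBitmap(blob);
  document.getElementById('art').getContext('2d').drawImage(bitmap, 0, 0);
}

showScrambledImage('/images/painting.bin', 0x5a);

The page itself would just carry an empty <canvas id="art"> element where the image normally goes.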
As for text content specifically, putting the text inside an image could make it less likely to end up as AI training data. I also discovered that it's possible to scramble the glyphs of a font and, with CSS, serve text that looks like nonsense inside the HTML file but is still visually readable when it gets rendered. Both of those approaches would hinder accessibility, though, and could still be scraped by a bot that runs OCR.
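The CSS side of the glyph trick is tiny, by the way; the real work happens in the font file, which you'd have to generate yourself with its character-to-glyph mapping permuted. The font name and class below are hypothetical:

/* "scrambled.woff2" stands in for a font whose glyph mapping has been permuted,
   so the nonsense characters stored in the HTML render as readable text */
@font-face {
  font-family: "Scrambled";
  src: url("/fonts/scrambled.woff2") format("woff2");
}
.obfuscated {
  font-family: "Scrambled", serif;
}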
You don’t.
If the person scraping your stuff for training data is acting in bad faith, you likely can't stop them.
robots.txt
If the person scraping the web is operating in good faith, this is all you need to do. If they're not, they'll ignore your robots.txt and scrape your site anyway.
poisoning
As soon as someone training a model discovers a new dataset-poisoning technique, they will write code to detect poisoned images and set them aside. Once they find time, they'll find a method of undoing the poison and will then consume your images anyway. Poison, much like encryption, must work forever: if at any point in the future it can be undone, all of those poisoned images will be trained on. The popular techniques I saw employed when this became a concern (Nightshade, WebGlaze) were detectable in days and defeated in weeks. I have no idea whether there are newer methods of poisoning that haven't been defeated yet, but it's moot; as soon as they are defeated, your stuff will be trained on.
banning scrapers when you detect them
IP addresses are cheap and bad actors are always changing their techniques.
Custom solutions
This may work simply due to security through obscurity, but it isn't guaranteed: if your technique becomes common, scrapers will adapt. It also requires you to self-host and implement your own solutions, which means you'd have to stop sharing your creations on other people's computers (read: all social media sites, all of the fediverse, and most indie sites). If you want to avoid feeding the machine, your techniques have to work every time; theirs only need to work once. Given that you need to make the image or text display on another person's computer, you're at a massive disadvantage.
There isn't anything you can really do about it. Even if legislation is passed, there will absolutely be unscrupulous actors still doing it; it's unenforceable. To go after someone, you'd need enough money to sue them, you'd have to figure out who they are, and you'd have to hope they're in a jurisdiction where you have legal standing. Not to mention prove that your work was trained on by any one particular model.
I've just let go. My personal data has zero impact on the quality of an AI model when these companies are chewing through terabytes of data anyhow. Not to mention that the amount of data (and compute, for that matter) required to train up new models has come down massively in the last two or three years.
If you want to block AI crawlers via robots.txt, add the appropriate disallow rules to your robots.txt file:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
This will prevent the most common AI crawlers from scraping your website content.
Also, you could use Cloudflare's bot protection to block AI crawlers. Cloudflare offers a dedicated firewall rule designed specifically to stop AI scrapers from accessing your site. You can read more about it here: Cloudflare's Firewall for AI.
If you want to take it a step further, you can also block these crawlers at the server level by using their User-Agent in your .htaccess file. This adds an extra layer of protection, especially against crawlers that might ignore the robots.txt directives.
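For example, on Apache with mod_rewrite enabled, a few lines like these in .htaccess (mirroring the user-agent list above) will answer those crawlers with a 403:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|Google-Extended|ClaudeBot|CCBot|anthropic-ai) [NC]
RewriteRule .* - [F,L]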
However, the most effective and long-term solution is to enable Cloudflare’s WAF (Web Application Firewall) combined with their AI scraping protection. This will significantly reduce the chances of AI crawlers accessing your content, making it much harder for them to scrape your artwork, written content, or any other creative work.
Blocking AI crawlers is becoming increasingly important as content scraping rises, and taking these steps can help protect your website from unauthorized data harvesting.
Thank you for bringing this up! I hadn't even begun to consider how to post poetry online. I will look into this, too.
Thank you, I think I might agree with this stance. I will still follow everyone's suggestions and continue researching, but I feel less afraid now. Thank you, everyone, for your help! :)
I simply don't post anything I consider valuable online at all. Anything I create that I don't want scraped I keep private, on my own computer and backup drives. Starbreaker is right – the big scumbags are currently just allowed to steal.
Nightshade and robots.txt are basically the only somewhat functional tools I know of.
In the end, I take @queuetea's view. I don't do visual art, just writing, which is basically impossible to obfuscate in the way Nightshade does if you want it to be at all accessible to screen readers, RSS, etc.
For me, getting ingested by AI is simply the price of admission. Just as I accept that I can’t control what human readers do with my writing, I accept that I can’t control what humans will instruct machines to do with my writing:
- If you want to copy my posts into Word and read them in a different font, I can't stop you.
- If you want to make blackout poetry of my hot take, I can’t do anything about that.
- If you want to make a word cloud, pop off I guess.
- If you want to do a bunch of matrix multiplications to train your weird AI…
If you don’t give a damn about getting indexed on the search engines (and you probably shouldn’t), here’s a solution I’ve tried:
Create a hidden <a href="/botpot.php" style="display:none">If you go here you will get your IP blocked</a>
link at the top of every page (or at least on your home page). A visit to /botpot.php shows a message saying something like:
This page is a bot honeypot. Your IP was blocked for visiting this page. Send an email to [obfuscated e-mail address] to have it unblocked.
A visit to /botpot.php also adds the visitor's IP to the list of blocked IPs in .htaccess.
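A bare-bones botpot.php could look something like this. It's only a sketch, and it assumes Apache still honors the old Order/Allow/Deny directives (mod_access_compat) and that your .htaccess already starts with "Order Allow,Deny" and "Allow from all":

<?php
// botpot.php: honeypot sketch. Anything that follows the hidden link lands here.
$ip = $_SERVER['REMOTE_ADDR'];

// Sanity-check the address before writing it into .htaccess.
if (filter_var($ip, FILTER_VALIDATE_IP)) {
    file_put_contents(__DIR__ . '/.htaccess', "Deny from {$ip}\n", FILE_APPEND | LOCK_EX);
}

http_response_code(403);
echo 'This page is a bot honeypot. Your IP was blocked for visiting this page. '
   . 'Send an email to [obfuscated e-mail address] to have it unblocked.';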
And yeah, this solution probably would be compatible with being indexed by search engines too. Use robots.txt to disallow /botpot.php. If a bot goes there despite this, you definitely know it’s not a good one.
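The robots.txt part of that is just:

User-agent: *
Disallow: /botpot.php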
Now I guess you could disallow the page showing your art too and be a bit more sure that it won't be slurped by AI bots. I guess it's a matter of whether it tries to visit /botpot.php first…