I'm not well versed on the subject of AI. I tend to think it's a tool that can be used for good or bad, yada yada. But I've seen a lot of theft lately using AI: stolen art, writing, animations, YouTube videos, even people's voices and faces. I know a tiny bit about robots.txt and the various defensive/offensive filters used to confuse or poison datasets, but I was hoping to hear your thoughts and advice. I have a huge backlog of paintings and stories that I want to share online, but I'm anxious about doing so.
This is a systemic problem not particularly amenable to individual action. While I can block known user agents at the server level instead of just using robots.txt, even that’s no guarantee.
What we need is for governments to crack down hard on corporations that train generative AI models through mass automated copyright infringement. They need to hold people like Elon Musk, Sam Altman, and Mark Zuckerberg criminally liable and, after proving them guilty at a jury trial, throw them into solitary confinement and seize their wealth under criminal asset forfeiture.
That’s not likely to happen in the US. Apparently “free enterprise” means letting billionaires and corporations ignore the law.
I used to be sort of obsessed with trying to thwart scraping and AI training, but now that AI companies have basically vacuumed up the entire Internet, I feel like any new content is a drop in the bucket and will only contribute to the rapidly flattening curve of improvement to modern models.
That being said, I do have a few suggestions if you want to protect your own work. The first one is good ol' Glaze, which you may have already heard of. I haven't really used it myself, since it didn't work well on flat, cartoony art styles like mine when I tried it (maybe they've improved it since). It has been around for a while, but I don't know how widespread its use is or how effective it has actually proven to be. There's also the related Nightshade project.
Other than those options, I came up with a potential strategy to specifically impede automated scraping a couple years back, although I got very little interest in it from others at the time and never put it into practice myself. Basically, it involved storing images or other data in a scrambled or encrypted format and then using JavaScript to unscramble or decrypt the data at page load time and render it to a canvas element or otherwise insert it into the document. It wouldn’t prevent the content from being saved manually, but I figured it would at least thwart any bot that just blindly downloaded content from web servers without actually loading the page. Unfortunately(?), I imagine it would also prevent any of the protected content from showing up in search results or being embedded outside of its original page. Unlike Glaze and Nightshade, it also wouldn’t prevent a targeted attack by a real person. However, it could be a useful strategy for protecting text content.
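To give a rough idea of the image version of this, here's a minimal sketch, assuming the image bytes were scrambled ahead of time with a trivial one-byte XOR. The file name, canvas id, and key are made-up placeholders, and a real setup would want something stronger (e.g. actual encryption via WebCrypto):

async function showScrambledImage(url, key) {
  // Fetch the scrambled bytes; a bot that never executes JavaScript only ever sees these.
  const response = await fetch(url);
  const scrambled = new Uint8Array(await response.arrayBuffer());

  // Undo the XOR scrambling. Real decryption would slot in here instead.
  const plain = scrambled.map(byte => byte ^ key);

  // Rebuild the image from the recovered bytes and paint it onto a canvas element.
  const blob = new Blob([plain], { type: 'image/png' });
  const bitmap = await createImageBitmap(blob);
  document.getElementById('art').getContext('2d').drawImage(bitmap, 0, 0);
}

showScrambledImage('/images/painting.bin', 0x5a);

The page itself would just carry an empty <canvas id="art"> element where the image normally goes.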
As for text content specifically, putting the text inside an image could make it less likely to end up as AI training data. I also discovered that it's possible to scramble the glyphs of a font and, with CSS, serve text that looks like nonsense inside the HTML file but is still visually readable when it gets rendered. Both of those approaches would hinder accessibility, though, and could still be scraped by a bot that runs OCR.
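The CSS side of the glyph trick is tiny, by the way; the real work happens in the font file, which you'd have to generate yourself with its character-to-glyph mapping permuted. The font name and class below are hypothetical:

/* "scrambled.woff2" stands in for a font whose glyph mapping has been permuted,
   so the nonsense characters stored in the HTML render as readable text */
@font-face {
  font-family: "Scrambled";
  src: url("/fonts/scrambled.woff2") format("woff2");
}
.obfuscated {
  font-family: "Scrambled", serif;
}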
You don’t.
If the person scraping your stuff for training data is acting in bad faith, you likely can't stop them.
robots.txt
If the person scraping the web is operating in good faith, this is all you need to do. If they're not, they'll ignore your robots.txt and scrape your site anyway.
poisoning
As soon as someone training a model discovers a new dataset-poisoning technique, they will write code to detect poisoned images and set them aside. Once they find time, they'll find a method of undoing the poison and will then consume your images anyway. Poison, much like encryption, must work forever: if at any point in the future it can be undone, all of those poisoned images will be trained on. The popular techniques I saw employed when this became a concern (Nightshade, WebGlaze) were detectable in days and defeated in weeks. I have no idea whether there are newer methods of poisoning that haven't been defeated yet, but it's moot; as soon as they are defeated, your stuff will be trained on.
banning scrapers when you detect them
IP addresses are cheap and bad actors are always changing their techniques.
Custom solutions
This may work simply due to security through obscurity, but it isn't guaranteed: if your technique becomes common, scrapers will adapt. It also requires you to self-host and implement your own solutions, which means you'd have to stop sharing your creations on other people's computers (read: all social media sites, all of the fediverse, and most indie sites). If you want to avoid feeding the machine, your techniques have to work every time; theirs only need to work once. Given that you need to make the image or text display on another person's computer, you're at a massive disadvantage.
There isn't anything you can really do about it. Even if legislation is passed, there will absolutely be unscrupulous actors still doing it; it's unenforceable. To go after someone, you'd need enough money to sue them, you'd have to figure out who they are, and you'd have to hope they're in a jurisdiction where you have legal standing. Not to mention prove that your work was trained on by any one particular model.
I've just let go. My personal data has zero impact on the quality of an AI model when these companies are chewing through terabytes of data anyhow. Not to mention that the amount of data (and compute, for that matter) required to train up new models has come down massively in the last two or three years.
If you want to block AI crawlers via robots.txt, add the appropriate disallow rules to your robots.txt file:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
This will prevent the most common AI crawlers from scraping your website content.
Also, you could use Cloudflare's bot protection to block AI crawlers. Cloudflare offers a dedicated firewall rule designed specifically to stop AI scrapers from accessing your site. You can read more about it here: Cloudflare's Firewall for AI.
If you want to take it a step further, you can also block these crawlers at the server level by using their User-Agent in your .htaccess file. This adds an extra layer of protection, especially against crawlers that might ignore the robots.txt directives.
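For example, on Apache with mod_rewrite enabled, a few lines like these in .htaccess (mirroring the user-agent list above) will answer those crawlers with a 403:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|Google-Extended|ClaudeBot|CCBot|anthropic-ai) [NC]
RewriteRule .* - [F,L]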
However, the most effective and long-term solution is to enable Cloudflare’s WAF (Web Application Firewall) combined with their AI scraping protection. This will significantly reduce the chances of AI crawlers accessing your content, making it much harder for them to scrape your artwork, written content, or any other creative work.
Blocking AI crawlers is becoming increasingly important as content scraping rises, and taking these steps can help protect your website from unauthorized data harvesting.
Thank you for bringing this up! I hadn't even begun to consider how to post poetry online. I will look into this, too.
Thank you, I think I might agree with this stance. I will still follow everyone's suggestions and continue researching, but I feel less afraid now. Thank you, everyone, for your help! :)
I simply don't post anything I consider valuable online at all. Anything I create that I don't want scraped I keep private, on my own computer and backup drives. Starbreaker is right – the big scumbags are currently just allowed to steal.
Nightshade and robots.txt are basically the only somewhat functional tools I know of.
In the end, I take @queuetea's view. I don't do visual art, just writing, which is basically impossible to obfuscate in the way Nightshade does if you want it to be at all accessible to screen readers, RSS, etc.
For me, getting ingested by AI is simply the price of admission. Just as I accept that I can’t control what human readers do with my writing, I accept that I can’t control what humans will instruct machines to do with my writing:
- If you want to copy my posts into Word and read them in a different font, I can't stop you.
- If you want to make blackout poetry of my hot take, I can’t do anything about that.
- If you want to make a word cloud, pop off I guess.
- If you want to do a bunch of matrix multiplications to train your weird AI…
If you don’t give a damn about getting indexed on the search engines (and you probably shouldn’t), here’s a solution I’ve tried:
Create a hidden <a href="/botpot.php" style="display:none">If you go here you will get your IP blocked</a>
link at the top of every page (or at least on your home page). A visit to /botpot.php shows a message saying something like:
This page is a bot honeypot. Your IP was blocked for visiting this page. Send an email to [obfuscated e-mail address] to have it unblocked.
A visit to /botpot.php also adds the visitor's IP to the list of blocked IPs in .htaccess.
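A bare-bones botpot.php could look something like this. It's only a sketch, and it assumes Apache still honors the old Order/Allow/Deny directives (mod_access_compat) and that your .htaccess already starts with "Order Allow,Deny" and "Allow from all":

<?php
// botpot.php: honeypot sketch. Anything that follows the hidden link lands here.
$ip = $_SERVER['REMOTE_ADDR'];

// Sanity-check the address before writing it into .htaccess.
if (filter_var($ip, FILTER_VALIDATE_IP)) {
    file_put_contents(__DIR__ . '/.htaccess', "Deny from {$ip}\n", FILE_APPEND | LOCK_EX);
}

http_response_code(403);
echo 'This page is a bot honeypot. Your IP was blocked for visiting this page. '
   . 'Send an email to [obfuscated e-mail address] to have it unblocked.';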
And yeah, this solution probably would be compatible with being indexed by search engines too. Use robots.txt to disallow /botpot.php. If a bot goes there despite this, you definitely know it’s not a good one.
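The robots.txt part of that is just:

User-agent: *
Disallow: /botpot.php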
Now I guess you could disallow the page showing your art too and be a bit more sure that it won't be slurped by AI bots. I guess it's a matter of whether it tries to visit /botpot.php first…