I’m thinking of using robots.txt to exclude all commercial search engines, SEO crawlers, AI/ML scrapers, etc., because there’s nothing wrong with Google, OpenAI, etc. that can’t be fixed with the sort of antitrust enforcement not seen since AT&T was broken up into the Baby Bells.
I know that honoring robots.txt is strictly voluntary, but I at least want to signal that I do not consent to having my work crawled by commercial robots.
I do, however, want to explicitly allow useful bots, or bots used by search engines that are at least somewhat friendly to the personal web. This is what I’ve got so far.
# My website is for people, not robots.
# These crawlers are either non-commercial
# or too small to dictate terms to indie website operators
User-agent: ia_archiver
Disallow:
User-agent: search.marginalia.nu
Disallow:
User-agent: duckduckbot
Disallow:
User-agent: MojeekBot
Disallow:
User-agent: WibyBot
Disallow:
# AI, SEO, and ad-tech crawlers deserve nothing.
User-agent: AdsBot-Google
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: peer39_crawler
Disallow: /
User-agent: peer39_crawler/1.0
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: YouBot
Disallow: /
# A reasonable compromise for everybody else.
User-agent: *
Disallow: /feeds/
Disallow: /blog/
Disallow: /fiction/
Disallow: /media/
Disallow: /assets/
Disallow: /downloads/
Disallow: /bookmarks/
# An XML sitemap is available for legitimate crawlers.
Sitemap: https://starbreaker.org/sitemap.xml
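One way to sanity-check rules like these before deploying them is Python’s standard-library `urllib.robotparser`, which parses a robots.txt and answers allow/deny questions per user agent. A minimal sketch, using an abridged copy of the rules above (the bot name `SomeOtherBot` and the page URLs are just placeholders for illustration):

```python
import urllib.robotparser

# An abridged version of the robots.txt above: one allowed bot,
# one fully blocked bot, and the catch-all record.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /blog/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Archive crawler with an empty Disallow: everything is allowed.
print(parser.can_fetch("ia_archiver", "https://starbreaker.org/blog/post.html"))  # True

# AI crawler disallowed from the whole site.
print(parser.can_fetch("GPTBot", "https://starbreaker.org/"))  # False

# Everybody else: blocked from /blog/, allowed elsewhere.
print(parser.can_fetch("SomeOtherBot", "https://starbreaker.org/blog/post.html"))  # False
print(parser.can_fetch("SomeOtherBot", "https://starbreaker.org/index.html"))  # True
```

Note that `can_fetch` uses the first record whose User-agent token matches, falling back to the `*` record, which mirrors how well-behaved crawlers are supposed to pick a record.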
Can anybody suggest other ethical crawlers that I should be including?
(edit 0: updated sample robots.txt based on discussion with @BradE.)
(edit 1: updated sample robots.txt based on further discussion and a desire to block AI scrapers entirely while not shutting out all commercial search engines.)