fun with robots.txt

I’m thinking of using robots.txt to exclude all commercial search engines, SEO crawlers, AI/ML scrapers, etc., because there’s nothing wrong with Google, OpenAI, etc. that can’t be fixed with the sort of antitrust enforcement not seen since AT&T was broken up into the Baby Bells.

I know that honoring robots.txt is strictly voluntary, but I at least want to signal that I do not consent to having my work crawled by commercial robots.

I do, however, want to explicitly allow useful bots, or bots used by search engines that are at least somewhat friendly to the personal web. This is what I’ve got so far:

# My website is for people, not robots.

# These crawlers are either non-commercial
# or too small to dictate terms to indie website operators
User-agent: ia_archiver
Disallow:

User-agent: search.marginalia.nu
Disallow:

User-agent: duckduckbot
Disallow:

User-agent: MojeekBot
Disallow:

User-agent: WibyBot
Disallow:

# AI crawlers deserve nothing.
User-agent: AdsBot-Google
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: magpie-crawler
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: peer39_crawler
Disallow: /

User-agent: peer39_crawler/1.0
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

# a reasonable compromise for everybody else.
User-agent: *
Disallow: /feeds/
Disallow: /blog/
Disallow: /fiction/
Disallow: /media/
Disallow: /assets/
Disallow: /downloads/
Disallow: /bookmarks/

# an XML sitemap is available for legitimate crawlers
Sitemap: https://starbreaker.org/sitemap.xml

Can anybody suggest other ethical crawlers that I should be including?

(edit 0: updated sample robots.txt based on discussion with @BradE.)

(edit 1: updated sample robots.txt based on further discussion and a desire to block AI scrapers entirely while not shutting out all commercial search engines.)

The short answer is that I would allow bots from at least Mojeek.com and Common Crawl, and probably Bing. Mojeek is a privacy search engine; it is commercial, but it is trying very hard to build its own index to compete with Big Tech. Common Crawl’s index is either open source or Creative Commons (I can’t remember), but it is used by some developing search engines as a starter index. Why Bing? I’m no fan of Microsoft, but the vast majority of search results on both DuckDuckGo and Ecosia come from Bing. If you cut off Bing, you cut off those two meta-search engines as well. (Duckduckbot mainly just fetches favicons.)

The longer answer is that by banning all bots except a small whitelist, you also cut off any new search engines still under development from indexing your website. We need to encourage new search engines to fight the Google monopoly, and people will not abandon Google unless there are viable alternatives. We need to nurture those.

Ack. This is my soapbox; I could go on forever, but I’ll cut it short. For reference: Seirdy’s A look at search engines with their own indexes.

Thanks for answering, but I disagree with most of these.

I had already considered allowing Common Crawl, but they do nothing to stop the likes of OpenAI from using the data collected by CCBot.

I understand that DuckDuckGo still depends on Bing’s index, as does Ecosia, but I regard their dependence on Microsoft as their problem, not mine. If they’re serious about being independent, they should make their own crawlers and their own indexes.

I’ve added MojeekBot because they’re not big enough to dictate to anybody. I had considered giving Kagi full access as well, since they charge people instead of serving ads, but they do a poor job of documenting how they get data and what crawlers they use. For example, I don’t know if their own bot identifies as “Teclis” or “TeclisBot”.

I think you’re assuming that I’ll never update my robots.txt file as I learn about new search engines, but I suppose most operators don’t bother.

I really don’t like this kind of “we need” rhetoric, but that’s my problem, not yours. Nor do I think that people will abandon Google while it remains the dominant player, and its dominance doesn’t seem likely to change because of market forces alone. I think the US government will actually have to enforce existing antitrust law for the first time in decades and break Google up.

In the meantime, and as a compromise, I’ve changed my default rule as follows:

User-agent: *
Allow: /index.html
Disallow: /

Crawlers I haven’t approved get access to my homepage, and nothing else. If a new search engine wants their crawler to have wider access, they can email me and ask.

The Seirdy page about search engines is good. One of the better search engines I know about for the “small web” is Wiby. You can submit your site and allow WibyBot in the robots.txt file.

Thanks, @brisray. @BradE had mentioned Seirdy’s page, and I had been using it to improve my file. I’ve also added WibyBot as a legit crawler that gets unrestricted access.

I also now have a shell script for generating an XML sitemap:

#!/usr/bin/env bash

# Base URL used in each <loc>; defaults to my site, override via the environment if needed.
URL="${URL:-https://starbreaker.org}"

# One UTC timestamp reused for every <lastmod> entry.
LASTMOD_DATE=$(date -u -Iseconds)

# Wrap every HTML and XML file under site/ in a <url> element.
ENTRIES=$(find site -name '*.html' -o -name '*.xml' \
  | sort -u \
  | awk '{printf "\t<url>\n\t\t<loc>%s</loc>\n\t\t<lastmod>__DATE__</lastmod>\n\t\t<changefreq>weekly</changefreq>\n\t</url>\n", $0}' \
  | sed -e "s|site/|${URL}/|g" \
        -e "s|__DATE__|${LASTMOD_DATE}|g")

cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${ENTRIES}
</urlset>
EOF
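
To actually use it, I run something like this from the directory that contains site/ (generate-sitemap.sh is just whatever name the script gets saved under):

chmod +x generate-sitemap.sh
./generate-sitemap.sh > site/sitemap.xml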

I did something similar, actually, but not as extensive:

User-agent: Googlebot
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

Also, fun fact: robots.txt doesn’t prevent search engines from indexing you; it only prevents crawling. If someone else links to you, they can still index the page. To prevent that too, you need the X-Robots-Tag header.
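
If you want to check whether a page is already sending that header, something like this should do it (assuming curl is available; note that HEAD and GET responses can differ on some servers):

curl -sI https://starbreaker.org/ | grep -i '^x-robots-tag'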

I think I had forgotten that. Thanks for reminding me.

It looks like I can specify an X-Robots-Tag header by file type or by user agent, but I can’t easily specify one in .htaccess for a particular user agent within a particular directory.
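
That said, if the host honours .htaccess overrides and has mod_setenvif and mod_headers enabled (which I’m assuming, so treat this as a sketch), something along these lines might get close, with the directory scoping coming from which directory’s .htaccess the lines go into:

# Example only: place in the directory to be covered, e.g. blog/.htaccess.
# The user-agent pattern is illustrative; adjust it to match the bots you care about.
SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|CCBot" deny_index
Header set X-Robots-Tag "noindex, nofollow" env=deny_index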

Guess I’ll need to rent a VPS and go deeper into self-hosting. But then I’d be able to use the server config to automatically rickroll visitors from Hacker News, which might make the migration worth the hassle. :evil:

Here is a pretty extensive list of crawlers, scrapers and other agents both good and evil: Agents | Dark Visitors.

I had found a bunch of definite AI/ML scrapers to block here.

Just refreshed the robots.txt file on Velvet’s website, and I’ll do the same for the others when I get the time. I did take some of your code, @starbreaker; I hope you don’t mind. I did credit you tho, as well as add some choice comments lol. I’m also not very tech-literate, so I didn’t touch most of the stuff I don’t know anything about yet.

# This website is made for me and me only

# If you're here to use it as training data for your AI/ML models
# I disrespectfully ask you to go fuck yourself

# This is for non-commercial/indie site crawlers btw 
# Special Thanks to Starbreaker and BradE for indexing some of them
User-agent: ia_archiver
Disallow:

User-agent: search.marginalia.nu
Disallow:

User-agent: duckduckbot
Disallow:

User-agent: MojeekBot
Disallow:

User-agent: WibyBot
Disallow:

# I hope your stocks plummet, Big Tech, fuck you
User-agent: AdsBot-Google
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: magpie-crawler
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: peer39_crawler
Disallow: /

User-agent: peer39_crawler/1.0
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

I disallow all bots on my blog. I’d rather miss out on possible hits than have an excessively inclusive policy. I’m glad this thread exists because I’d be happy to allow indie-friendly search engines and such.

I’m curious about the benefits of disallowing indexing (as opposed to just disallowing crawling)… I can see why if you specifically don’t want to show up on search results, but if your issue is more about your data being gathered, would disallowing crawling be enough?

Here’s mine

User-agent: ia_archiver
User-agent: MojeekBot
User-agent: search.marginalia.nu
Disallow:

User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: aiHitBot
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Brightbot 1.0
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Cotoyogi
User-agent: Crawlspace
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: Factset_spyderbot
User-agent: FirecrawlAgent
User-agent: FriendlyCrawler
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: imgproxy
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: MistralAI-User/1.0
User-agent: NovaAct
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: Operator
User-agent: PanguBot
User-agent: Perplexity-User
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: QualifiedBot
User-agent: Scrapy
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SWA
User-agent: Sidetrade indexer bot
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: wpbot
User-agent: YouBot
Disallow: /

Auto-generated from the list of known hostile bots at ai.robots.txt.
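
If you want to regenerate it yourself, something like this should fetch the project’s ready-made file (I’m assuming the generated robots.txt still lives at the repository root under that name, so double-check the repo if the path has changed):

curl -sL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt -o ai-robots.txt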

I’ve made mine as complete as I can, though I have also been pondering the overall blocking of bots, search engines be damned. It seems we no longer live in an age when robots.txt is respected, especially not by the main culprits we’re trying to block, but it’s something.

At some point it’ll be easier to just disallow everything except a few good actors - but for now, I’ll stick with blocking these 260 or so bad ones.

User-agent: .ai 
User-agent: AdsBot-Google
User-agent: Agentic
User-agent: AhrefsBot
User-agent: AI Article Writer
User-agent: AI Content Detector
User-agent: AI Dungeon
User-agent: AI Search Engine
User-agent: AI SEO Crawler
User-agent: AI Training
User-agent: AI Writer
User-agent: AI21 Labs
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: AI2Bot-Dolma
User-agent: AIBot
User-agent: aiHitBot
User-agent: AIMatrix
User-agent: AISearchBot
User-agent: AITraining
User-agent: Alexa
User-agent: Alpha AI
User-agent: AlphaAI
User-agent: Amazon Bedrock
User-agent: Amazon Comprehend
User-agent: Amazon Lex
User-agent: Amazon Sagemaker
User-agent: Amazon Silk
User-agent: Amazon Textract
User-agent: Amazon-Kendra
User-agent: Amazonbot
User-agent: AmazonBot
User-agent: Amelia
User-agent: AndersPinkBot
User-agent: Anthropic
User-agent: anthropic-ai
User-agent: AnyPicker
User-agent: Anyword
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Aria Browse
User-agent: Articoolo
User-agent: Automated Writer
User-agent: AwarioBot
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Azure
User-agent: BardBot
User-agent: BingBot
User-agent: Brave Leo
User-agent: Brightbot 1.0
User-agent: ByteDance
User-agent: Bytespider
User-agent: CatBoost
User-agent: CC-Crawler
User-agent: CCBot
User-agent: ChatGLM
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
User-agent: Chinchilla
User-agent: Claude
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: claude-web
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: ClearScope
User-agent: Cohere
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Common Crawl
User-agent: CommonCrawl
User-agent: Content Harmony
User-agent: Content King
User-agent: Content Optimizer
User-agent: Content Samurai
User-agent: ContentAtScale
User-agent: ContentBot
User-agent: Contentedge
User-agent: Conversion AI
User-agent: Copilot
User-agent: CopyAI
User-agent: Copymatic
User-agent: Copyscape
User-agent: Cotoyogi
User-agent: CrawlQ AI
User-agent: Crawlspace
User-agent: Crew AI
User-agent: CrewAI
User-agent: DALL-E
User-agent: DataForSeoBot
User-agent: DataProvider
User-agent: DeepAI
User-agent: DeepL
User-agent: DeepMind
User-agent: DeepSeek
User-agent: Diffbot
User-agent: diffbot
User-agent: Doubao AI
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: Facebookbot
User-agent: FacebookExternalHit
User-agent: Factset_spyderbot
User-agent: Falcon
User-agent: Firecrawl
User-agent: FirecrawlAgent
User-agent: Flyriver
User-agent: Frase AI
User-agent: FriendlyCrawler
User-agent: Gemini
User-agent: Gemma
User-agent: GenAI
User-agent: Genspark
User-agent: Gigabot
User-agent: GLM
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: Goose
User-agent: GPT
User-agent: GPTBot
User-agent: Grammarly
User-agent: Grendizer
User-agent: Grok
User-agent: GT Bot
User-agent: GTBot
User-agent: Hemingway Editor
User-agent: Hugging Face
User-agent: Hypotenuse AI
User-agent: iaskspider
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImageGen
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: imgproxy
User-agent: Inferkit
User-agent: INK Editor
User-agent: INKforall
User-agent: IntelliSeek
User-agent: ISSCyberRiskCrawler
User-agent: JasperAI
User-agent: Kafkai
User-agent: Kangaroo
User-agent: Kangaroo Bot
User-agent: Keyword Density AI
User-agent: Knowledge
User-agent: KomoBot
User-agent: LinkedInBot
User-agent: LLaMA
User-agent: LLMs
User-agent: magpie-crawler
User-agent: MarketMuse
User-agent: Meltwater
User-agent: Meta AI
User-agent: Meta-AI
User-agent: Meta-External
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: MetaAI
User-agent: MetaTagBot
User-agent: Mistral
User-agent: MistralAI-User/1.0
User-agent: Narrative
User-agent: NeevaBot
User-agent: Neural Text
User-agent: NeuralSEO
User-agent: Nova Act
User-agent: NovaAct
User-agent: Nutch
User-agent: OAI-SearchBot
User-agent: Omgili
User-agent: omgili
User-agent: Omgilibot
User-agent: omgilibot
User-agent: Open AI
User-agent: OpenAI
User-agent: OpenBot
User-agent: OpenText AI
User-agent: Operator
User-agent: Outwrite
User-agent: Page Analyzer AI
User-agent: PanguBot
User-agent: Paperlibot
User-agent: Paraphraser.io
User-agent: peer39_crawler
User-agent: Perplexity
User-agent: Perplexity-User
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Petalbot
User-agent: Phindbot
User-agent: PiplBot
User-agent: prefetch-proxy
User-agent: ProWritingAid
User-agent: psbot
User-agent: python-requests
User-agent: QualifiedBot
User-agent: QuillBot
User-agent: RobotSpider
User-agent: Robozilla
User-agent: Rytr
User-agent: SaplingAI
User-agent: Scalenut
User-agent: Scraper
User-agent: Scrapy
User-agent: ScriptBook
User-agent: Seekr
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SWA
User-agent: SentiBot
User-agent: sentibot
User-agent: Sentibot
User-agent: SEO Content Machine
User-agent: SEO Robot
User-agent: Sidetrade
User-agent: Sidetrade indexer bot
User-agent: Simplified AI
User-agent: Sitefinity
User-agent: Skydancer
User-agent: SlickWrite
User-agent: Sonic
User-agent: Spin Rewriter
User-agent: Spinbot
User-agent: Stability
User-agent: StableDiffusionBot
User-agent: Sudowrite
User-agent: Super Agent
User-agent: Surfer AI
User-agent: Teoma
User-agent: Text Blaze
User-agent: TextCortex
User-agent: The Knowledge AI
User-agent: TikTokSpider
User-agent: TimpiBot
User-agent: Timpibot
User-agent: TurnitinBot
User-agent: VelenPublicWebCrawler
User-agent: Vidnami AI
User-agent: Webzio
User-agent: Webzio-Extended
User-agent: webzio-extended
User-agent: Whisper
User-agent: WordAI
User-agent: Wordtune
User-agent: WormsGTP
User-agent: wpbot
User-agent: WPBot
User-agent: Writecream
User-agent: WriterZen
User-agent: Writescope
User-agent: Writesonic
User-agent: xAI
User-agent: xBot
User-agent: YouBot
User-agent: Youbot
User-agent: Zero GTP
User-agent: Zerochat
User-agent: Zhipu
User-agent: Zimm
Disallow: /
Disallow: *

User-agent: *
DisallowAITraining: /
DisallowAITraining: *
Content-Usage: ai=n

edit: added a bunch of new ones after some quick online searching
