fun with robots.txt

I’m thinking of using robots.txt to exclude all commercial search engines, SEO crawlers, AI/ML scrapers, etc. Because there’s nothing wrong with Google, OpenAI, etc. that can’t be fixed with the sort of antitrust enforcement not seen since AT&T was broken up into the Baby Bells.

I know that honoring robots.txt is strictly voluntary, but I at least want to signal that I do not consent to having my work crawled by commercial robots.

I do, however, want to explicitly include useful bots or bots used by search engines that are at least somewhat friendly to the personal web. This is what I’ve got so far.

# My website is for people, not robots.

# These crawlers are either non-commercial
# or too small to dictate terms to indie website operators
User-agent: ia_archiver
Disallow:

User-agent: search.marginalia.nu
Disallow:

User-agent: duckduckbot
Disallow:

User-agent: MojeekBot
Disallow:

User-agent: WibyBot
Disallow:

# AI crawlers deserve nothing.
User-agent: AdsBot-Google
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: magpie-crawler
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: peer39_crawler
Disallow: /

User-agent: peer39_crawler/1.0
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

# a reasonable compromise for everybody else.
User-agent: *
Disallow: /feeds/
Disallow: /blog/
Disallow: /fiction/
Disallow: /media/
Disallow: /assets/
Disallow: /downloads/
Disallow: /bookmarks/

# an XML sitemap is available for legitimate crawlers
Sitemap: https://starbreaker.org/sitemap.xml

Can anybody suggest other ethical crawlers that I should be including?

(edit 0: updated sample robots.txt based on discussion with @BradE.)

(edit 1: updated sample robots.txt based on further discussion and a desire to block AI scrapers entirely while not shutting out all commercial search engines.)

The short answer is that I would allow bots from at least Mojeek.com and Common Crawl, and probably Bing. Mojeek is a privacy-focused search engine. It is commercial, but it is trying very hard to build its own index to compete with Big Tech. Common Crawl’s index is either Open Source or Creative Commons (I can’t remember), but it is used by some developing search engines as a starter index. Why Bing? I’m no fan of Microsoft, but the vast majority of search results on both DuckDuckGo and Ecosia come from Bing. If you cut Bing off, you cut off those two meta-search engines as well. (DuckDuckBot mainly just fetches favicons.)

The longer answer is that by banning all bots except for a small whitelist, you also cut off any new search engines that are under development from indexing your website. We need to be encouraging new search engines to fight the Google monopoly and people will not abandon Google unless there are viable alternatives. We need to nurture those.

Ack. This is my soapbox; I could go on forever, but I’ll cut it short. For reference: Seirdy’s “A look at search engines with their own indexes.”

Thanks for answering, but I disagree with most of these.

I had already considered allowing Common Crawl, but they do nothing to stop the likes of OpenAI from using the data collected by CCBot.

I understand that DuckDuckGo still depends on Bing’s index, as does Ecosia, but I regard their dependence on Microsoft as their problem, not mine. If they’re serious about being independent, they should make their own crawlers and their own indexes.

I’ve added MojeekBot because they’re not big enough to dictate to anybody. I had considered giving Kagi full access as well, since they charge people instead of serving ads, but they do a poor job of documenting how they get data and what crawlers they use. For example, I don’t know if their own bot identifies as “Teclis” or “TeclisBot”.

I think you’re assuming that I’ll never update my robots.txt file as I learn about new search engines, but I suppose most operators don’t bother.

I really don’t like this kind of “we need” rhetoric, but that’s my problem, not yours. Nor do I think that people will abandon Google while it remains the dominant player, and its dominance doesn’t seem likely to change because of market forces alone. I think the US government will actually have to enforce existing antitrust law for the first time in decades and break Google up.

In the meantime, and as a compromise, I’ve changed my default rule as follows:

User-agent: *
Allow: /index.html
Disallow: /

Crawlers I haven’t approved get access to my homepage, and nothing else. If a new search engine wants their crawler to have wider access, they can email me and ask.
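One caveat, assuming crawlers match rules the way RFC 9309 describes: `Allow: /index.html` covers only paths that begin with `/index.html`, so a request for the bare root URL `/` still falls under `Disallow: /`. If the homepage is normally reached at `/` (an assumption about this site), a sketch that also admits the root could look like the following. The `$` end-of-path anchor is defined in RFC 9309, but not every crawler honors it.

```text
User-agent: *
# Allow the bare root URL only; "$" stops the rule from matching longer paths.
Allow: /$
Allow: /index.html
Disallow: /
```

Under longest-match rules, `Allow: /$` beats `Disallow: /` for the root, while every other path still matches only the `Disallow` line.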

The Seirdy page about search engines is good. One of the better search engines I know about for the “small web” is Wiby. You can submit your site and allow WibyBot in the robots.txt file.

Thanks, @brisray. @BradE had mentioned Seirdy’s page, and I had been using it to improve my file. I’ve also added WibyBot as a legit crawler that gets unrestricted access.

And I now have a shell script for generating an XML sitemap:

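#!/usr/bin/env bash
# Generate an XML sitemap listing every HTML and XML file under site/.

# Base URL of the published site (no trailing slash). Override via the
# environment; the default is an assumption taken from the Sitemap line
# in robots.txt.
URL="${URL:-https://starbreaker.org}"

# One ISO 8601 UTC timestamp shared by every entry (GNU date syntax).
LASTMOD_DATE=$(date -u -Iseconds)

# Find the pages, swap the leading "site/" for the base URL,
# then wrap each resulting URL in a <url> element.
ENTRIES=$(find site \( -name '*.html' -o -name '*.xml' \) \
  | sort -u \
  | sed -e "s|^site/|${URL}/|" \
  | awk -v date="${LASTMOD_DATE}" '{printf "\t<url>\n\t\t<loc>%s</loc>\n\t\t<lastmod>%s</lastmod>\n\t\t<changefreq>weekly</changefreq>\n\t</url>\n", $0, date}')

cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${ENTRIES}
</urlset>
EOF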

I did something similar, actually, but not as extensive:

User-agent: Googlebot
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

Also, fun fact: robots.txt doesn’t prevent search engines from indexing you. It only prevents crawling. If someone else links to you, a search engine can still index that URL without ever fetching it. To prevent that too, you need the X-Robots-Tag HTTP response header with a noindex directive.
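A minimal sketch of sending that header from an Apache `.htaccess` file. It assumes the host runs Apache with mod_headers enabled and permits `.htaccess` overrides, none of which the thread confirms:

```apache
# Assumes Apache with mod_headers enabled and .htaccess overrides allowed.
<IfModule mod_headers.c>
    # Ask compliant search engines not to index or follow anything served from this directory.
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
```

Note that a crawler only sees this header if it is allowed to fetch the page. A URL that robots.txt blocks outright never gets the chance to present it.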

I think I had forgotten that. Thanks for reminding me.

It looks like I can specify an X-Robots-Tag by file type or by user agent, but I can’t easily specify one in .htaccess for both a user agent and a particular directory.

Guess I’ll need to rent a VPS and go deeper into self-hosting. But then I’d be able to use the server config to automatically rickroll visitors from Hacker News, which might make the migration worth the hassle. :evil:

Here is a pretty extensive list of crawlers, scrapers, and other agents, both good and evil: the “Agents” page on Dark Visitors.

I had found a bunch of definite AI/ML scrapers to block here.
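Because a blocklist like this keeps growing, one option is to generate the stanzas from a plain list of names instead of typing each one by hand. The following is a minimal sketch in the same spirit as the sitemap script; `block_stanzas` is a hypothetical helper name, not something from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read user-agent names on stdin, one per line,
# and print a "Disallow: /" stanza for each.
block_stanzas() {
  while IFS= read -r agent; do
    # Skip blank lines and "#" comment lines in the input list.
    case "$agent" in
      ''|'#'*) continue ;;
    esac
    printf 'User-agent: %s\nDisallow: /\n\n' "$agent"
  done
}

# Example: three agents taken from the robots.txt earlier in this thread.
printf '%s\n' GPTBot CCBot Bytespider | block_stanzas
```

The output can be appended to the hand-written part of robots.txt, so the allowlist and the default rule stay under manual control.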