A tiny text file with enormous power
There's a file sitting at the root of your website — yoursite.com/robots.txt — that tells every search engine crawler what it can and can't access. It's been around since 1994, it's just plain text, and a single wrong line can make your entire site disappear from Google.
How robots.txt works
When Googlebot arrives at your site, the very first thing it does is check /robots.txt. The file contains rules like:
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
```
This says: "All crawlers can access everything except /admin/ and /private/. And here's where the sitemap is."
Simple, right? The problem is that small mistakes have big consequences.
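You can sanity-check rules like these with Python's standard-library `urllib.robotparser`. One caveat, purely about this sketch: Python's parser applies rules in file order (first match wins) rather than Google's longest-match rule, so the redundant `Allow: /` line is left out here to keep the two interpretations from diverging:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse the rules as if fetched from /robots.txt

print(rp.can_fetch("*", "https://example.com/blog/post"))    # True: not disallowed
print(rp.can_fetch("*", "https://example.com/admin/login"))  # False: under /admin/
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```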
The mistakes that keep happening
Accidentally blocking the whole site. This is more common than you'd think:
```
User-agent: *
Disallow: /
```
That single slash after Disallow blocks everything. Every page. Your entire site goes dark in Google. It usually happens when the rule is added during development to keep a staging site out of Google, and nobody remembers to remove it before launch.
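Two characters are enough to reproduce the disaster. A quick check with Python's `urllib.robotparser` shows the blanket Disallow shutting out even the homepage (the domain here is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])  # the accidental staging-site config

# Every URL on the site is now off-limits to every crawler:
print(rp.can_fetch("Googlebot", "https://example.com/"))           # False
print(rp.can_fetch("Googlebot", "https://example.com/products/"))  # False
```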
Blocking CSS and JavaScript. Old SEO advice said to block your CSS and JS files. That's terrible advice now: Google renders your pages to understand them, and blocking these resources means Google sees a broken page.
Blocking important sections by accident. A Disallow: /blog meant to block /blog-drafts/ will also block /blog/ — your entire blog.
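Disallow rules are prefix matches, which is easy to verify with `urllib.robotparser`. The hypothetical `Disallow: /blog` catches every path that starts with those five characters:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /blog"])

# All of these start with "/blog", so all are blocked:
for path in ("/blog-drafts/post", "/blog/my-article", "/bloggers"):
    print(path, rp.can_fetch("*", f"https://example.com{path}"))  # all False
```

The intended rule is `Disallow: /blog-drafts/` — the trailing slash stops the prefix from swallowing `/blog/`.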
No robots.txt at all. Without one, crawlers access everything (which might be fine) but you lose control over crawl budget optimization and can't point crawlers to your sitemap.
What robots.txt can and can't do
| Can do | Can't do |
|---|---|
| Prevent crawling of a URL | Prevent indexing (use noindex for that) |
| Control crawl budget allocation | Remove pages already in the index |
| Block specific crawlers | Guarantee protection of sensitive data |
| Point to your sitemap | Override a noindex directive |
This is a critical distinction: robots.txt blocks crawling, not indexing. If other sites link to a page you've blocked in robots.txt, Google might still index the URL — it just won't know what's on it.
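If the goal is to keep a page out of search results entirely, the page has to stay crawlable so Google can see a noindex directive, for example:

```html
<!-- In the page's <head>. If robots.txt blocks this page,
     Google can never fetch it and never sees this tag. -->
<meta name="robots" content="noindex">
```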
Checking your robots.txt
Every site should periodically verify that:
- The file exists and is accessible at /robots.txt
- Important pages aren't accidentally blocked
- CSS and JavaScript files aren't blocked
- The sitemap URL is included and correct
- No overly broad Disallow rules are catching pages they shouldn't
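A minimal audit along these lines can be sketched with `urllib.robotparser`. Everything here is illustrative — `audit_robots`, the path list, and the example domain are assumptions, not a real API:

```python
from urllib.robotparser import RobotFileParser

def audit_robots(robots_txt: str, must_be_crawlable: list[str]) -> dict:
    """Report important paths blocked for all crawlers, and whether a sitemap is declared."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    blocked = [path for path in must_be_crawlable
               if not rp.can_fetch("*", f"https://example.com{path}")]
    return {"blocked": blocked, "has_sitemap": rp.site_maps() is not None}

# An overly broad rule catches the blog, and no sitemap is declared:
report = audit_robots("User-agent: *\nDisallow: /blog\n",
                      ["/", "/blog/", "/assets/app.css"])
print(report)  # {'blocked': ['/blog/'], 'has_sitemap': False}
```

In production you'd fetch the live file instead of parsing a string, via `rp.set_url(...)` and `rp.read()`.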
Kaitico checks your robots.txt during every audit — verifying it's accessible, parsing the rules, and flagging any directives that might be blocking important content from search engines.