Everything you need to know about robots.txt

Last Updated: Feb. 17th 2022 at 11:40pm Tags: blog robotstxt seo

Robots.txt doesn't quite work as most people assume, so I am writing this article to clear things up.

I’m going to focus on how search engines use robots.txt rather than on how to write one.
For a more in-depth reference on robots.txt syntax, check out robotstxt.org.

Basic robots.txt file

Here is an example robots.txt file that blocks crawling of the /cache/, /tmp/, and /~joe/ directories:

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
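As a sketch of how a compliant crawler interprets these rules, Python’s standard-library `urllib.robotparser` can parse the file above and answer per-URL crawl checks (the bot name `MyBot` is just a placeholder):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the example rules directly instead of fetching a live robots.txt
rp.parse("""
User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
""".splitlines())

print(rp.can_fetch("MyBot", "/cache/page.html"))  # False: /cache/ is disallowed
print(rp.can_fetch("MyBot", "/blog/post.html"))   # True: not covered by any rule
```

A real crawler would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` before crawling the site.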

The User-agent line says which robots the rules that follow apply to (in this case, all of them).
You can find a list of known robots in Robotstxt.org’s database. Here is an example of how you would stop Google’s image crawler (which identifies itself as Googlebot-Image) from crawling your images, possibly to save bandwidth:

User-agent: Googlebot-Image
Disallow: /images

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
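The per-agent precedence can be checked with the same standard-library parser: a robot obeys the most specific matching User-agent group and ignores the `*` group entirely, while everything else falls back to `*`. (`SomeOtherBot` is a made-up name, and the rules below assume Google’s image-crawler token `Googlebot-Image`.)

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: Googlebot-Image
Disallow: /images

User-agent: *
Disallow: /cache/
""".splitlines())

# The image crawler is blocked from /images; other bots are not
print(rp.can_fetch("Googlebot-Image", "/images/logo.png"))  # False
print(rp.can_fetch("SomeOtherBot", "/images/logo.png"))     # True

# Because Googlebot-Image matched its own group, it skips the * rules,
# so /cache/ is NOT blocked for it
print(rp.can_fetch("Googlebot-Image", "/cache/a.html"))     # True
```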

Matt Cutts talks about how Google uses robots.txt vs noindex.

In summary, robots.txt only tells search engine crawlers whether they may crawl a page; it does not tell them whether they may index it. A page blocked by robots.txt can still end up in search results, for example if other sites link to it.
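To actually keep a page out of the index, the page must remain crawlable (not blocked by robots.txt) and carry a noindex directive, so the crawler can fetch the page and see it. A minimal example:

```html
<!-- In the page's <head>: tells crawlers not to index this page -->
<meta name="robots" content="noindex">
```

If robots.txt blocks the page, the crawler never fetches it and never sees the noindex directive, which is why the two mechanisms shouldn’t be combined for the same URL.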
