Robots.txt doesn't quite work as most people assume, so I am writing this article to clear things up.
I'm going to focus more on how search engines use robots.txt than on how to write one. For a more in-depth reference on how to use robots.txt, check out robotstxt.org.
Basic robots.txt file
Here is an example robots.txt file that blocks crawling of the /cache/, /tmp/, and /~joe/ directories:
User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
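If you want to sanity-check rules like these, Python's standard urllib.robotparser module implements the same allow/disallow matching most crawlers use. A minimal sketch (example.com and the test paths are just illustrative):

```python
from urllib.robotparser import RobotFileParser

# The example file above: every robot is blocked from three directories.
rules = """\
User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Anything outside the disallowed directories is fair game.
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))   # True
# Paths under /cache/, /tmp/, or /~joe/ are blocked for every agent.
print(rp.can_fetch("AnyBot", "http://example.com/tmp/scratch"))  # False
```

Note that an unmatched path defaults to allowed; robots.txt is a blocklist, not an allowlist.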
The User-agent line addresses the instructions that follow it to specific robots (in this case, all of them). You can find a list of known robots in Robotstxt.org's database. Here is an example of how you would stop Googlebot from crawling your images, perhaps to save bandwidth:
User-agent: Googlebot
Disallow: /images

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
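One subtlety worth checking: a robot that matches a specific User-agent group obeys only that group and ignores the * group entirely. Python's urllib.robotparser shows this behavior (URLs here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# The per-agent example above: Googlebot gets its own group,
# every other robot falls through to the * group.
rules = """\
User-agent: Googlebot
Disallow: /images

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Disallow: /~joe/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot obeys only its own group: /images is blocked for it...
print(rp.can_fetch("Googlebot", "http://example.com/images/logo.png"))  # False
# ...but the * rules do NOT apply to it, so /cache/ is still crawlable.
print(rp.can_fetch("Googlebot", "http://example.com/cache/page.html"))  # True
# Every other robot uses the * group instead.
print(rp.can_fetch("OtherBot", "http://example.com/cache/page.html"))   # False
```

So if you want Googlebot to respect the general rules too, you have to repeat them inside its group.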
Links still show up in Google
Matt Cutts talks about how Google uses robots.txt vs noindex.
In summary, robots.txt only tells search engine crawlers whether they may crawl a page; it says nothing about whether they may index it. If other sites link to a URL that robots.txt blocks, Google can still show that URL in its results without ever crawling it. To keep a page out of the index, let it be crawled and add a noindex robots meta tag (<meta name="robots" content="noindex">); a crawler can only see that tag if robots.txt allows it to fetch the page.