Robots.txt Tutorial

Robots.txt Tutorial
Robots.txt Tutorial

IMPORTANT: be very careful when playing about with your robots.txt, its very easy to block off your entire site without realising!…

Covered in this post:

What is a Robots.txt?

A Robots.txt is a file used by  webmaster  to advise spiders and bots (e.g. googlebot)  where in the website they are allowed to crawl. The robots.txt is stored in a website’s root folder, for example:  http://www.domain.com/robots.txt

When should i use Robots.txt?

Your robots.txt should be used to help bots like Googlebot crawl your website and guide where they aren’t supposed to go.

DO NOT use a robots.txt to try and block scrapers or other heavy handed crawlers. At the end of the day it’s up to the bot whether they respect your robots.txt or not, so in all likely hood it won’t even get read by these crawlers.

Another important thing worth mentioning is that anyone can see your robots.txt. So bare this in mind where you are writing it, you don’t want to include anything like: Disallow: /my-secret-area/

How does it all work?

The best way to learn how it works is probably to look at some examples. In its simplest form the contents of a robots.txt might look like this:

User-agent: *
Disallow:

In this example the definitions is saying, for ALL (*) User-Agents, Disallow nothing i.e. feel free to crawl anywhere you want.

To do the opposite and block EVERYTHING you would use:

User-agent: *
Disallow: /

You can also use the ALLOW property which works in the opposite way.

ALLOW EVERYTHING

User-agent: *
Allow: /

BLOCK EVERYTHING

User-agent: *
Allow:

Googlebot and My Robots.txt

You can specify a rule for just Googlebot by using the User-agent property.

BLOCK GOOGLEBOT FROM ACCESSING YOUR SITE

User-agent: Googlebot
Disallow: /

BLOCK GOOGLEBOT-IMAGE FROM ACCESSING YOUR SITE

User-agent: Googlebot-Image
Disallow: /

Google (and some other bots) respect the * or greedy character. This can be very helpful for blocking off areas which contain similar URL parameters. The robots example below tells Googlebot NOT to access anything with a ? in it.

User-agent: Googlebot
Disallow: /*?

Please note, in the past i’ve noticed defining a Googlebot user agent rule has caused Googlebot to completely ignore all other User-agent: * rules… I don’t know whether they have changed this yet but its worth baring it in mind.

[UPDATE] WARNING: If you have ‘User-Agent: *’ AND a ‘User-Agent:Googlebot’, Googlebot will ignore everything you defined in the * definition!!!! I don’t think many people realise this so be very careful. Remember, ALWAYS test your changes using Google Webmaster Tool’s Robots.txt Tool. If you don’t have a account for Google Webmaster Tools, GET ONE NOW!

[UPDATE END]

The X-Robots-Tag HTTP header

You can also define your robots at a page level by using a robots meta tag. The first of these examples is basically the default a bot would assume if no definition was provided…

<meta content="index, follow" name="robots" /> (default - no robots tag)
<meta content="noindex, follow" name="robots" />
<meta content="index, nofollow" name="robots" />
<meta content="noindex, nofollow" name="robots" />

Please note, the NOFOLLOW in these examples should not be confused with the rel=”nofollow” of links.

Useful Tools / Resources


Leave a Reply