The Robots.txt File

 

The robots.txt file is a simple (optional) text file that contains directives for compliant search engines.

It tells search engines what they can and can't index.

When a search engine visits your site, it looks for the robots.txt file in the root of the site. If the file is present, it reads the directives; if not, it carries on and indexes your site.

Note that you can only have one robots.txt file per site, and it must be in the root directory. For example, for my site the correct location is:

www.build-your-website.co.uk/robots.txt

and not:

www.build-your-website.co.uk/newsletter/robots.txt
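
As a minimal sketch of the rule above, this Python snippet works out where a search engine would look for robots.txt, given any page address on a site. The function name robots_txt_url and the page issue1.html are made up for illustration, and only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# issue1.html is a hypothetical page used only for this example.
page = "http://www.build-your-website.co.uk/newsletter/issue1.html"
print(robots_txt_url(page))
```

Whatever directory the page lives in, the result always points at the root of the site.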

What It Can Do

Using simple directives you can:

  • Exclude all search engines from individual files or directories.
  • Exclude individual search engines from individual files or directories.
  • Tell search engines the name and location of the sitemap file (new directive).

Why You Should Use It

  • It can provide basic privacy by keeping pages that you want kept private out of the search engines. Bear in mind that the robots.txt file itself is publicly readable and only compliant search engines obey it, so it is not a substitute for real access control.
  • It can prevent bad link problems. If you remove a page from your site, the search engine may still think it's there and get a 404 error. You can prevent this by instructing the search engine not to bother trying to index the file.
  • You can prevent duplicate content problems. If you have an article on the robots.txt file and also include most of the article in a blog post or newsletter, then the search engines may see it as duplicate content. The robots.txt file can be used to tell the search engines to ignore the blog post or newsletter.

 

Format

Note that, unless explicitly disallowed, search engines are allowed to crawl your site/pages.

The file consists of a series of records, with each record separated from the next by a blank line. Each record consists of a User-agent directive followed by one or more Disallow directives.

Examples:

This single record entry stops all search engines from indexing the site:

User-agent: *
Disallow: /
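
To see how a compliant robot might read this record, here is a small sketch using Python's standard urllib.robotparser module. The robot name examplebot, the helper parse_rules and the page address are invented for the example, and real search engines use their own parsers, so treat it as an illustration rather than a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


block_all = parse_rules("User-agent: *\nDisallow: /\n")
no_rules = parse_rules("")  # an empty file: nothing is disallowed

# examplebot and this address are hypothetical.
url = "http://www.sitename.com/index.html"
print(block_all.can_fetch("examplebot", url))  # False
print(no_rules.can_fetch("examplebot", url))   # True
```

The second result reflects the default: with no Disallow directive, the page may be crawled.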

 

This two-record entry stops the search engine called badrobot from indexing the site, and stops all other search engines from indexing the directories called apps and private:

User-agent: badrobot
Disallow: /

User-agent: *
Disallow: /apps/
Disallow: /private/
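
As a rough check of the two records above, this sketch feeds them to Python's standard urllib.robotparser module and asks which pages each robot may fetch. The name otherbot and the page addresses are hypothetical, and the module only approximates what a real search engine does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: badrobot
Disallow: /

User-agent: *
Disallow: /apps/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "http://www.sitename.com"
# (robot name, page path) pairs; all of them are made up for the example.
checks = [
    ("badrobot", "/index.html"),
    ("otherbot", "/index.html"),
    ("otherbot", "/apps/tool.html"),
    ("otherbot", "/private/notes.html"),
]

results = {}
for agent, path in checks:
    results[(agent, path)] = parser.can_fetch(agent, site + path)
    print(agent, path, results[(agent, path)])
```

Only the second check should come back True: badrobot is shut out of everything, while other robots are shut out of just the two directories.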

 



This is the same example as above, but using comments and the sitemap directive:

# Stop badrobot from indexing site
User-agent: badrobot
Disallow: /

# Stop all from indexing apps and private directories
User-agent: *
Disallow: /apps/
Disallow: /private/

Sitemap: http://www.sitename.com/sitemap.xml

You should note that the sitemap directive is independent of the user agent, so no User-agent directive is required. The sitemap directive can appear anywhere in the robots.txt file.
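
The sketch below illustrates that independence by putting the Sitemap line first, before any User-agent directive, and reading it back with Python's standard urllib.robotparser module. It assumes Python 3.8 or later, which is when the site_maps() method was added; the page address is hypothetical.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
Sitemap: http://www.sitename.com/sitemap.xml

# Stop badrobot from indexing site
User-agent: badrobot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the Sitemap entries found, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The record that follows the Sitemap line is still applied as usual.
blocked = not parser.can_fetch("badrobot", "http://www.sitename.com/index.html")
print(blocked)
```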

 

Robots.txt File Creation and Checking

You can create a robots.txt file using any text editor, such as Notepad. However, if you are uneasy about it, there are a number of free online tools that will create the file for you, and others that will check the syntax to make sure there are no errors.

You should also use Google Webmaster Tools (Webmaster tools overview) to check that the robots.txt file will perform as expected. I feel this is a must, because an incorrectly configured robots.txt file could be disastrous to your site.
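
If you prefer to script it, here is one possible sketch that builds the file text in Python and then runs a quick local check with the standard urllib.robotparser module before you upload anything. The function build_robots_txt, the robot name otherbot and the page addresses are all made up for this example, and a local check like this is only a first pass, not a replacement for testing with the search engine's own tools.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(records, sitemap=None):
    """Build robots.txt text from (user_agent, [disallowed paths]) pairs."""
    blocks = []
    for agent, paths in records:
        lines = ["User-agent: " + agent]
        lines += ["Disallow: " + path for path in paths]
        blocks.append("\n".join(lines))
    if sitemap:
        blocks.append("Sitemap: " + sitemap)
    # Records are separated from each other by a blank line.
    return "\n\n".join(blocks) + "\n"


text = build_robots_txt(
    [("badrobot", ["/"]), ("*", ["/apps/", "/private/"])],
    sitemap="http://www.sitename.com/sitemap.xml",
)
print(text)

# Quick local check of the generated text (hypothetical robots and pages).
parser = RobotFileParser()
parser.parse(text.splitlines())
assert not parser.can_fetch("badrobot", "http://www.sitename.com/index.html")
assert parser.can_fetch("otherbot", "http://www.sitename.com/index.html")
assert not parser.can_fetch("otherbot", "http://www.sitename.com/private/notes.html")
```

Save the printed text as robots.txt in the root directory of the site.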
