What is Robots.txt

Robots.txt is a plain text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note saying “Please, do not enter” on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.

Robots.txt Notes

  • The exact mixed-case spelling of directives may be required by some crawlers, so be sure to capitalize Allow: and Disallow:, and remember the hyphen in User-agent:
  • An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, that robot may not check the general all-robots section, so repeat the general directives there.
  • The user agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. It should not matter how the name itself is capitalized.
  • Disallow tells robots not to crawl anything which matches the following URL path
  • Allow is a newer directive: older robot crawlers will not recognize it.
    • Historical Note: the 1996 robots.txt draft RFC actually did include "Allow", but everyone seems to have ignored it until around 2005, and even then it was not well documented.
  • URL paths are often case sensitive, so be consistent with the site capitalization
  • The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL
  • In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, i.e. an implied right wildcard.
  • One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers
  • Wildcards do not lengthen a path -- if a wildcard directive path is shorter, as written, than one without a wildcard, the one with the path spelled out will generally override the one with the wildcard.
  • Sitemap is a new directive for the location of the Sitemap file
  • A blank line indicates a new user agent section.
  • A hash mark (#) indicates a comment
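The longest-match and wildcard rules above can be sketched in Python. This is a minimal illustration of the matching behavior only, not a full robots.txt parser; the list-of-tuples rule format is invented for this example:

```python
import re

def rule_matches(path, url_path):
    # Translate a robots.txt path into a regex: '*' matches any run of
    # characters, and the path is an implied right-truncated prefix
    # (re.match anchors at the start of the URL path).
    pattern = ".*".join(re.escape(part) for part in path.split("*"))
    return re.match(pattern, url_path) is not None

def is_allowed(rules, url_path):
    # rules: list of (directive, path) pairs, e.g. ("Disallow", "/folder/").
    best = None       # directive of the longest matching rule seen so far
    best_len = -1
    for directive, path in rules:
        if not path:
            continue  # a bare "Disallow:" matches nothing, i.e. allows all
        if rule_matches(path, url_path):
            # Length is counted on the literal path as written,
            # not on what a wildcard expands to.
            if len(path) > best_len:
                best, best_len = directive, len(path)
    # No matching rule, or a winning Allow rule, permits crawling.
    return best != "Disallow"

rules = [("Disallow", "/folder1/"), ("Allow", "/folder1/myfile.html")]
print(is_allowed(rules, "/folder1/myfile.html"))  # True: Allow path is longer
print(is_allowed(rules, "/folder1/other.html"))   # False: only Disallow matches
```

Note that, per the wildcard rule above, a path like "*.pdf" counts as only five characters when lengths are compared, regardless of what the asterisk matches.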

Example Robots.txt Format

Allow indexing of everything
User-agent: *
Disallow:
Disallow indexing of everything
User-agent: *
Disallow: /
Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/
Disallow Googlebot from indexing a folder, except for one file in that folder
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
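The Googlebot example above can be checked with Python's standard-library urllib.robotparser (the example.com URLs are placeholders). One caveat: Python's parser applies rules in file order (first match wins) rather than the longest-match rule described earlier, so the Allow line is placed before the Disallow line in this sketch:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the Googlebot example above.
# The Allow line comes first because urllib.robotparser uses
# first-match semantics, not longest-match.
robots_txt = """\
User-agent: Googlebot
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/folder1/myfile.html"))  # True
print(rp.can_fetch("Googlebot", "http://example.com/folder1/other.html"))   # False
```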
