What is Robots.txt
Robots.txt is a text (not HTML) file you put on your site to tell
search robots which pages you would like them not to visit.
Robots.txt is by no means mandatory for search engines, but generally
search engines obey what they are asked not to do. It is important to
clarify that robots.txt is not a way of preventing search engines
from crawling your site (i.e. it is not a firewall, or a kind of
password protection); putting up a robots.txt file is
something like putting a note “Please, do not enter” on
an unlocked door: you cannot prevent thieves from coming
in, but the good guys will not open the door and enter. That is why we
say that if you have really sensitive data, it is too naïve to
rely on robots.txt to protect it from being indexed and displayed in
search results.
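To make that concrete, here is a minimal robots.txt that asks all robots to stay out of one directory (the /private/ folder name is just a placeholder for illustration):
User-agent: *
Disallow: /private/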
Robots.txt Notes
- Directives may need to be spelled in their exact mixed case, so be sure to capitalize Allow: and Disallow:, and remember the hyphen in User-agent:
- An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, it may not check the general all-robots section, so repeat the general directives there.
- The user agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. It should not matter how the name itself is capitalized.
- Disallow tells robots not to crawl anything which matches the following URL path
- Allow is a newer directive; older robot crawlers will not recognize it.
- Historical Note: the 1996 robots.txt draft RFC actually did include "Allow". But everyone seems to have ignored that until around 2005, and even then, it was not documented.
- URL paths are often case sensitive, so be consistent with the site capitalization
- The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL
- In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, i.e. an implied right wildcard.
- One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers
- Wildcards do not lengthen a path: if a directive path containing a wildcard is shorter, as written, than one without a wildcard, the fully spelled-out path will generally override the wildcard one (see the example after this list).
- Sitemap is a new directive for the location of the Sitemap file
- A blank line indicates a new user agent section.
- A hash mark (#) indicates a comment
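To illustrate the longest-match and wildcard notes above (the folder names are invented, and support for Allow and * varies between crawlers), a section like the following would keep /shop/public/ crawlable while blocking the rest of /shop/, and would also block URLs whose paths contain .pdf for crawlers that understand wildcards:
User-agent: *
# the longest matching path wins, so the Allow below overrides the shorter Disallow
Disallow: /shop/
Allow: /shop/public/
# wildcard path; older crawlers will ignore this line
Disallow: /*.pdf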
Example Robots.txt Format
Allow indexing of everything:
User-agent: *
Disallow:

Disallow indexing of everything:
User-agent: *
Disallow: /

Disallow indexing of a specific folder:
User-agent: *
Disallow: /folder/

Disallow Googlebot from indexing of a folder, except for allowing the indexing of one file in that folder:
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
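Putting several of the notes together, a fuller hypothetical file might combine comments, a robot-specific section (with the general rule repeated, since a robot that finds its own section may skip the general one), and a Sitemap line; the folder names and sitemap URL below are placeholders:
# keep all robots out of the staging area
User-agent: *
Disallow: /staging/

# Slurp gets its own section, so the general rule is repeated here
User-agent: Slurp
Disallow: /staging/
Disallow: /search/

Sitemap: http://www.example.com/sitemap.xml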