There is a hidden, relentless force that permeates the web and its billions of pages and files, unknown to most of us: search engine crawlers and robots. Every day hundreds of them scour the web, whether it's Google trying to index the entire web or a spam bot collecting every email address it can find for less than honorable purposes. As site owners, what little control we have over what robots are allowed to do when they visit our sites exists in a magical little file called "robots.txt."
"Robots.txt" is a regular text file that, through its name, has special meaning to the majority of "honorable" robots on the web. By defining a few rules in this text file, you can instruct robots not to crawl and index certain files or directories within your site, or the site as a whole. For example, you may not want Google to crawl the /images directory of your site, as it's both meaningless to you and a waste of your site's bandwidth. "Robots.txt" lets you tell Google just that.
So let's get moving. Create a regular text file called "robots.txt", and make sure it's named exactly that. This file must be uploaded to the root accessible directory of your site, not a subdirectory (i.e. http://www.technetsource.com/ but NOT http://www.technetsource.com/stuff/). Only when both of these rules are followed will search engines interpret the instructions contained in the file. Deviate from this, and "robots.txt" becomes nothing more than a regular text file, like Cinderella after midnight.
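As a rough illustration of the "root directory only" rule, the sketch below works out where a crawler would look for "robots.txt" given any page on a site. The function name robots_txt_url and the page /stuff/page.htm are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page inside a subdirectory; the file is still looked up at the root.
print(robots_txt_url("http://www.technetsource.com/stuff/page.htm"))
# → http://www.technetsource.com/robots.txt
```

Whatever the page's path is, the lookup address is always /robots.txt on the same host, which is why a copy placed in /stuff/ is never consulted.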
Now that you know what to name your text file and where to upload it, you need to learn what to actually put in it to send commands to search engines that follow this protocol (formally the "Robots Exclusion Protocol"). The format is simple enough for most intents and purposes: a User-agent: line to identify the crawler in question, followed by one or more Disallow: lines to disallow it from crawling certain parts of your site.
1) Here's a basic "robots.txt":
User-agent: *
Disallow: /
With the above declared, all robots (indicated by "*") are instructed not to crawl any of your pages (indicated by "/"). Most likely not what you want, but you get the idea.
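If you'd like to check how a rule set is read before uploading it, one option is Python's standard urllib.robotparser module. The sketch below feeds it the basic file from example 1. The helper name allowed and the bot name "ExampleBot" are assumptions made for this illustration, and real crawlers may interpret edge cases differently than this module does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" is a made-up crawler name; "*" applies to every robot.
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.technetsource.com/"))
# → False
print(allowed(ROBOTS_TXT, "Googlebot", "http://www.technetsource.com/tutorials/blank.htm"))
# → False
```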
2) Let's get a little more discriminatory now. While every webmaster loves Google, you may not want Google's image bot crawling your site's images and making them searchable online, if only to save bandwidth. The declaration below will do the trick:
User-agent: Googlebot-Image
Disallow: /
3) The following disallows all search engines and robots from crawling select directories and pages:
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
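Each Disallow: value is matched against the beginning of the requested path. The following is a minimal sketch of that prefix matching using the three rules from example 3. The function name is_blocked and the test paths are invented for illustration, and wildcard extensions that some crawlers support are deliberately left out.

```python
# The Disallow values from example 3.
DISALLOWED = ["/cgi-bin/", "/privatedir/", "/tutorials/blank.htm"]


def is_blocked(path: str, rules: list[str]) -> bool:
    """True if path begins with any non-empty Disallow value."""
    return any(path.startswith(rule) for rule in rules if rule)


# Hypothetical request paths.
print(is_blocked("/cgi-bin/search.pl", DISALLOWED))    # → True
print(is_blocked("/tutorials/blank.htm", DISALLOWED))  # → True
print(is_blocked("/tutorials/index.htm", DISALLOWED))  # → False
print(is_blocked("/cgi-bin", DISALLOWED))              # → False
```

Note the last case: under plain prefix matching, the trailing slash in /cgi-bin/ means the rule covers what is inside the directory but not the bare path /cgi-bin.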
4) You can conditionally target multiple robots in "robots.txt." Take a look at the below:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. So the rules of specificity apply, not inheritance: a crawler obeys only the group that names it most specifically and ignores the "*" group entirely.
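To see the "specificity, not inheritance" behavior in action, here is a small sketch that runs the rules from example 4 through Python's standard urllib.robotparser. The crawler name "ExampleBot" and the page /index.htm are placeholders chosen for this example, and the module is only an approximation of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the blanket "Disallow: /" is ignored.
print(parser.can_fetch("Googlebot", "/index.htm"))          # → True
print(parser.can_fetch("Googlebot", "/cgi-bin/search.pl"))  # → False

# Any other crawler (made-up name here) falls back to the "*" group.
print(parser.can_fetch("ExampleBot", "/index.htm"))         # → False
```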
5) There is a way to use Disallow: to essentially turn it into "Allow all", and that is by not entering a value after the colon (:):
User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow:
Here I'm saying all crawlers should be prohibited from crawling our site, except for Alexa's crawler (ia_archiver), which is allowed to crawl everything.
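If you keep your rules in a script rather than typing the file by hand, the empty Disallow: trick is easy to generate. The sketch below is one possible way to do it: build_robots_txt is a hypothetical helper, not part of any library, and the result is checked with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(groups: dict[str, list[str]]) -> str:
    """Render {user-agent: [disallowed paths]} as robots.txt text."""
    lines = []
    for agent, disallowed in groups.items():
        lines.append(f"User-agent: {agent}")
        # An empty list becomes a bare "Disallow:", meaning nothing is off limits.
        for path in disallowed or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        lines.append("")  # blank line between groups
    return "\n".join(lines)


text = build_robots_txt({"*": ["/"], "ia_archiver": []})
print(text)

parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("ia_archiver", "/index.htm"))  # → True
print(parser.can_fetch("ExampleBot", "/index.htm"))   # → False ("ExampleBot" is a made-up name)
```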
6) Finally, some crawlers support an additional field called "Allow:", most notably Google. As its name implies, "Allow:" lets you explicitly dictate what files/folders can be crawled. However, this field was not part of the original "robots.txt" protocol (it has since been standardized in RFC 9309), so my recommendation is to use it only if absolutely needed, as it might confuse some less intelligent crawlers.
Per Google's FAQs for webmasters, the below is the preferred way to disallow all crawlers from your site EXCEPT Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
The "robots" meta tag
If your web host prohibits you from uploading "robots.txt" to the root directory, or you simply wish to restrict crawlers from a few select pages on your site, an alternative to "robots.txt" is to use the robots meta tag.
Creating your "robots" meta tag
The "robots" meta tag looks similar to any other meta tag, and should be added inside the HEAD section of the page(s) in question:
<meta name="robots" content="noindex,nofollow" />
Here's a list of the values you can specify within the "content" attribute of this tag:
| Value | Description |
|---|---|
| (no)index | Determines whether the crawler should index this page. Possible values: "noindex" or "index." |
| (no)follow | Determines whether the crawler should follow links on this page and crawl them. Possible values: "nofollow" or "follow." |
Here are a few examples:
1) This disallows both indexing and following of links by a crawler on that specific page:
<meta name="robots" content="noindex,nofollow" />
2) This disallows indexing of the page, but lets the crawler go on and follow/crawl links contained within it:
<meta name="robots" content="noindex,follow" />
3) This allows indexing of the page, but instructs the crawler not to crawl links contained within it:
<meta name="robots" content="index,nofollow" />
4) Finally, there is a shorthand way of declaring 1) above (neither index the page nor follow its links):
<meta name="robots" content="none" />
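To make the meaning of these values concrete, here is a rough sketch of how a crawler might read the tag, written with Python's standard html.parser module. It assumes the tag carries name="robots", treats "none" as shorthand for "noindex,nofollow", and defaults to index and follow when no tag is present. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Split the comma-separated values, ignoring case and whitespace.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for a page; both default to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


print(robots_policy('<head><meta name="robots" content="noindex,follow" /></head>'))
# → (False, True)
print(robots_policy('<head><meta name="robots" content="none" /></head>'))
# → (False, False)
```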
Last update: 2010-01-31 01:04
Author: Admin
Revision: 1.3