Robots.txt files can be used to block bots (the most common use) and, at the same time, specify which files can be crawled. To do this, use the Allow directive, as in the example below.
Allow: /private/public.doc
Disallow: /private/
According to Bing, if “there is some logical confusion and both Allow and Disallow directives apply to a URL, the Allow directive takes precedent.”
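As a rough way to check rules like these before publishing them, the sketch below feeds the example to Python's standard-library urllib.robotparser. The "User-agent: *" line, the bot name ExampleBot and the second URL are assumptions added for illustration. Note that this parser applies the first rule that matches, which is not necessarily how Bing resolves conflicts; it gives the same answer here only because the Allow line is listed first.

```python
from urllib.robotparser import RobotFileParser

# "User-agent: *" is an assumption added so the parser has a group to attach
# the rules to; the Allow/Disallow lines are the ones from the example above.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public.doc
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and secret.doc are hypothetical names used only for this check.
public_ok = parser.can_fetch("ExampleBot", "http://www.your-url.com/private/public.doc")
other_ok = parser.can_fetch("ExampleBot", "http://www.your-url.com/private/secret.doc")

print("public.doc crawlable:", public_ok)
print("secret.doc crawlable:", other_ok)
```

With the Allow line first, public.doc is reported as crawlable while everything else under /private/ stays blocked.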
Other things you can or should do with your robots.txt file include:
Wildcards – Wildcards can be used in a variety of ways in robots.txt files, such as:
- Blocking bots from accessing all URLs that contain a specific directory name;
- Blocking bots from accessing all URLs that end with a specific string regardless of the directory where it is found; and
- Blocking bots from accessing all URLs that contain a specific string anywhere in the URL.
All of the above can be done using the “*” character, which stands for any sequence of characters in a URL string. To filter by file extension, the “$” character must also be used to mark the end of the URL.
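To make the wildcard behavior concrete, here is a minimal sketch of how a crawler might interpret “*” and “$” by translating a rule path into a regular expression. The function names and the sample patterns (/*/drafts/, /*.pdf$, /*sessionid=) are hypothetical and chosen only for illustration; real bots may differ in details.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" pins the match
    to the end of the URL. Without "$" the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def rule_matches(pattern, path):
    """Return True if the URL path falls under the given rule pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None

# A directory name below the top level.
print(rule_matches("/*/drafts/", "/blog/drafts/post.html"))
# A file extension at the very end of the URL.
print(rule_matches("/*.pdf$", "/files/report.pdf"))
print(rule_matches("/*.pdf$", "/files/report.pdf?download=1"))
# A string anywhere in the URL.
print(rule_matches("/*sessionid=", "/shop/item?sessionid=42"))
```

The “$” case is the one that most often surprises people: without it, /*.pdf would also match a URL such as /files/report.pdf?download=1.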
XML Sitemaps – Add a reference to your sitemap at the end of your robots.txt file to make it easy for bots to find their way through your site. To reference a sitemap, use the following syntax:
Sitemap: http://www.your-url.com/sitemap.xml
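One quick way to confirm that the Sitemap line is readable is to parse the file and list the sitemaps found. The sketch below assumes Python 3.8 or later, where urllib.robotparser provides site_maps(); the "User-agent" and "Disallow" lines are placeholder content added for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule group is placeholder content; the Sitemap line uses the syntax above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.your-url.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if no Sitemap line was found.
sitemaps = parser.site_maps()
print(sitemaps)
```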
File format – Make sure to save your robots.txt file as plain text in a standard encoding such as ASCII or UTF-8.
Validate at Bing’s Webmaster Central – Bing has an online robots.txt validation tool. If you are not a member, join, or at least use another online validation tool.
Source: Prevent a bot from getting “lost in space” (SEM 101)


























