- Robots.txt Existence
- User Agent
- Combining Commands
- Other Commands
- XML Sitemap Reference
- Robots Meta Tags
- Common Mistakes
'Robots' (also known as bots, crawlers and spiders) are programs that traverse the Web automatically in search of specified data. For example, search engines like Google use them to index the content on websites.
The robots.txt file provides instructions to crawlers (referred to as user agents within robots.txt files) about what they may crawl and index on a website.
As such, a robots.txt file can prevent search engines from crawling directories of the site or just specific pages. Robots.txt can also specify what parts or pages of a site should be crawled.
Ensure that the website utilises a robots.txt file. Checking this is as simple as trying to resolve www.example.com.au/robots.txt. A robots.txt file is literally a plain text file, and one can be created in any non-mark-up text editor such as Notepad on Windows.
If none exists, simply create a new file, save it with the filename 'robots.txt', then add the opening line of code below (this is a blanket statement saying that the rules of this file apply to any and all user agents).
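A minimal sketch of that opening line; the asterisk is a wildcard matching every crawler:

```
User-agent: *
```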
The robots.txt should be located in the root of the domain as this is where bots and crawlers will look for the file, if it isn’t present there they may not find it.
The user-agent line of the robots.txt specifies which user agents the code that follows applies to. It can be specified that just Google or Bing user agents adhere to the policy stated in this file; for example, to specify conditions for Google, open the robots.txt file with the following code:
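Google's main crawler identifies itself as Googlebot, so a Google-specific section opens like this:

```
User-agent: Googlebot
```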
Or, if you would like to specify conditions for all crawlers, you can apply the rules to every user agent by using the * wildcard, as shown below.
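For example:

```
User-agent: *
```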
It is also possible to specify multiple user agents within a robots.txt and have unique conditions for each; we provide more complete examples of this later in this article.
The ‘disallow’ code disallows specified bots from crawling the specified pages or directories. This prevents those pages from being indexed by Search Engines. This code should follow the ‘user-agent:’ code described above.
Specifying pages that you do not want indexed reduces the time it takes Google to crawl your site, as the crawler does not waste time on pages you do not want indexed, meaning Google can spend more time indexing the pages that you do want indexed.
There are many reasons why you may not want pages or objects indexed, especially if they have no content, are not unique or do not contain relevant information for the SERPs.
The code below will automatically set every page and component of a site to not be indexed:
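A sketch of such a blanket block; the lone slash matches every URL on the site:

```
User-agent: *
Disallow: /
```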
Ideally, use specific pages, directories or wildcards to disallow the indexation of website content. For example, a common 'disallow' on a site that uses WordPress for a blog prevents the plugins folder from being indexed, using the code below:
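A sketch, assuming the default WordPress plugin path of /wp-content/plugins/:

```
User-agent: *
Disallow: /wp-content/plugins/
```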
Following on from the description of user agents, to specify which bots / user agents a disallow applies to, see the example below:
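For instance, a section aimed at a single named crawler:

```
User-agent: rogerbot
Disallow: /
```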
"rogerbot" is the crawler used by Moz; the above example shows how to prevent that particular bot from crawling your site. To state conditions for multiple user agents, see the example below.
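A sketch combining two user-agent sections, each with its own rule:

```
User-agent: rogerbot
Disallow: /

User-agent: Googlebot
Disallow: /wp-content/plugins/
```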
In the above example we have prevented Moz from crawling the whole website and we have specified that Google should not crawl the WordPress plugin directory. We could continue to add user agents and specify conditions for them as desired.
This acts in the opposite way to the 'disallow' parameter, specifying what pages, directories, etc. should be indexed. Just as it is important to specify what pages you do not want indexed, it can be important to specify what pages you do want indexed.
The example below shows how to specify to all user agents that you want the whole site indexed:
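For example:

```
User-agent: *
Allow: /
```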
As with the disallow function you can list pages and directories, or add multiple iterations stating specifically what can be crawled.
We have already covered how to use 'allow' and 'disallow'; here is an example of how they can be used together:
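A sketch using the illustrative folder and file names described below:

```
User-agent: Googlebot
Disallow: /Folder-a/
Allow: /Folder-a/file.html
```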
This prevents Google from indexing a specific folder ‘Folder-a’ but does allow the indexing of a specific file within that folder called ‘file.html’.
There is a huge range of commands that can be used in a robots.txt file to provide instructions to crawlers; however, it should be noted that crawlers do not have to obey these rules. Typically Google will adhere to them, but spammy crawlers will not. So it is often not worth the time finding a list of bad user agents and adding a disallow-all for each, as they can choose to ignore it.
Use of a * acts like a wildcard, for example you could state the following code which would prevent all pages with a question mark (?) in the URL from being indexed:
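A sketch of such a rule; the * matches any sequence of characters, so this disallows any URL containing a question mark:

```
User-agent: *
Disallow: /*?
```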
This enables you to more efficiently state rules rather than specifying every page or directory and can be useful for removing duplicate content issues.
It is also possible to create a rule to disallow content from being indexed based on other factors, such as in the code below:
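A sketch of a pattern rule; the $ anchors the match to the end of the URL:

```
User-agent: *
Disallow: /*.asp$
```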
This states that any URL that ends with ‘.asp’ should not be indexed.
Check to see if the robots.txt file includes a reference to the website's XML sitemap. It should include a line similar to the below:
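A sketch, reusing the example domain from earlier; the actual sitemap URL will differ per site:

```
Sitemap: https://www.example.com.au/sitemap.xml
```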
This line of code in the robots.txt ensures that Google can find your XML sitemap and crawl it accordingly. If several XML sitemaps are used on a site, simply link to the sitemap index: a sitemap file that links to the other sitemaps, and it is this index that is referenced from the robots.txt. See the article on XML sitemaps or HTML sitemaps for more information on this.
In addition to the robots.txt file, it is also possible to add 'robots' Meta tags to the source code of individual pages. This specifies crawler access at page level, allowing you to stipulate how each page should be crawled and indexed.
Like the robots.txt file, the Meta tag version is an indicator and not a locked door, meaning that crawlers, especially malware, can ignore it completely.
Like all Meta tags, they are located in the <head> section of the source code and use the Meta name 'robots'; see the examples below:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, nofollow">
We cover robots Meta tags in more detail in another article.
Due to the nature of the robots.txt file, improper usage can cause the whole site to be de-indexed by some or all search engines.
The biggest threat that comes with this is that it has no impact on how the site appears when viewed in a browser. The .htaccess file, which has just as much potential for damage, will typically cause visible problems such as returning errors or making the site inaccessible.
Robots.txt will not do this, so often the first sign that something is wrong is when traffic disappears or rankings drop through the floor, at which point you are into the damage control phase. Rankings and traffic will recover once the problem is resolved, but the speed of recovery is directly linked to how long the problem persisted.