Robots.txt Files

Robots.txt

'Robots' (also known as bots, crawlers and spiders) are programs that traverse the web automatically in search of specified data. For example, search engines like Google use them to index the content on websites.

The robots.txt file provides instructions to crawlers (referred to as user agents within robots.txt files) regarding how and what to index on a website; examples of robots.txt rules are shown throughout this article.

As such, a robots.txt file can prevent search engines from crawling entire directories of a site or just specific pages. It can also specify which parts or pages of a site should be crawled.

Robots.txt Existence

Ensure that the website uses a robots.txt file. Checking this is as simple as trying to resolve www.example.com.au/robots.txt. A robots file is simply a plain text file and can be created in any non-mark-up text editor, such as Notepad on Windows.

If none exists, simply create a new file, save it with the filename ‘robots’ (as a .txt file) and add the opening line of code below. This is a blanket statement saying that the rules in the file apply to any and all user agents:
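
User-agent: *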

The robots.txt file should be located in the root of the domain, as this is where bots and crawlers will look for it; if it isn’t present there, they may not find it.

User Agent

The user agent part of the robots.txt specifies which user agents the following code applies to. You can require that only Google or Bing user agents adhere to the policy stated in the file. For example, to specify conditions for Google, open the robots.txt file with the following line:

User-agent: googlebot

Or, if you would like to specify conditions for all crawlers, apply the rules to all user agents by using the * wildcard:

User-agent: *

It is also possible to specify multiple user agents within a robots.txt and have unique conditions for each; we provide more complete examples of this later in this article.

Disallow

The ‘disallow’ directive stops the specified bots from crawling the specified pages or directories, which prevents those pages from being indexed by search engines. This line should follow the ‘User-agent:’ line described above.

Specifying pages that you do not want indexed reduces the time it takes Google to crawl your site, as no time is wasted on those pages, meaning Google can spend more time on the pages you do want indexed.

There are many reasons why you may not want pages or objects indexed, especially if they have no content, are not unique or do not contain relevant information for the SERPs.

The code below will automatically set every page and component of a site to not be indexed:

Disallow: /

Ideally, use specific pages, directories or wildcards to disallow the indexation of website content. For example, a common ‘disallow’ appears on sites that use WordPress for a blog, where the plugins folder can be prevented from being indexed using the code below:

Disallow: /wp-content/plugins/

Following on from the description of user agents, the example below shows how to specify which bots / user agents a disallow applies to:
User-agent: rogerbot

Disallow: /

“rogerbot” is the crawler used by Moz; the above example shows how to prevent that particular bot from crawling your site. To state conditions for multiple user agents, see the example below.

User-agent: rogerbot
Disallow: /
User-agent: googlebot
Disallow: /wp-content/plugins/

In the above example we have prevented Moz from crawling the whole website and we have specified that Google should not crawl the WordPress plugin directory. We could continue to add user agents and specify conditions for them as desired.

Allow

This acts in the opposite way to the ‘disallow’ directive, specifying which pages, directories, etc. should be indexed. Just as it is important to specify which pages you do not want indexed, it can be important to specify which pages you do want indexed.

The example below shows how to specify to all user agents that you want the whole site indexed:

User-agent: *
Allow: /

As with the disallow function you can list pages and directories, or add multiple iterations stating specifically what can be crawled.
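
As an illustrative sketch (the directory names below are purely hypothetical), the following rules block everything by default and then explicitly allow two directories to be crawled:

User-agent: *
Disallow: /
Allow: /public/
Allow: /blog/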

Combining Commands

We have already covered how to use allow and disallow; here is an example of how they can be used together:

User-agent: Googlebot 
Disallow: /folder-a/ 
Allow: /folder-a/file.html

This prevents Google from indexing the folder ‘folder-a’ but does allow the indexing of a specific file within that folder called ‘file.html’.

Other Commands

There is a huge range of commands that can be used in a robots.txt file to provide instructions to crawlers; however, it should be noted that crawlers do not have to obey these rules. Typically Google will adhere to them, but spammy crawlers will not. So it is often not worth the time finding a list of bad user agents and adding a disallow all for each of them, as they can choose to ignore it.

Wildcard

The * acts as a wildcard. For example, the following code would prevent all pages with a question mark (?) in the URL from being indexed:

Disallow: /*?

This enables you to state rules more efficiently, rather than specifying every page or directory, and can be useful for resolving duplicate content issues.
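
For instance, a sketch of how a wildcard rule might be used to keep parameter-driven duplicate URLs out of the index (the parameter name ‘sessionid’ is purely illustrative):

User-agent: *
Disallow: /*?sessionid=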

Matching

It is also possible to create a rule to disallow content from being indexed based on other factors, such as in the code below:

Disallow: /*.asp$

This states that any URL that ends with ‘.asp’ should not be indexed.
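
The $ marks the end of the URL, so the same pattern can be applied to other file types. As an illustrative sketch, the line below would prevent any URL ending in ‘.pdf’ from being crawled:

Disallow: /*.pdf$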

XML Sitemap Reference

Check to see if the robots.txt file includes a reference to the website's XML sitemap. It should include a line similar to the below:

Sitemap: http://www.example.com.au/sitemap.xml

This line in the robots.txt ensures that Google can find your XML sitemap and crawl it accordingly. If several XML sitemaps are used on a site, simply link to the sitemap index: a sitemap file that links to the other sitemaps, and it is this index that is referenced from the robots.txt. See the article on XML sitemaps or HTML sitemaps for more information.
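
As a sketch, such a reference to a sitemap index might look like the line below (the filename is illustrative and assumes an index file sitting in the domain root):

Sitemap: http://www.example.com.au/sitemap-index.xml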

Robots Meta Tags

In addition to the robots.txt file, it is also possible to add ‘robots’ Meta tags to the source code of individual pages. This specifies crawler access at page level, allowing you to stipulate how each page should be crawled and indexed.

Like the robots.txt file, the Meta tag version is an indicator and not a locked door, meaning that crawlers, especially malicious ones, can ignore it completely.

Like all Meta tags, they are located in the <head> section of the source code and use the Meta name ‘robots’; see the examples below:

<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, nofollow">
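
As a minimal sketch of where the tag sits within a page’s source code (the page content here is purely illustrative), assuming you want a page kept out of the index entirely:

<!DOCTYPE html>
<html>
<head>
  <title>Example Page</title>
  <!-- Tell crawlers not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  ...
</body>
</html>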

We cover robots Meta tags in more detail in another article, which you can read by following the link.

Common Mistakes

Due to the nature of the robots.txt file, improper usage can cause the whole site to be de-indexed by some or all search engines.

The biggest threat that comes with this is that it has no impact on how the site appears when viewed in a browser. The .htaccess file, which has just as much potential for damage, will typically cause visible problems such as returning errors or making the site inaccessible.

Robots.txt will not do this, so often the first sign that something is wrong is when traffic disappears or rankings drop through the floor, at which point you are into the damage control phase. Rankings and traffic will recover once the problem is resolved, but the speed of recovery is directly linked to how long the problem was in place.
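
As a concrete illustration, the two lines below, left in a live robots.txt by mistake, would instruct every crawler to stay away from the entire site while the site itself continues to look perfectly normal in a browser:

User-agent: *
Disallow: /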
