Getting your site indexed by Google is a fundamental first step on the path to organic visibility. In this guide, we discuss the various components that influence how, or whether, your site’s content gets indexed.

This is a more advanced topic, but we have broken it down into small, bite-size chunks that should make it accessible to anyone with some knowledge of SEO. If you are new to SEO, you may benefit from reading our introduction to SEO first.

Index & Indexation

An index is a record or structured list of data; in the context of SEO, this refers to Google’s index of webpages. This is simply a list of all the webpages that Google crawls, and it is from this that Google builds its SERPs (Search Engine Results Pages). Hence you need your webpages to be in Google’s index in order to appear in the SERPs.

Thus, the process of getting your site or pages indexed is typically referred to under the umbrella term ‘indexation’. Any issues or problems relating to your site being indexed are referred to as indexation issues or problems.

Google Do What They Want Regardless!

It’s important to note that, despite everything you do, Google will do what they want regardless. If a piece of content is set to noindex, but it’s hugely popular, well linked to, etc… you may find that they will still index and rank it.

We have seen pages canonically linking to other pages that rank lower than them! All of these instructions can be ignored in rare cases. But these are outliers and are not ‘the norm’.

You can significantly improve your indexation strategy by avoiding conflicts and knowing when to use a technique or which method is best for the job. We provide examples of this in each of the sections below.


Robots!

The robots.txt is a plain text file that sits at the root of the domain and contains instructions for Search Engine Crawlers (or robots).

Among other things, the robots.txt controls which pages are crawled by Search Engines (like Google). From here you can specify what you do (allow) or don’t (disallow) want to be crawled… Whether it’s specific pages, whole directories, file types, or sub-domains, you can stipulate whether you want them to be crawled or not.

Read more about the robots.txt file.
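To illustrate, a minimal robots.txt might look like the following. The paths shown are hypothetical examples, not recommendations for any particular site:

```text
# Rules below apply to all crawlers
User-agent: *

# Block a whole directory
Disallow: /admin/

# Block a single page
Disallow: /checkout.html

# Block a file type (Google supports the * and $ wildcards)
Disallow: /*.pdf$
```

Note that each sub-domain is treated as its own host, so it needs its own robots.txt file at its own root.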


When to use

Use this if you want to prevent Search Engines from crawling (visiting) pages or resources on your site.


Pros:

  • This reduces time spent by Search Engines crawling resources that you do not want them to crawl and can help to optimise your crawl budget.
  • It’s easy to manage large volumes of pages this way, as they can all be managed from a single location… If the site is logically structured, it can be even easier to disallow content, so this can be a huge time saver.


Cons:

  • It’s very easy to make a mistake that can affect the whole site or large portions of it.
  • Disallowing pages means that robots won’t crawl them… This means that canonical tags, meta robots tags, redirects and internal links will be completely ignored. Hence this method will not work in conjunction with any other method discussed on this page.
  • Due to the ease and sweeping global nature of the file and its commands, it’s common for conflicts to emerge between this and more granular indexation components.

Meta Robots Index / Noindex Tags

This is a page level tag, in that if you want to implement it, you put it on individual pages, unlike the robots.txt which is a single file.

This tag instructs search engine crawlers whether to index the page or not… A noindex tag indicates that you do not want this page to be indexed by Google or appear in the SERPs.

Read more about the Meta Robots.
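As a quick sketch, the tag sits in the page’s <head>. The example below shows a noindex directive; a page you do want indexed can simply omit the tag or use "index" instead:

```html
<head>
  <!-- Ask crawlers not to index this page (it can still be crawled) -->
  <meta name="robots" content="noindex">

  <!-- Or target Google's crawler specifically -->
  <meta name="googlebot" content="noindex">
</head>
```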

When to use

Every page should have a tag on it that determines whether the page should be indexed or not.
Use the noindex tag to stop pages from being indexed, while still allowing Search Engine robots to crawl them.


Pros:

  • Robots will still crawl noindex pages, which means that canonical links, internal links, etc. will still be considered… allowing authority to pass on to other pages.


Cons:

  • This is very granular (page level), so on large sites this will need to be automated or systematically managed.
  • Robots will still crawl these pages, which can be a waste of your crawl budget.

Meta Robots Follow / Nofollow Tags

This is a page level tag, in that if you want to implement it, you put it on individual pages.

This tag instructs search engine crawlers whether to ‘follow’ the links on this page… This is not intrinsically linked to indexation, but if you implement a Nofollow meta robots tag on every page of your site, you may find that you stop ranking nonetheless.

Read more about the Meta Robots.
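For illustration, a page-level nofollow directive looks like this (here combined with "index", so the page itself can still be indexed):

```html
<head>
  <!-- Index the page, but do not follow any of the links on it -->
  <meta name="robots" content="index, nofollow">
</head>
```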

When to use

Be sparing in your use of this tag; links are ‘follow’ by default, so there is very little need to specify that all links should be followed.

Only use the Nofollow tag when you want to prevent Google from following all of the links on the page.


Pros:

  • Quick way to make a lot of links ‘nofollow’
  • Compatible with other meta robots and canonical tags


Cons:

  • Very few legitimate scenarios for implementation

Canonical Tags

This is a page level tag, in that if you want to implement it, you put it on individual pages.

Canonical tags contain a link to the ‘canonical’ version (preferred version, original source) of the content. In most cases this will be a self-referential tag meaning that the link will point to the URL of the page where it is located.

However, there are many scenarios where you may want to canonically link to other content, which will result in around 90% of the page authority being passed into the canonical version.

By canonically linking to another page, you are telling Google that the canonical page is the one that should be ranked. This is an indication that you do not want the non-canonical page to be indexed, but it is only an indication, not a directive.

Read more about the Canonical Tags.
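As a sketch, the canonical tag also sits in the page’s <head>. The URLs below are hypothetical, showing a duplicate product variant pointing at its preferred version:

```html
<!-- On https://www.example.com/red-widgets/?size=large -->
<head>
  <!-- Point this variant at the preferred (canonical) URL -->
  <link rel="canonical" href="https://www.example.com/red-widgets/">
</head>
```

On the preferred page itself, the same tag would simply point at its own URL (a self-referential canonical).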

When to use

Every page should have a canonical tag on it.
Use this to prevent or remove canonicalization problems.


Pros:

  • The single most effective way to manage canonicalization issues
  • Page level granularity
  • Compatible with all other components (except for a robots.txt disallow)


Cons:

  • If implemented systematically, there is potential for critical issues… For example, adding an identical canonical tag to every page of a site can be disastrous.

Links Follow / Nofollow

Similar to the meta robots follow / nofollow tag, this stipulates whether a link should be followed (and pass on authority)… But this code is applied to specific links, not to the whole page and all of the links contained therein.

Read more about the Internal Linking.
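For illustration, the attribute is added to the individual link element. The URLs are hypothetical:

```html
<!-- A normal link: followed by default, passes authority -->
<a href="/category/widgets/">Widgets</a>

<!-- A nofollow link: crawlers are asked not to follow it or pass authority through it -->
<a href="/basket/" rel="nofollow">View basket</a>
```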

When to use nofollow

  • When linking to noindex pages
  • When linking to non-canonical pages
  • When linking to disallowed pages
  • When linking to external sites
  • When linking to resources that don’t need authority to be passed onto them


Pros:

  • Effective method for managing authority flow and distribution throughout your site or to specific resources
  • Compatible with canonical tags, meta noindex tags and xml sitemaps


Cons:

  • Because this is about as granular as you can get, adding this tag to all links that require it can be time consuming

XML Sitemaps

XML sitemaps should contain a concise list of all of your ‘canonical & indexable’ pages; Google uses this as a principal means of identifying your site’s content.
However, simply including a page does not ensure that it will be indexed, any more than excluding a page ensures that it will not be.

Read more about the XML Sitemaps.
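A minimal sitemap follows the sitemaps.org protocol; the URLs below are hypothetical examples:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List only canonical, indexable URLs -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/red-widgets/</loc>
  </url>
</urlset>
```

The file is usually placed at the root of the site and submitted to Google via Search Console.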

When to use

You should always have at least one xml sitemap listing all indexable canonical pages.


Pros:

  • Used for submitting to Google
  • Helps with indexing



Indexation Conflicts

Now that you’ve brushed up on influencing factors affecting indexation, you may have spotted that there is significant room for conflicts to arise.

This is because there are so many ways that you can stipulate similar things: crawling, indexing, canonicalization, etc. Although these components may seem similar, and indeed are, they are not identical.

Conflicts have high potential to cause indexation problems on your site, which, depending on their scale, can range from minor to critical. It’s too broad a subject to cover every eventuality, so we have provided some common examples of these conflicts below.

Disallow & Index

An obvious conflict is where a resource is prevented from being crawled but is also requesting to be indexed. Google will not index the content of pages they cannot crawl.

Disallow & noindex

This is not an obvious one, but if a page was indexed and is then disallowed in the robots.txt at the same time as a noindex tag is added… You may find that the page still appears in the SERPs.

This is because Google will not crawl the page to find the newly added noindex tag. Hence this is only a problem if you want to remove something from the index.

Typically adding a noindex tag to a disallowed page just doubles down on the concept of not letting it show in the SERPs.
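To make the conflict concrete, here is a hypothetical combination (the path is an example only):

```text
# robots.txt — blocks crawling of the page
User-agent: *
Disallow: /old-page.html
```

```html
<!-- On /old-page.html — this tag will never be seen,
     because the robots.txt above prevents the page from being crawled -->
<meta name="robots" content="noindex">
```

If the goal is to remove /old-page.html from the index, the disallow rule would need to be lifted first so that Google can crawl the page and discover the noindex tag.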

Index & External Canonical Link Reference

Although you can legitimately use both of these tags together, it is always worth checking that the implementation is intentional whenever you discover them configured in this way.

Correct Example:

You have a product page that creates new URLs when the user selects different sizes or colours… Because the majority of the content stays the same on each URL, you may want to canonicalize all pages to one URL… But you have no problem with Google indexing the non-canonical content and showing it in the SERPs.

Incorrect Example:

You have created some custom landing pages for a paid campaign and have copied the bulk of the content from another page… You decide to canonicalize the custom landing pages to the page from which the content was copied, but you leave an ‘index’ meta robots tag on it.

This is incorrectly implemented, because you are allowing these pages to be indexed when there is no legitimate reason to do so.

Noindex & Self-Referential Canonical Link

In almost every case that we can think of, you would always want to canonically link to another page from pages that are not indexable. This passes on the maximum amount of authority without using a redirection.

Disallow & Any Canonical Reference

If you disallow a page from being crawled, you remove the need for, and functionality of, the canonical tag, as the page will not be crawled and hence the canonical tag will not be found.

If you want ‘Page A’ to canonically link to another page, and do not want the ‘Page A’ to be indexed, use a noindex tag instead of disallowing.
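A sketch of that recommendation, using hypothetical URLs: ‘Page A’ carries both a noindex tag and a canonical link pointing at the page you do want ranked:

```html
<!-- On Page A (e.g. https://www.example.com/page-a/) -->
<head>
  <!-- Keep Page A out of the index... -->
  <meta name="robots" content="noindex">
  <!-- ...while indicating the page that should be ranked instead -->
  <link rel="canonical" href="https://www.example.com/page-b/">
</head>
```

Because Page A is not disallowed, Google can still crawl it and see both tags.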

