Web Crawling for SEO (Search Engine Optimisation)
In this guide we cover some of the basics of what a crawler is, but take a much deeper dive into how web crawling is used for SEO.
The principal focus of this guide is web crawling as a tool and its use in SEO, rather than how Google crawls or indexes websites.
SEOs are not the only people who use web crawlers; you can read one of our other guides to find out how these tools can help in other endeavours.
SEO Web Crawlers and Web Crawling
Web crawlers are scripts or programs that browse web pages and collect data from those pages in a systematic and automated way. This not only saves time but makes the job of collecting the data you need possible within the timespan of a human life!
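To make the idea concrete, here is a minimal sketch of the core of a crawler: extracting every link from a page so newly discovered URLs can be queued for the next fetch. It uses only the Python standard library, and the HTML snippet and example.com URLs are purely hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

# In a real crawler you would fetch each page (e.g. with urllib.request),
# feed its HTML to the parser, and queue the newly discovered links.
html = '<a href="/about">About</a> <a href="https://example.org/">Out</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(html)
print(parser.links)
```

A full crawler adds a visited set, politeness delays, and robots.txt handling on top of this loop; the link extraction step itself stays this simple.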
Nowadays SEO web crawler tools are much more common, and the marketplace is becoming more crowded, with many different tools and features available to users. This is because SEO has itself become ubiquitous among businesses with an online presence.
SEOs need to crawl websites to perform a wide range of tasks, and as such many companies are rising to meet the demand.
SEO Crawl Data
SEO, like all forms of online marketing, is data heavy; in fact, one of the biggest USPs (Unique Selling Points) of SEO is the fact that you can track everything. Hence anyone doing SEO needs data with which to make decisions, plan, strategise, optimise, or resolve issues.
In the context of crawl data for SEO, we are looking at data from a website, of which there are loads! We cover what you would use the data for in the sections below, but the list below covers most of the data commonly used in SEO that you would be able to get from a site. Our website crawler collects all of the data below:
Canonical Tag Data
- If a page is canonical
- Canonical URL
- Noindex tags
- If a page is indexable
Link Data
- External Links
- Follow In links
- Follow Outlinks
- In links
- Unique In links
- Unique Outlinks
- Page Depth
- Broken Links
Page Speed Data
- Response Time
- Size (KB)
Heading Tag & Meta Data
- H1 (First)
- H1 (Second)
- H2 (First)
- H2 (Second)
- H2 (Third)
- H2 (Fourth)
- H2 (Fifth)
- Other H tags
- Implemented GA Tracking
- UA Number (First)
- UA Number (Second)
- Meta Description
- Meta Description (Length)
- Meta Keywords
- Meta Keywords (Length)
- Page Title
- Page Title (Length)
XML Sitemap Data
- Linked from XML Sitemap
- List of all Sitemaps
Social Sharing Tag Data
- Google+ Tags
- OG Tags
- Twitter Cards
Content & URL Data
- Text Ratio
- Word Count
- File Type
- URL Length
Technical Site Audits
Technical site audits are a very common type of SEO process that aims to analyse a website for technical issues. Some technical issues can seriously impact a site’s visibility within Google and other search engines.
We would consider technical audits to be like a service on a car, the aim is to ensure all the nuts and bolts are where they should be and are nice and tight.
Many of the components you would analyse in a technical audit can impact indexation but do not necessarily improve performance. In order to do any of the work in a technical audit you will almost certainly need to crawl the site and collect the relevant data.
The following categories of crawl data would almost always be reviewed in a technical audit, where the aim is to bring these components in line with best practice:
You would check the indexability of pages, to do this you would look at the robots.txt file for ‘disallows’ to see what parts of the site won’t be crawled by search engines. You would also look at noindex tags, located in the source code of webpages, to see what pages won’t be indexed by search engines.
The charts above are taken from the indexation tab of Raptor’s web crawler technical SEO section.
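As a rough sketch of the robots.txt side of this check, the standard library’s `urllib.robotparser` can tell you which URLs a well-behaved crawler is allowed to fetch. The robots.txt content and example.com URLs below are hypothetical, used only to illustrate the mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any URL under a disallowed path will report as not crawlable.
for url in ("https://example.com/blog/post", "https://example.com/admin/login"):
    print(url, "crawlable:", rp.can_fetch("*", url))
```

Checking noindex is a separate step: you would scrape each page’s source for a `<meta name="robots" content="noindex">` tag (or an `X-Robots-Tag` response header), since robots.txt only controls crawling, not indexing.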
Canonical Tag Data
Canonical data is reviewed to ensure duplicate content is not an issue and that the canonical tag configuration on a site is set up properly; in extreme cases, misconfiguration can lead to indexation problems. Because the canonical tag stipulates the preferred version of content, it is important for showing the right content in the SERPs (Search Engine Result Pages).
The chart below shows the types of canonical issues detected on a site, you can find this chart and much more canonical data in our crawler software.
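A crawler detects this by reading the `<link rel="canonical">` tag from each page’s source. The minimal sketch below does just that with the standard library; the HTML snippet and URL are hypothetical.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Records the href of <link rel="canonical"> if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

html = '<head><link rel="canonical" href="https://example.com/page"></head>'
finder = CanonicalFinder()
finder.feed(html)
# A page is canonicalised elsewhere when this URL differs from the page's own URL.
print(finder.canonical)
```

Comparing the extracted canonical URL against the crawled URL is what lets a tool flag self-referencing, cross-referencing, and missing canonical tags.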
Checking the accessibility of pages and the site in general is also vital, not just to Google and other search engines, but to the users navigating a site.
Raptor’s crawling software provides you with a visual representation of your site’s accessibility:
Page Speed Data
The load time of a webpage is a highly significant ranking factor and should always be reviewed. Because of the proliferation of mobile web browsing, having your content load quickly on mobile devices as well as desktops or tablets is pivotal.
The chart below, taken from our web crawler tool shows the distribution of pages by load time in seconds:
Various things affect this, such as the size of a page in KB, the number and size of images, as well as a slew of rather technical server-side components. Crawl data can often highlight obvious or immediate issues with load times.
XML Sitemap Data
XML sitemaps are used in the indexation of a site and as such are very useful. Having these in place and correctly configured will make the indexation of a site more efficient. Listing all canonical indexable pages is best practice as is the exclusion of non-canonical and non-indexable URLs.
Ensuring that the XML sitemaps are not too big, in terms of both the number of URLs and file size, is a technical consideration; the sitemap protocol caps each file at 50,000 URLs and 50 MB uncompressed.
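A crawler compares the URLs it discovers against the URLs a sitemap lists. As a rough sketch, the standard library’s XML parser can pull the `<loc>` entries out of a sitemap; the sitemap content below is hypothetical.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap for illustration.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

print(len(urls), "URLs listed")
```

Once extracted, each listed URL can be checked against the crawl for indexability and canonical status, flagging non-canonical or noindexed URLs that should be removed from the sitemap.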
The length and composition of URLs could be reviewed in either a technical audit or an optimisation audit. Typically, things like the length of a URL would be considered a technical component, as would the use of uppercase characters, special characters, spaces and underscores.
Optimisation (SEO) Audits
To continue the analogy that we made for technical audits, an SEO audit is like tuning a car for performance. Rather than simply checking the indicators and brakes are working, optimising a site is more akin to putting racing tyres on a car or tuning the engine for acceleration and speed.
The principal purpose of an SEO audit is to optimise a site and each page for a set of target keywords. The output of this will often involve the following:
- Keyword mapping (assigning target keywords to pages)
- Optimising site components (see below) for those target keywords
To facilitate these, you will need to review the following crawl data:
Page titles are a powerful and direct algorithmic ranking factor and should contain the target keyword/s. Meta descriptions act as a sales message and encourage click through rate from the SERPs, which is an indirect ranking factor.
To review, rewrite, or optimise meta data you need to crawl the site and scrape it with a web crawler. This is especially relevant when mapping keywords to web pages, which can be done easily in a spreadsheet.
Looking for duplicate meta data can be a good initial indicator that there is a duplicate content issue with those pages.
The chart below is taken from the technical SEO analysis pages within Raptor’s web crawler and shows the number of error types detected with meta descriptions. We show this same data for page titles also.
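Finding duplicate meta descriptions in crawl data is a simple grouping exercise. The sketch below assumes a hypothetical crawl output mapping each URL to its meta description, and flags any description shared by more than one page.

```python
from collections import defaultdict

# Hypothetical crawl output: URL -> meta description.
crawl_data = {
    "https://example.com/red-shoes": "Buy shoes online.",
    "https://example.com/blue-shoes": "Buy shoes online.",
    "https://example.com/contact": "Get in touch with us.",
}

# Group URLs by their meta description text.
by_description = defaultdict(list)
for url, description in crawl_data.items():
    by_description[description].append(url)

# Any description used on more than one page is a duplicate candidate.
duplicates = {d: urls for d, urls in by_description.items() if len(urls) > 1}
print(duplicates)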
Social Sharing Tag Data
Although not a powerful ranking factor, having social media tags like OpenGraph optimised for keywords does help to build relevance throughout the site to those target keywords.
Understanding how much content is on a site and on each page, as well as the types of content (video, text, images, etc.), is useful to know. Optimising said content for keywords is a huge factor in getting your content ranked for those keywords.
The chart below is taken from our web crawler, and this shows the distribution of pages by word count:
Like most SEO components, using the keyword within the URL of a page can influence rankings in several ways. Firstly, it will drive relevance to the keyword, and secondly it will help with click-through rate, which is itself a ranking factor.
There are multiple components to consider with linking data, all of which can affect organic visibility.
The charts below show the linking data that we provide in our web crawler analysis pages:
Volume of Links
The volume of internal links that a page has affects the authority being passed to the page, and is also an indicator to Google of how important that page is.
The anchor text used in links helps to drive relevance to the page being linked to and hence keyword optimising these is commonplace.
Your external links are often seen as your link neighbourhood; you need crawl data from a site to know where it links out to. Poor-quality or irrelevant sites help to create a bad neighbourhood and, in general, should not be linked to.
Too many external links if not handled properly can also serve to leak authority.
Follow vs NoFollow
Follow links pass authority whereas nofollow links do not.
Crawl data will show how many links a page has that are follow or nofollow, and there are many reasons you would look at this data. For example, pages that you don’t want indexed do not need follow links, whereas indexable content should have authority passed to it.
Pages with very few follow links will often not rank very well; crawl data can help to identify these pages quickly and easily.
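Classifying links this way comes down to checking each anchor’s `rel` attribute for the `nofollow` token. A minimal sketch, using a hypothetical HTML snippet:

```python
from html.parser import HTMLParser

class FollowCounter(HTMLParser):
    """Counts follow vs nofollow links on a page.

    A link counts as nofollow if its rel attribute contains 'nofollow';
    rel can hold several space-separated tokens (e.g. 'sponsored nofollow').
    """
    def __init__(self):
        super().__init__()
        self.follow = 0
        self.nofollow = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "nofollow" in rel:
            self.nofollow += 1
        else:
            self.follow += 1

html = ('<a href="/a">x</a>'
        '<a href="/b" rel="nofollow">y</a>'
        '<a href="/c" rel="sponsored nofollow">z</a>')
counter = FollowCounter()
counter.feed(html)
print(counter.follow, counter.nofollow)
```

Aggregating these counts per page across a crawl is what lets you spot indexable pages starved of follow links.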
There are many header (H) tags; crawling a site to identify them can help you to discern the structure of content on a page and the nature of that content. Optimising these by using keywords within the H tags is also an SEO technique used to drive relevance.
Crawl data can also help to identify content that has been over optimised, and risks being seen as low quality, manipulative or against best practice.
Crawling Competitor Sites
Any web crawler will allow you to crawl a competitor site, but with Raptor we have structured our software to allow you to make projects. Within a project you can add competitors to a group aptly named ‘Competitors’. More than just crawling competitor sites, we also analyse this data for you and present you with both graphical visualisations and tables of data.
In so doing you can very easily see where your site/s sit within the competitive landscape in a number of areas. A good example of this is comparing the content on one site to that of the competitors:
- Volume of HTML pages
- Volume of images
- Word count
- Average number of words per page
This is super useful for the following activities:
Benchmarking
This is a technique used to see what the benchmarks for success are in any given area. For example, if your site has 20 pages and 2,000 words, this might put your site at the bottom of the pile. The competitor average might be 50 pages and 20,000 words, which means that in order to compete effectively you will need to generate content to match that average. This would make your site more competitive.
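The arithmetic behind that benchmark is straightforward once you have crawl totals per site. The figures below are hypothetical, chosen to match the example above:

```python
# Hypothetical crawl figures: (pages, total words) per competitor.
competitors = {
    "competitor-a.com": (60, 25_000),
    "competitor-b.com": (45, 18_000),
    "competitor-c.com": (45, 17_000),
}
your_site = (20, 2_000)

# Average the competitor figures to set the benchmark.
avg_pages = sum(p for p, _ in competitors.values()) / len(competitors)
avg_words = sum(w for _, w in competitors.values()) / len(competitors)

# The gap between your site and the benchmark is the content target.
page_gap = avg_pages - your_site[0]
word_gap = avg_words - your_site[1]
print(f"Need roughly {page_gap:.0f} more pages and {word_gap:.0f} more words")
```

With these hypothetical numbers the benchmark works out at 50 pages and 20,000 words, so the site in the example is roughly 30 pages and 18,000 words short of the competitor average.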
By comparing multiple aspects such as content and page speed with other data such as backlink data, you can build an SEO snapshot of what the strengths and weaknesses of your site are within the competitive landscape.
This type of snapshot can be used when pitching to potential clients or can inform the direction of your SEO strategy.
As mentioned above, competitive data like this can help to inform an SEO strategy. For example, the table below shows ‘Your Site’ and a bunch of competitors:
In this example, the obvious opportunity is in video content, as the competition is not effectively competing in this space. It is also going to be easier to generate 200K words of content than it is to up the domain authority from 23 to 37 (the next highest competitor score).
Scraping page titles and H1 headers from competitor sites is an easy way to see what keywords they target (assuming the site is optimised properly). You can extract keywords from headers and page titles as well as seeing the content they produce.
This data can inform your strategy and highlight holes or opportunities in your own site’s content. Because you don’t want to straight up copy your competitors, this data can also help you see the gaps in your competitors’ content.
You may notice when crawling multiple competitor sites that they all have certain types of content or target specific keywords. This can be an indication that you should have this content on your site and be targeting those keywords.
Google Penalty Removal
There are several reasons why Google might apply a penalty to a website; this is a big topic and too much to cover in depth in this guide. Various conditions that can result in a penalty are the result of on-page activity.
Keyword stuffing and hidden text are good examples, and crawling a site and analysing the crawl data can help to identify such issues.
The Power of Regular Crawls & Historical Crawl Data
Raptor’s web crawler stores all historical crawl data; regularly crawling a website means that you can see changes over time more easily. Over-time data is valuable when plotting the trajectory of certain metrics.
For example, plotting the volume of content (pages and word count), page speed, inaccessible pages, site errors, etc. can provide valuable SEO insights. This data can answer questions such as whether the site is improving or worsening from a technical perspective.
Tracking the same data mentioned above for competitor sites adds highly valuable context to your data. For example, the first chart below shows some word count data over 13 months for a site:
You might look at this and think, “great, we are growing our content”! However, when put in the context of the competitive landscape, you might come to a different conclusion:
Context is everything when it comes to data, the chart below shows the competitor average vs ‘your site’:
This type of analysis can be done on almost any metric and any set of sites to help you understand not only where you are but where you need to be to compete. This can help to provide answers to the following questions (and many more):
- Why are competitors beating us?
- How do we compete more effectively?
- What is the reason we are beating our competitors?
- Why were things better in the past?
Website Crawl Data & Analytics Data
Correlating crawl data such as content, as in the example above, with other types of analytics data such as traffic or ranking data is another form of adding context. Rather than competitive context, this can help to provide answers to the following questions (and many more):
- Why is traffic declining? (suddenly or slowly)
- Why is traffic increasing? (suddenly or slowly)
- Why has the site dropped out of the SERPs?
- What is working?
- What is not working?
- What value has your content provided?
- What value has the SEO investment delivered?
Web Crawlers for SEO Agencies & Consultants
Having more than a couple of clients means that you need to automate as much as possible to save time and direct your efforts into the areas with the biggest impact. A web crawler can provide a high degree of automation:
Pinging Alerts Out When Critical Issues Are Detected
This is very useful as it can act as an early warning signal, allowing you to fix an issue before Google detects it. Contacting clients about critical errors that you have detected is also great customer service, making your services invaluable.
Historical Data Can Help You to Resolve Issues
If a client site experiences a significant problem, having months of historical crawl data at your fingertips can be all you need to identify & resolve the problem.
Get Notified If the Client Makes Changes to Their Site Without Keeping You in the Loop
This happens more frequently than anyone would like, often clients will hire a web developer or in-house employee who decides to make sweeping changes.
Completely changing the URL structure without setting up proper redirects or revamping content without optimising it… This can all create a massive problem for the site’s SEO. Setting up an alert, or just checking the latest scheduled crawl data can put you in the loop.
Demonstrate the Value of Your SEO Work
This is something every SEO must do at some point; historical crawl data and competitive data can help you to demonstrate your value. This is especially useful when correlating analytics or ranking data with crawl data.
Conversely, if you are the client of an SEO agency or freelancer, many of these benefits will help you to determine the value of what your SEO people are delivering.