Types of Website Crawler
Website crawlers are used in many places all over the net and in many of the processes that make up digital marketing. Even if you don’t personally use a web crawler, you will indirectly benefit from them because of how prevalent they are in internet technologies. In this guide we cover the different types and uses of website crawlers around today.
A website crawler (or web crawler) is a piece of software that crawls and scrapes data from web pages, websites and the files that websites are built from.
Search Engine Website Crawlers
One of the most common and long-standing implementations of website crawlers is the kind used by search engines such as Google. Web crawlers form the foundation of search engines, as they are used to crawl and scrape the internet. This crawling and scraping leads to the indexing of web content, which in turn produces the search results you find when “Googling” something.
This type of website crawler is run on vast server farms that span countries and continents, and the data they scrape is stored in equally vast facilities that look more like warehouses. To scrape and store the enormous amount of data that exists on the internet, you need an enormous number of servers and hard drives.
This type of website crawler isn’t something you can directly control, however. Google and other search engines do provide tools that let you see, access and use some of the data they collect. In the case of Google, this is ‘Search Console’ (previously known as Webmaster Tools). Within this tool, you can request that a page be crawled or indexed, and submit sitemaps for the same purpose.
Google’s crawler is called ‘Googlebot’, and you can influence it by customising files and tags on your site, such as the robots.txt file and the noindex tag. But this is not the same as having direct control over the website crawler itself.
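For instance, a robots.txt file placed at the root of a site asks crawlers not to fetch certain paths. The paths here are purely illustrative:

```text
# robots.txt — lives at https://example.com/robots.txt
# Asks all crawlers (User-agent: *) to skip everything under /private/
User-agent: *
Disallow: /private/
```

And to keep an individual page out of the index, that page can include `<meta name="robots" content="noindex">` in its `<head>`. Note that compliant crawlers honour these signals voluntarily; they are requests, not enforcement.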
Personal Website Crawlers
A personal website crawler is something an individual or business might build for in-house use. They are often built to perform one or two specific jobs, like scraping data from the search results or monitoring whether specific webpages are down. These are typically good at a small, specific job but are likely to fail when deployed at scale.
There are many guides on the net that show you how to make a Python web crawler, for example, or a desktop-based program that runs from your local machine. As mentioned, these can work well on a small scale, such as scraping paid ads once or twice per day for a small set of keywords.
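As a sketch of what such a personal crawler might look like, the following uses only Python’s standard library to fetch pages, extract links and follow them breadth-first up to a small page limit. The structure and limits are illustrative, not a production design:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every link found in an HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url, visiting at most max_pages pages."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page — a real tool would log this
        queue.extend(extract_links(html, url))
    return seen
```

The `crawl` function illustrates why these tools struggle at scale: it is single-threaded, holds every URL in memory, and has no politeness delays or robots.txt handling, all of which a serious crawler needs.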
These types of website crawlers are limited in their scalability and functionality, but unlike those used by search engines, you do have control over them.
Commercial Website Crawlers
To gain access to greater control, more features, and a scalable solution you will need to use a commercially developed website crawler. These are tools available for purchase from software solution companies like Raptor.
Commercial software has been developed to overcome the limitations of smaller personal-use tools, often requiring huge amounts of development time, testing and real-world use. Website crawlers like this are more robust, come complete with a wide range of features, and are often able to meet many different needs rather than a single specific purpose.
That said, there are many flavours of commercial website crawler: some are sold as tools in their own right, while others are used to collect data for a broader product. We discuss these in greater detail in the following sections.
Desktop Website Crawlers
A desktop-based web crawler runs locally from your own home or office computer. While still a commercial product complete with extensive features, these tools are somewhat limited. Great for analysing small websites, they often need more knowledge and massaging to work on sites with tens or hundreds of thousands of pages.
Other downsides to desktop-based website crawlers are as follows:
· Data is stored locally on your machine
· Updates to the software or the OS (Operating System) can cause problems with functionality
· You need to be in front of the computer that has the software installed to use it
· Crawling large sites can take a long time or requires custom setup or additional products like proxy servers
The flip side is that many of these products are cheap, making them accessible to smaller businesses and freelancers who need to use them in a limited capacity.
Cloud-Based Website Crawlers
Otherwise referred to as Software as a Service (SaaS), these are typically more modern products that are run online. This means that you can access the software from any computer (or device) with an internet connection and your login details. Cloud-based tools take advantage of distributed networks and remote servers to both store your data and provide comprehensive scalability.
This is why Raptor uses a cloud-based approach to our software solutions. Compared to desktop-based website crawlers, this translates into the following benefits for the average user:
· You can access your data from anywhere anytime
· Set crawls running without needing to keep your computer on
· Crawl sites of any size without extra hassle
· Avoid issues with system or software updates
· Avoid issues with using varied operating systems such as Linux, Unix, etc
· Collaborative projects & endeavours are more efficient
· No need to buy multiple licenses for use on different machines
As you can see there are huge benefits to using a cloud-based vs a desktop-based technology. With Raptor you get all the following with any of our subscription plans:
· Actionable hints and recommendations
· Compare crawl data between any crawls
· Crawl large websites without having to massage the tool to do it
· Crawl on demand
· Organize crawls in projects
· Real-time crawl data
· Visual interactive interface
Cloud-based website crawlers are growing in popularity as a tool for SEOs and businesses with a vested interest in SEO data. We cover the uses of these tools in the following sections of this guide.
Website Crawling for SEO
One of the principal uses for commercial website crawlers is SEO (Search Engine Optimisation). There are various processes in SEO that would be impossible without a tool to automate data collection and aspects of the analysis. Unless a site is just a handful of pages, collecting data from it manually will take more time than it’s worth. On sites with thousands of pages, the process becomes impossible to do by hand.
Even when it comes to analysing the data, there is an intrinsic need for tools. Historically, people would use a spreadsheet and formulas to calculate things like how many pages have a specific error. Now, tools like the one offered by Raptor provide much of this analysis as part of the data-collection process. For example, it’s easy for a script to check whether a meta description is missing as the page is crawled.
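As a minimal illustration of that kind of check, a crawler can parse each fetched page and flag a missing or empty meta description with a few lines of standard-library Python:

```python
from html.parser import HTMLParser

class MetaAuditor(HTMLParser):
    """Captures the page's meta description, if one is declared."""
    def __init__(self):
        super().__init__()
        self.meta_description = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get("name", "").lower() == "description":
                self.meta_description = attr_map.get("content", "")

def missing_meta_description(html):
    """True when the page has no meta description, or it is empty."""
    auditor = MetaAuditor()
    auditor.feed(html)
    return not auditor.meta_description
```

A full audit tool runs dozens of checks like this per page (title length, heading structure, canonical tags, and so on) as the crawl progresses, which is exactly the analysis a spreadsheet used to do after the fact.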
As such, web crawlers and SEO tools have become an indispensable part of daily life for people doing SEO. The sections below discuss in greater depth the specific processes and uses of website crawlers within SEO.
SEO Audits
SEO audits can take many forms but typically fall into one of two categories (if not both at the same time).
The first is a technical website audit, where you are looking to identify technical errors or problems with a website. Whether the website crawler provides you with raw data alone or an interface for navigating an analysis of technical problems, a tool is essential.
The second is an optimisation audit, where the aim is to optimise the site for target keywords, with recommendations made at the page-component level to improve relevance for those keywords.
In both cases, you will need a complete list of the site’s URLs, with the SEO components mapped to those pages, to perform the audit or make recommendations. Raptor’s web crawler enables you to do both with ease.
Competitor Analysis
There are many ways to analyse competitors, and some of them require you to crawl the competitor sites and analyse the data. You would do this to look for weaknesses and opportunities, or for benchmarking.
Raptor allows you to add competitors to projects, crawl those sites and see how your site compares. This is useful for many reasons, such as benchmarking or collating data for a pitch or the top-level of an SEO strategy.
Monitoring & Alerts
Monitoring specific pages, such as the homepage of a site, on a regular basis (say, once every 30 minutes) allows you to ensure uptime. Combined with automated email or text alerts, you can be notified quickly when a page is determined to be down. For SEO agencies, this is a valuable use of website crawling technology, as it means you can notify clients of major site issues.
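A monitoring check of this kind can be sketched in a few lines of standard-library Python (the alerting and the 30-minute scheduling, which would come from cron or a task scheduler, are left out). Here, server errors and unreachable hosts count as downtime:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def status_of(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        return urlopen(url, timeout=timeout).getcode()
    except HTTPError as err:
        return err.code   # server responded with an error status
    except URLError:
        return None       # DNS failure, timeout, connection refused, etc.

def is_down(status):
    """Treat server errors (5xx) and unreachable hosts as downtime."""
    return status is None or status >= 500
```

A scheduler would call `status_of` for each monitored page and trigger an email or text alert whenever `is_down` returns True.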
Website Migrations
When migrating a website to a new design, CMS or URL structure, you will need to capture a complete list of URLs both before and after the migration. Comparing the two crawls, or crawling a defined list of URLs to check that they redirect as intended, allows you to identify errors.
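One way to sketch that comparison: given a mapping of old URLs to their intended new locations, and a function that follows redirects and reports where a URL actually lands (in a live check, something like `lambda u: urlopen(u).geturl()`), report every URL that ends up in the wrong place. The URLs below are illustrative:

```python
def verify_redirects(url_map, resolve):
    """Compare where each old URL actually lands against where it should.

    url_map: {old_url: expected_new_url}
    resolve: callable that follows redirects and returns the final URL.
    Returns a list of (old_url, expected, actual) tuples for mismatches.
    """
    errors = []
    for old, expected in url_map.items():
        actual = resolve(old)
        if actual != expected:
            errors.append((old, expected, actual))
    return errors
```

Injecting the `resolve` function keeps the comparison logic testable without a live site; the migration crawl supplies the real redirect targets.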
Website Crawling for SEO Tools
There are a couple of very common uses for website crawling software in SEO tools where you do not use the crawler directly.
Some SEO tools use website crawlers to collect the data that they provide, for example a backlink analysis tool will crawl billions to trillions of webpages to build a database of inter-site links. With tools like this, the product being sold is the data rather than the crawler itself. The crawler simply facilitates the collection of the data.
In much the same vein as backlink data, keyword ranking tools use a website crawler to collect data from search engines. The crawler scrapes data from the search results based on a set of keywords defined by the user and the specific site being checked (also provided by the user). These tools typically come with a range of options, such as:
· What search engine you want to check
· What region you are interested in
· What devices you want to see results for
· What keywords you want to check rankings for
· What site/s you want to check rankings for
In this respect you are controlling a limited number of options that determine what the crawler does. Some tools just crawl millions of keywords across every region and device constantly and then sell you access to the data.
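To make that concrete, the options above can be thought of as parameters the rank-checking crawler folds into each search request. This sketch builds such a request URL; the parameter names and defaults here are assumptions for illustration, not any real search engine’s or tool’s API:

```python
from urllib.parse import urlencode

def build_serp_query(keyword, search_engine="google.co.uk",
                     region="uk", device="mobile"):
    """Assemble the kind of URL a rank-checking crawler might request.

    All parameter names are hypothetical — real engines and rank-tracking
    tools each define their own query interfaces.
    """
    params = {"q": keyword, "gl": region, "device": device}
    return f"https://www.{search_engine}/search?" + urlencode(params)
```

The crawler would then fetch each such URL for every keyword the user defined, and parse the returned results page for the tracked site’s position.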
Website Crawling for PPC
Website crawling tools are also a valuable asset for people doing PPC (Pay Per Click) such as when using Google Ads, although not as prevalent in day-to-day work. For example, when building a campaign or account you may want to use a crawler for any of the following tasks:
· Evaluate the structure & content of a website
· Find relevant landing pages
· Identify the target keywords of a site
· Make recommendations to optimise landing pages for target keywords
Website Crawling for PPC Tools
More common is the use of web crawlers in PPC tools. Used in a similar way to those in SEO, these tools collect data and sell you access to it. For example, tools like Marin and Search Metrics use website crawling technology to collect data from the SERPs about ads. They feed this data into algorithms that assess it to provide competitive insights or recommendations.