The Ultimate Guide to Web Crawling
Whether you are new to web crawling or already familiar with it, this guide explains what web crawling is, how it works, why it’s necessary and how you can benefit from it.
What is Web Crawling?
Web crawling is the automated process of browsing web pages on the internet and is a fundamental part of many core aspects of the internet. For example, Google and all search engines crawl web pages in order to build a list of pages that they can show in the search results.
When you search in Google for anything, every result that you see and can click on first had to be crawled.
It is not just search engines that crawl web pages; many sites and tools perform this job, for a range of reasons.
So, What is a Web Crawler?
The job of web crawling is performed by a program called a web crawler, which is a piece of software designed to crawl the pages of a website. In the back end, a web crawler is a set of scripts and code that define the parameters and nature of the crawl. Raptorbot, which is our web crawler, uses scripts and code which look like this:
Of course, you will not see this when using our tool or Google. Instead, the crawling is managed through an interface known as a GUI (Graphical User Interface).
In the case of Google, the crawling is all performed behind the scenes for web indexing, but you can get some visibility of this by using tools like Google Search Console. Search Console is an interface that provides you with information about how Google crawls your site, and various other search engines provide a comparable tool.
We cover Google Search Console in more detail in another guide, but this interface will show you:
- How many pages of a site Google’s web crawler (Googlebot) crawls per day
- How many kilobytes are downloaded as a result of the crawl over time
- The time spent crawling a site
Although you can’t directly control Google’s web crawler, you can adjust and optimise your site to affect these metrics.
At Raptor we provide a web crawler that you can control and use to crawl whatever you want. Web crawlers come in various types, performing the same job for different purposes.
Crawling & Scraping
Web crawling itself is not the end game; typically a web crawler will also scrape data from the web pages that it crawls. By this we mean that when a web crawler crawls a web page, it collects information from and about that page. This process of collecting data is called ‘scraping’ or ‘web scraping’.
The screenshot above has three arrows that highlight some examples of the data that a web crawler might scrape:
Purple arrow = shows the page title or meta title for this web page
Green arrow = shows the URL of the page
Yellow arrow = shows the H1 Header
These are examples of the data a web crawler will collect (scrape) from a page, but crawlers are not limited to just this… A web crawler can scrape pretty much any content, data or component from a web page. Some of this data may not be visible to you when viewing a web page, as it is scraped from the source code of the page.
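To make scraping concrete, here is a minimal sketch using only Python’s standard library. It is an illustration, not Raptorbot’s code: the `PageScraper` class and `scrape` function are names we made up for this example. It pulls the page title, H1 headers and meta description out of a page’s source code.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects the title, H1 headers and meta description from HTML source."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1s = []
        self.meta_description = None
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Only present in the source code, never shown on the rendered page.
            self.meta_description = attrs.get("content")

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if tag == "title":
                self.title = text
            else:
                self.h1s.append(text)
            self._current = None


def scrape(url, html):
    """Return the scraped fields for one page as a dictionary."""
    scraper = PageScraper()
    scraper.feed(html)
    return {"url": url, "title": scraper.title, "h1": scraper.h1s,
            "meta_description": scraper.meta_description}
```

Calling `scrape(url, html)` on each crawled page gives one row of data per URL, which is the shape most crawl reports start from.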
How do Web Crawlers Work?
Because web crawlers are software, and no two pieces of software are the same, there will always be differences in the details of how web crawlers crawl web content. That said, broadly speaking all crawlers work in a very similar way; the diagram below shows this process at its most basic:
Raptorbot, for example, performs this in the following way and order:
- Look for a sitemap/s
- Crawl sitemap/s & extract all links
- Build a URL list of pages we will crawl
- Crawl links from sitemap/s
- Add any new links found on each page crawled to the list of links to crawl
- Rinse and repeat until the whole site has been crawled and all data scraped
If we cannot find a sitemap, we simply start by crawling the home page of a site.
The links found and the data we scrape are all stored in a database, where they can be queried or called into reports, downloads, tables, charts and other visualisations.
There are other aspects to this process such as de-duplicating the list of URLs so that we are not crawling the same URL multiple times. There are also parameters such as setting the crawler to not crawl links that point to another domain, otherwise every crawl we do would result in us trying to crawl the entire internet!
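The loop described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Raptorbot’s implementation: `crawl` and `LinkExtractor` are hypothetical names, and `fetch` is a function you supply that returns a page’s HTML (or `None` if the page cannot be retrieved). The `seen` set handles de-duplication and the host check stops the crawl from wandering onto other domains.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_urls, fetch, max_pages=1000):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None on failure."""
    allowed_host = urlparse(start_urls[0]).netloc
    queue = deque(start_urls)   # e.g. the links from a sitemap, or just the home page
    seen = set(start_urls)      # de-duplication: each URL is queued only once
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL without #fragment
            if urlparse(link).netloc != allowed_host:
                continue  # do not follow links that point to another domain
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch testable without touching the network; a real crawler would plug in an HTTP client and store the results in a database rather than a dictionary.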
What Cannot Be Crawled?
There are certain URLs that are uncrawlable to every web crawler, for example:
- A URL that requires a login to access
- A URL that does not exist
- A URL that is inaccessible for any reason
- A malformed URL
- A site using technology that prevents web crawlers
Typically, these will show up in reports with a status code or response code that indicates the reason why they cannot be crawled. For example, a 404 error means that the page cannot be found, which means that a link somewhere on the site points to an incorrect URL.
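As a rough illustration of how a report might turn status codes into reasons, the sketch below maps a code to a short verdict. The function name `describe_status` and the wording of each verdict are our own assumptions for this example; the status phrases themselves come from Python’s standard `http.HTTPStatus`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, report-friendly reason for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown status"
    if 200 <= code < 300:
        verdict = "crawlable"
    elif 300 <= code < 400:
        verdict = "redirect"
    elif code in (401, 403):
        verdict = "blocked, login or permission required"
    elif code in (404, 410):
        verdict = "broken link, the page does not exist"
    elif 400 <= code < 500:
        verdict = "client error"
    elif 500 <= code < 600:
        verdict = "server error"
    else:
        verdict = "unexpected response"
    return f"{code} {phrase}: {verdict}"
```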
Making a site or a page uncrawlable to Google or other search engines will result in the site not being shown in the search results of said search engines. This is typically bad news for website owners but there are circumstances under which this is desirable.
Ethical Web Crawling
For the most part, site owners don’t want you to crawl a site you do not own or work for, unless the crawler is a search engine trying to find and index their content… Bots can skew traffic reports, take up server processing power and in extreme cases can slow down websites for actual users.
Some sites may not want you to crawl and scrape their prices or rates, as competitors can use this to systematically undercut those prices.
Following on from what cannot be crawled, a different question arises about what should not be crawled. We mentioned that there are web technologies, from companies such as Incapsula or Datadome, that are designed to prevent web crawlers like Raptorbot from crawling websites.
These technologies are often used to prevent competitors from crawling a site to scrape pricing data or other data. It is entirely possible to circumvent this and crawl anyway, but doing so could be considered unethical.
Other server-side technologies exist that perform a similar function, where certain patterns of web page browsing, once identified, will be blocked. If, for example, you crawl a site asynchronously (meaning you crawl multiple pages at the same time) from the same IP address, you will look like a bot.
If you are using Raptor, you will need to add our IP and user agent (Raptorbot) to a whitelist so that this tech does not prevent us from crawling your site, as we only crawl ethically. If a site actively prevents us from crawling it, we will not crawl it.
Crawlers are often referred to as ‘bots’ which is short for ‘robots’…
Most websites also use a robots.txt file. This is a text file that can stipulate to bots what they should and shouldn’t crawl. For example, you could add the following text to a robots.txt file to tell Google not to crawl a specific directory:
You can also request that any bot or crawler not crawl your whole site:
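Both kinds of request can be sketched as robots.txt rules and checked with Python’s standard `urllib.robotparser` module. The `/private/` directory and the example.com URLs are hypothetical, chosen only for illustration, and `allowed` is a helper name of our own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Googlebot out of one directory.
BLOCK_DIRECTORY = """\
User-agent: Googlebot
Disallow: /private/
"""

# Hypothetical rules: ask every bot to stay away from the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to crawl `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(BLOCK_DIRECTORY, "Googlebot", "https://example.com/private/page.html"))
print(allowed(BLOCK_DIRECTORY, "Googlebot", "https://example.com/blog/"))
print(allowed(BLOCK_EVERYTHING, "Raptorbot", "https://example.com/"))
```

The first and third checks are refused and the second is permitted, which is how a crawler that respects robots.txt would decide what to skip.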
Whether a web crawler respects this is up to the person doing the crawling, and ignoring it is not really regarded as unethical. However, you may choose to crawl a site and ‘respect’ the robots.txt file to see how Google would crawl it, as Google will always respect the robots.txt file.
DOS (Denial of Service)
A DOS, or Denial of Service, attack is used maliciously by hackers and criminals, primarily as a means of extorting money from a business through what is essentially ransom. Blitzing a site with enough requests can bring the site down, as the hosting server becomes unable to handle the number of requests.
Taking down a site that makes millions of bucks a day in online trade, then demanding money to stop the DOS attack, is a technique that criminals have tried over the years to get cash. If a site makes $5 million a day in online sales and it would take 3 days to bring the site back online, a $1 million payoff may be an acceptable price for the business.
When building a web crawler, anyone doing the building will need to implement safeguards to ensure that their web crawling software does not make so many requests as to take a site down. Although this would be unintentional in most cases, the end result could cost a business a lot of money and would be both unethical and illegal.
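One common safeguard is a minimum delay between requests to the same site. The sketch below is one simple way to do this and not a description of how any particular crawler works; the `Throttle` class is a hypothetical name, and the right delay is an assumption you would tune per site.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

Calling `throttle.wait()` immediately before every request caps the crawl at one request per `min_interval` seconds, however fast the rest of the crawler runs.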
Why Would You Need a Web Crawler?
Now that you know what a web crawler does and what web crawling is, you might wonder why you would want or need to utilise it. There are many reasons why web crawling is used; the most popular reason (as a paid tool) is SEO (Search Engine Optimisation). We cover this in more detail later in this guide.
Web crawlers act as time savers: they automate work which, if done by a human, would take a very long time… Manually browsing a website to collect data in most cases would be a nightmare of mundanity that could bore a drill bit! If you need to extract all the page titles from a site with 100 pages, you could be looking at hours of work… On a site with 10,000 or a million pages, you would be looking at days to months of work. This becomes unmanageable after a point, and the cost in man-hours involved would make it unaffordable.
Other Forms of Online Marketing
Other than SEO, there are various reasons why a tool like ours is used, most of which are related to online marketing. For example, if you are running paid advertising campaigns through AdWords, social media, or native advertising you will often have a list of URLs that act as landing pages for those campaigns.
When people click an ad in any of your campaigns, you will want to ensure that the landing pages are accessible and working as intended. Because you will be paying for clicks / traffic / people to come to (land on) those pages, you will be wasting money if they are inaccessible. If you have more than a handful of landing pages, checking them manually can take time.
Using a web crawler to check those pages for you saves time and makes the whole process easier.
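A check like this can be sketched with Python’s standard `urllib`. The function names are our own, and sending a HEAD request is an assumption: some servers reject HEAD, in which case a GET would be needed instead.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if it cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status  # status after any redirects were followed
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 or 500
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, timeout or refused connection


def failing_landing_pages(urls, get_status=http_status):
    """Return {url: status} for every landing page that did not answer 200."""
    failures = {}
    for url in urls:
        status = get_status(url)
        if status != 200:
            failures[url] = status
    return failures
```

Running `failing_landing_pages` over your campaign URLs before a campaign goes live, and on a schedule afterwards, returns an empty dictionary when every page is reachable.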
Competitor Analysis & Research
Analysing competitor sites often feeds into a range of processes and research pieces, whether it’s looking for keywords to target, content ideas, benchmarking, or market research. Web crawling can assist in collecting the data you need to make decisions or inform your online strategies.
Tracking & Monitoring
There are components of a website that have the power to prevent the site from appearing in the search results of Google. Monitoring and tracking these components regularly can take seconds if performed by a machine, meaning that unless an alert pops up telling you there is a problem, you can rest assured that nothing vital is wrong.
You can also keep track of key pages on a site, ensuring uptime for high value content like pricing pages or top selling products.
Historical Crawl Data
As Google update their algorithm and websites get updates or makeovers regularly, keeping an historical record of crawls can be invaluable. If a site suddenly or gradually starts losing traffic or rankings, historical crawl data can often help to identify the cause.
Certain metrics, such as the volume of canonical pages, word count, indexable pages, etc., can be plotted over time and correlated with traffic and ranking data. In the simple example below, we show organic traffic against the number of indexable pages over time:
If you saw this, the obvious first place to check would be the reason why the volume of indexable pages dropped so dramatically:
- Has a noindex tag been applied to a load of pages?
- Has the site removed or consolidated a lot of its pages?
- Is there a robots.txt issue preventing Google from crawling most of the site?
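Spotting a drop like this in historical crawl data can be automated. The sketch below flags any crawl where the number of indexable pages fell by more than a chosen fraction compared with the previous crawl; the function name and the 20% default threshold are assumptions made for illustration.

```python
def indexable_drops(history, threshold=0.2):
    """Return (date, previous, current) for each crawl where the number of
    indexable pages fell by more than `threshold` (as a fraction).

    `history` is a list of (date, indexable_page_count) in crawl order.
    """
    drops = []
    for (_, previous), (date, current) in zip(history, history[1:]):
        if previous > 0 and (previous - current) / previous > threshold:
            drops.append((date, previous, current))
    return drops
```

Each flagged date tells you which crawl to open first when working through the questions above.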
Web Crawling & SEO
As previously mentioned, people working in SEO are the principal users of web crawlers, as the data scraped in crawls is used in a range of SEO processes.
Performing technical audits or optimising a site for its target keywords are processes performed ubiquitously in SEO.
For both processes you need to crawl a site, collect, structure and analyse the data to generate recommendations for a site.
You can’t perform a technical audit of a site without crawling and scraping the site. You cannot map keywords to pages without a list of pages, and it’s easier to optimise meta data if you can see the current setup.
Benefits of Web Crawling
Raptor helps SEOs in many ways. We’ve already covered some of the top-level reasons to use a web crawler; below we cover some of the more granular benefits provided by our web crawler.
Identify Indexation Issues
Indexation issues can prevent a page or even an entire website from showing in Google, and as such are incredibly important to identify.
Our SEO tools will check various components of a site and each page and report back to you as to whether there are any indexation issues. You can use this to isolate pages that require your immediate attention.
Find Broken Links
Almost every site will have at least a few broken links. A broken link is a link that, once clicked, takes you to an error page. There are many error types and none of these are desirable.
We provide a broken link report, which can be generated on any site that you have crawled in our web crawler tool. This report lists all the broken links we found on a site and where those links are located so that you can find and fix them.
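The idea behind a broken link report can be sketched from two pieces of crawl data: the links found on each page and the status code of each target. This is an illustration with names of our own choosing, not the format of Raptor’s report.

```python
from collections import defaultdict


def broken_link_report(links, statuses):
    """Group each broken link target with the pages that link to it.

    `links` is an iterable of (source_url, target_url) pairs found in a crawl;
    `statuses` maps each crawled target URL to its HTTP status code.
    """
    report = defaultdict(list)
    for source, target in links:
        status = statuses.get(target)
        if status is None or status >= 400:  # unreachable or an error response
            report[(target, status)].append(source)
    return dict(report)
```

Keying the report on the broken target means one fix (restoring or redirecting that URL) can be weighed against editing every page listed beside it.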
Analyse Page Titles & Meta Data
Raptor collects all meta data and provides this in the SEO report, which can be downloaded after a crawl has completed. You can use this data in several ways, either to identify errors or to optimise the meta data to improve visibility within the SERPs (Search Engine Result Pages) for target keywords.
Improve Internal Linking
Internal links are like the veins and arteries of a site. They allow authority to flow throughout a site and are the main way for users to find your content when on your site.
Our SEO web crawler provides you with all the data you need to assess what pages are being linked to the most and the least. Additionally, we provide a range of filtered data to show you volumes of follow and nofollow links, unique links and total links to pages. You can use this data to improve the internal linking structure of your site.
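Counting total and unique links to each page is straightforward once a crawl has recorded every link as a (source, target) pair. The sketch below is a hypothetical helper, not Raptor’s reporting code.

```python
from collections import Counter


def inbound_link_counts(links, pages):
    """Count total and unique inbound internal links for each page.

    `links` is a list of (source_url, target_url) pairs; a page that links to
    the same target twice adds two to `total` but only one to `unique`.
    """
    total = Counter(target for _, target in links)
    unique = Counter(target for _, target in set(links))
    return {page: {"total": total[page], "unique": unique[page]} for page in pages}
```

Pages with a unique count of zero are orphans as far as internal linking is concerned, and are usually the first candidates for new links.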
XML sitemaps are key to quick and effective indexation. Ensuring that the right pages, and all necessary pages, are listed will help Google index your site more effectively.
We show which pages are listed in XML sitemaps, which aren’t and the URL of the sitemap they are located in.
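Comparing a sitemap with a crawl can be sketched using Python’s standard XML parser. The namespace below is the one defined by the sitemaps.org protocol; the function names are our own, and the sketch only handles a plain `urlset` sitemap, not a sitemap index.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text}


def compare_with_crawl(xml_text, crawled_urls):
    """Show which crawled pages are not in the sitemap, and vice versa."""
    listed = sitemap_urls(xml_text)
    crawled = set(crawled_urls)
    return {"missing_from_sitemap": crawled - listed,
            "not_crawled": listed - crawled}
```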
Discover Duplicate Content
Duplicate content can be an issue for almost any site as it can be devalued by Google.
Using our web crawler, you can quickly and easily see which pages have duplicate meta data or canonical duplication issues. We also make it very easy for you to export this data into a presentable format so that you can easily fix these issues.
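Finding duplicate meta data boils down to grouping pages by the value in question. The sketch below does this for page titles; the function name is hypothetical, and treating titles as equal after lower-casing and collapsing whitespace is an assumption that may be too strict or too loose for some sites.

```python
from collections import defaultdict


def duplicate_titles(pages):
    """Return {normalised title: [urls]} for titles used on more than one page.

    `pages` maps each URL to its page title.
    """
    groups = defaultdict(list)
    for url, title in pages.items():
        key = " ".join(title.lower().split())  # ignore case and extra spaces
        groups[key].append(url)
    return {title: sorted(urls) for title, urls in groups.items() if len(urls) > 1}
```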
Resolve Canonical Issues
Canonical issues have the potential to prevent content from appearing in the SERPs, show the wrong content or simply confuse Google as to what you want them to show. As such, these issues are important to identify and resolve.
In addition to canonical duplication issues mentioned above, you can also use our web crawler tool to isolate a wide range of other canonical issues. One of the checks that our tool performs is to identify canonical pages vs non-canonical pages.
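Telling canonical pages from non-canonical ones can be sketched by reading the `<link rel="canonical">` tag and comparing it with the page’s own URL. This is a simplified illustration with made-up names; a production check would also need to normalise URLs (trailing slashes, letter case, tracking parameters) before comparing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.href is None:
            self.href = attrs.get("href")


def canonical_status(url, html):
    """Classify a page as canonical, non-canonical or missing a canonical tag."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.href:
        return "no canonical tag"
    target = urljoin(url, finder.href)  # resolve relative canonical URLs
    if target == url:
        return "canonical"
    return f"non-canonical, points to {target}"
```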
Benchmark Your Site Against Competitors
Data nearly always requires context, and more context equates to a higher resolution picture of your site and can directly inform strategic planning.
The competitor analysis feature that we provide visually represents data so that you can easily see where your site sits within the competitive landscape. Knowing that your site has 100 pages and 15,000 words is essentially meaningless outside of the competitive context… Is that better or worse than competitors? What volumes do high ranking pages have and how do you compare to that? We can answer all those questions and many more.
Redirections can, at certain levels, slow a site down, they can also cause issues with navigation, and prevent parts of a site from being accessed. Redirects are also essential when migrating a site or individual pages, so ensuring they are in place is essential to preserve past SEO investment.
The redirection tab in our reports gives you everything you need, in the format you need it in, to identify problems and ensure proper implementation.
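Two redirect problems that crawl data exposes well are long chains and loops. The sketch below follows redirects through a mapping of source URL to target URL collected during a crawl; the function name and the default limit of 10 hops are assumptions made for illustration.

```python
def redirect_chain(url, redirects, max_hops=10):
    """Follow `redirects` (source URL -> target URL) starting from `url`.

    Returns (chain, problem), where problem is None, "loop" or "too many hops".
    """
    chain = [url]
    while chain[-1] in redirects:
        if len(chain) > max_hops:  # max_hops taken and still redirecting
            return chain, "too many hops"
        target = redirects[chain[-1]]
        if target in chain:        # we have been here before
            return chain + [target], "loop"
        chain.append(target)
    return chain, None
```

A chain longer than two entries means at least one extra hop that could be removed by pointing the first URL straight at the final destination.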
Visualise Site Data
Tables of data are great, but charts and graphical representations of those data enable you to easily spot patterns, identify opportunities and analyse problems. We provide a suite of visualisations throughout our software to make it as easy as possible for you to understand your data.
Understanding your data is key to deriving insights and the ‘so what?’ that your management and clients need.
Who Uses Web Crawlers?
Web crawlers are used by a range of people and businesses, but these typically fall into one of the following groups.
Because you are not limited by the number of projects that you can create, even on our most basic pricing tier, you can easily manage tens to hundreds of clients.
Take stock of your digital assets with cost effective packages capable of crawling millions of web pages. Our tool can handle high volume websites of all types.
SEO Consultants & Freelancers
Cost is always important for independent SEOs and Raptor provides some of the best value for money. With packages designed to be affordable while providing adequate URL limits for SEO freelancers, we are the obvious choice.
If you are particularly geeky, which we are, you might enjoy some of these white papers on web crawling:
- Web Crawler Research Methodology
- Study of Web Crawler and its Different Types
- Survey of the science and practice of web crawling