
Website Scraper

Raptor helps your business increase online sales, revenue and profits through effective SEO.

Our suite of SEO tools includes functions like web crawling and website scraping, so that we can compile, aggregate, analyse and process the scraped data into tables and charts. We do this so that you can perform SEO audits and technical audits, assess page speed and carry out competitor analysis.

We also compile this data into the SEO reporting section of our software, where you can visualise and analyse it.

Scraping websites for data is most efficient when run from servers rather than your own computer. We have developed a robust cloud-based website scraper that allows you to quickly and easily scrape websites for any data that you want or need for a range of SEO processes such as those described above.


Website Scraping in Layman’s Terms

Our web scraper is an automation tool. The easiest way to explain this is to imagine you wanted to build a list of all the web pages and resources (such as images and files) on a site. Done manually, this would be a laborious process: you would need to use a web browser to look at each page, copy the URL and paste it into a spreadsheet or document.

Now imagine you also want to know the number of words on each page: you would need to either count them yourself or use some tool to establish the word count for every page. The more data you want to extract from a page, the bigger the job; on a site with more than a few pages, this becomes a time-heavy and boring process. Our web scraper does all of this for you.
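
To make this concrete, here is a minimal sketch (not Raptor’s actual implementation) of the kind of loop a web scraper automates, written in Python with the requests and BeautifulSoup libraries; the start URL is a placeholder:

```python
# Minimal single-site scraper sketch: collect each page's URL and word count.
# Illustrative only -- a production crawler adds politeness delays, robots.txt
# checks, error handling and many more data points.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start_url = "https://example.com/"          # placeholder start page
domain = urlparse(start_url).netloc
to_visit, seen, results = [start_url], set(), {}

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Record the kind of per-page data you would otherwise gather by hand.
    results[url] = {
        "status": response.status_code,
        "word_count": len(soup.get_text().split()),
    }
    # Queue further pages on the same domain.
    for link in soup.find_all("a", href=True):
        next_url = urljoin(url, link["href"])
        if urlparse(next_url).netloc == domain:
            to_visit.append(next_url)

print(results)
```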

 

Benefits of Raptor's SEO Website Scraper

Easily Crawl Websites

Scrape All Website Data

Scrape All SEO Data

Download Data in Various Formats

View & Analyse Scraped Data Online

Cloud-Based Web Scraper

 

Why is Cloud-Based Website Scraping Best?

Installing a web scraper on your computer comes with various disadvantages compared with a cloud-based alternative:

 

Desktop-Based Website Scrapers

  • Any problem with your computer and you can’t scrape
  • You can only scrape from the computer with the program installed
  • All requests come from your computer’s IP address (which can get blocked by anti-scraping technologies)
  • You often need to buy and implement your own proxy servers to scrape from multiple IP addresses
  • Updates to the operating system have the potential to prevent the program from functioning as intended
  • Typically, no analysis is performed on the raw data that you export (less automation = more work for you)
  • You need to keep your computer on and connected to the internet for the full duration of a crawl
    • If your computer crashes, the crawl fails
    • If your computer turns off or restarts, the crawl fails
    • If the internet drops out, the crawl fails
    • If the program crashes, the crawl fails
  • Very inefficient on larger sites

 

Cloud-Based Website Scraping

  • Set and forget: set up a crawl and come back when we email you to let you know it’s complete
  • You don’t need to have the web page open for the crawl to complete
  • No processing, memory or hard drive usage for you during crawls
  • All you need is a web browser, no operating system or app conflicts
  • Access your account from multiple devices, including your phone
  • Let us worry about IP addresses
  • The most efficient way to scrape website data on sites of any size
  • Schedule crawls
  • Past crawls are all archived and easy to access at any time and from any location


Scraped Web Data

There are many reasons and situations in which you might need a website scraper; whatever your reason, Raptor provides an easy-to-use web scraping tool. Typically, our users scrape data for SEO, whether from a competitor’s site or their own, and Raptorbot scrapes all the SEO data you need.

Below is an up-to-date list of the different types of data that we scrape from a site, on a per-URL basis.

The data shown below is the raw data that we scrape from websites; it is available in various downloadable formats, such as custom CSV files and pre-formatted Excel spreadsheets.

 

[Image: Web Crawler - Excel report exporting for SEO audits]

 

  • URL – of all pages, images, videos and resources.
  • File Type – Such as HTML, CSS, Jpeg, SWF, etc.
  • Status – The status code returned by a URL, such as 200, 301, etc.
  • Indexable – HTML/Text pages that are not restricted by a robots.txt or meta robots tag from being indexed and have a status code of 200.
  • Non-Indexable – Pages that are not indexable due to robots.txt or meta robots tags, or a status code other than 200.
  • Crawlable – Pages and resources that are not disallowed by the robots.txt.
  • Canonical – HTML Pages with a self-referential canonical tag.
  • Non-Canonical – HTML pages with a canonical tag that links to another page / URL.
  • Canonical URL – The URL within the canonical tag.
  • Page Title – The page title or meta title of each page.
  • Page Title (Length) – The number of characters including punctuation and spaces of the page title.
  • Meta Description – This is scraped from every page.
  • Meta Description (Length) – The number of characters including punctuation and spaces of the meta description.
  • Meta Keywords – This is scraped from every page.
  • Meta Keywords (Length) – The number of characters including punctuation and spaces of the meta keywords.
  • Implemented GA Tracking – Whether tracking code is implemented in some form on every HTML page.
  • UA Number (First) – The first UA number (for Google Analytics) identified on each page.
  • UA Number (Second) – The second UA number (for Google Analytics), if present on a page.
  • OG Tags – We scrape all Open Graph (Facebook) tags on each page.
  • Twitter Cards – We scrape all Twitter Card tags on each page.
  • Google+ Tags – We scrape all Google+ tags on each page.
  • H1 (First) – The first H1 header from each HTML page.
  • H1 (Second) – The second H1 header from each HTML page.
  • H2 (First) – The first H2 header from each HTML page.
  • H2 (Second) – The second H2 header from each HTML page.
  • H2 (Third) – The third H2 header from each HTML page.
  • H2 (Fourth) – The fourth H2 header from each HTML page.
  • H2 (Fifth) – The fifth H2 header from each HTML page.
  • Other H tags – We scrape all header tags on each page.
  • Word Count – The number of words on an HTML/text page.
  • Text Ratio – The ratio of text to code on each HTML page.
  • URL Length – The number of characters in each URL.
  • Page Depth – The depth of a page within the structure of the site.
  • Redirect to – Where a redirect exists, this identifies the URL it redirects to.
  • Linked from XML Sitemap – Yes or No.
  • In links – The number of links pointing to each URL.
  • Unique In links – The number of unique links (one per page) pointing to each URL.
  • Follow In links – The number of ‘follow’ links pointing to each URL.
  • Outlinks – The number of links pointing to other pages on the same domain, for each URL.
  • Unique Outlinks – The number of unique links pointing to other pages on the same domain, for each URL.
  • Follow Outlinks – The number of ‘follow’ links pointing to other pages on the same domain, for each URL.
  • External Links – The number of links pointing to another domain, for each URL.
  • Response Time – in milliseconds (ms).
  • Size (KB) – of all pages, images, videos and resources.
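
As an illustration of the ‘Crawlable’ field above, a robots.txt check can be sketched with Python’s standard-library robotparser; the site and URLs below are placeholders, and this is not Raptor’s implementation:

```python
# Sketch of a robots.txt 'Crawlable' check using only the standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder site
parser.read()

for url in ["https://example.com/page", "https://example.com/private/page"]:
    # can_fetch() answers: may this user agent crawl this URL?
    print(url, "crawlable" if parser.can_fetch("*", url) else "disallowed")
```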


This data can be exported from the reporting section of our software and downloaded in various formats:

 

[Image: Web Crawler - Create custom CSV exports for SEO audits]

 

Website Scraper Summary Data

We also process and analyse this data so that we can summarise it for you; you can see all of this within the reporting section of the web scraper tool. Using the raw data listed above, we produce an online viewable version of the summary data, grouped by category below:

 

Crawl Summary Data

  • No. of URLs Crawled – The total number of URLs actually crawled during the crawl of your site. This includes all file types (HTML pages, images, CSS files, etc.) and URLs that return a 3XX, 4XX, 5XX or any other response code. If a crawl was paused or interrupted, this number will not match the number of URLs found during the crawl.
  • HTML Pages – HTML pages are typically the standard webpages you would expect to view on a website. This metric shows the number of these pages crawled during the crawl of your site.
  • Non-HTML Pages – The number of non-HTML URLs crawled during the crawl of your site, such as images, CSS files, JavaScript files and PDFs.
  • Indexable Pages – Pages that can be indexed by Google; in terms of the checks we perform, these are pages that do not have a noindex tag, are not disallowed by the robots.txt file and return a status code of 200.
  • Non-Indexable Pages – Pages that cannot be indexed by Google; in terms of the checks we perform, these are pages that have a noindex tag and/or are disallowed by the robots.txt file.
  • Status 200 Pages – URLs that return a response / status code of ‘200’, meaning they are accessible to users and robots.
  • Non-Status 200 Pages – URLs that return a response / status code other than ‘200’, meaning they are inaccessible to users and robots. The reason may be anything from a redirect to a server error.
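
The ‘Indexable Pages’ definition above reduces to a simple rule: status 200, allowed by robots.txt, and no noindex directive. A sketch of that rule in Python, assuming you already have each page’s status code, meta robots value and robots.txt verdict:

```python
# Sketch of the indexability rule described above: status 200, not disallowed
# by robots.txt, and no noindex directive in the meta robots tag.
def is_indexable(status_code: int, meta_robots: str, robots_txt_allows: bool) -> bool:
    if status_code != 200:
        return False
    if not robots_txt_allows:
        return False
    return "noindex" not in meta_robots.lower()

print(is_indexable(200, "index, follow", True))   # True  -> indexable
print(is_indexable(200, "noindex", True))         # False -> noindex tag
print(is_indexable(301, "", True))                # False -> not status 200
```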

Indexation Data

  • Noindex Pages – Pages that have a noindex tag present. This tag prevents the page from being indexed by Google and thus from appearing in the organic search results.
  • Disallowed URLs – URLs that are disallowed by the robots.txt file. This prevents the page from being crawled by Google and thus from appearing in the organic search results.
  • 3XX Pages – URLs that redirect to another page. They can have different status / response codes, such as 301 (a permanent redirect) or 302 (a temporary redirect).
  • 4XX Pages – Pages that return a status code beginning with 4, meaning they are inaccessible, typically because the page was not found or does not exist. These are errors and need to be resolved.
  • 5XX Pages – URLs that return a server error and are consequently inaccessible. These are errors and need to be resolved.
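
Status codes like these are grouped by their leading digit; a minimal sketch of that bucketing, over invented sample data:

```python
# Group crawled URLs into 2XX/3XX/4XX/5XX buckets by the status code's
# leading digit. The crawl results here are invented sample data.
from collections import defaultdict

crawl_results = {
    "https://example.com/": 200,
    "https://example.com/old-page": 301,
    "https://example.com/missing": 404,
    "https://example.com/broken": 500,
}

buckets = defaultdict(list)
for url, status in crawl_results.items():
    buckets[f"{status // 100}XX"].append(url)

for family, urls in sorted(buckets.items()):
    print(family, len(urls), urls)
```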

Canonical Data

  • Canonical Pages – Pages with a self-referential canonical tag; these are the preferred versions of pages on a site. For example, if a site is accessible both with and without the ‘www’, we would apply a canonical tag to each page specifying which one is the preferred version. If a page is accessible from multiple URLs, canonical tags are used to determine which is canonical.
  • Non-Canonical Pages – Pages with a canonical tag that links to another URL; these pages are typically not going to appear in the search results. If a page is accessible from multiple URLs, canonical tags are used to determine which is canonical and which is not.
  • Missing Canonical Tag – Every page should contain a canonical tag with a canonical URL specified within it. This helps ensure that the preferred version of content is shown in the search results. Ensure that every HTML page on your site contains a canonical tag.
  • HTTP URLs – URLs that do not use a secure protocol.
  • HTTPS URLs – URLs that use a secure protocol.
  • WWW URLs – URLs that are accessible using the ‘www’, such as https://www.example.com
  • Non-WWW URLs – URLs that are accessible without the ‘www’, such as https://example.com
  • Trailing Slash URLs – URLs that include a trailing slash at the end, such as https://example.com/page/
  • Non-Trailing Slash URLs – URLs that do not include a trailing slash at the end, such as https://example.com/page

Canonical Content

  • Thin Content Pages – Canonical pages with fewer than 100 words of content are considered ‘thin content’ pages; pages with thin content are unlikely to rank as well as they could in the organic search results. Canonical pages are the versions of content that Google will typically prefer to show in the search results, so ensuring that these have enough content is essential for organic visibility. Consider adding content to these pages if they are valuable landing pages; a minimum of 250 words is recommended.
  • Nearly Thin Content Pages – Canonical pages with fewer than 250 words of content could be considered ‘thin content’ pages. Consider adding content to these pages if they are valuable landing pages; a minimum of 250 words is recommended.
  • 251 to 500 Words – Canonical pages with 251-500 words of content have a sufficient amount of content. This data is informational and shows the distribution of content across a site.
  • 501 to 1,000 Words – Canonical pages with 501-1,000 words of content have a sufficient amount of content. This data is informational and shows the distribution of content across a site.
  • 1,001 to 2,000 Words – Canonical pages with 1,001-2,000 words of content have a significant amount of content. This data is informational and shows the distribution of content across a site.
  • Over 2,000 Words – Canonical pages with more than 2,000 words of content have a large amount of content. This data is informational and shows the distribution of content across a site.

Linking Data

  • Follow Links – Follow in links are internal links that are set to ‘follow’, which means they pass on authority. This is great if you are linking to pages that are indexable or that you want to promote within your site. We show this data so that you can see which pages receive the most follow links.
  • Nofollow Links – Nofollow links do not pass on authority and can be used for many legitimate reasons; we flag them as part of our summary data to keep you informed. This also gives you the chance to review these links to confirm they are legitimate.
  • Isolated Pages – Pages with fewer than 4 follow in links are poorly linked to, which reduces the amount of authority they have and impedes user navigation to them. This is especially important for canonical pages, as these are by definition the pages you want indexed by Google. You should consider adding more internal links pointing to these pages.
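
The ‘Isolated Pages’ check comes down to counting follow in links per URL. A sketch over an invented link graph; the threshold of 4 follow in links is taken from the description above:

```python
# Flag pages with fewer than 4 follow in links, per the check above.
# The link graph is invented sample data: (source, target, is_follow).
from collections import Counter

links = [
    ("/a", "/popular", True), ("/b", "/popular", True),
    ("/c", "/popular", True), ("/d", "/popular", True),
    ("/a", "/lonely", True),  ("/b", "/lonely", False),  # nofollow ignored
]

follow_inlinks = Counter(target for _, target, follow in links if follow)

for page in {target for _, target, _ in links}:
    if follow_inlinks[page] < 4:
        print(f"{page}: isolated ({follow_inlinks[page]} follow in links)")
```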

Content Data

  • Total Words – The total word count across the entire site, regardless of whether the page is canonical or indexable.
  • <100 Words – Pages with fewer than 100 words of content are considered ‘thin content’ pages; pages with thin content are unlikely to rank as well as they could in the organic search results. Consider adding content to these pages if they are valuable landing pages; a minimum of 250 words is recommended.
  • 101 - 250 Words – Pages with 101-250 words of content could be considered ‘thin content’ pages. Consider adding content to these pages if they are valuable landing pages; a minimum of 250 words is recommended.
  • 251 - 500 Words – Pages with 251-500 words of content have a sufficient amount of content. This data is informational and shows the distribution of content across a site.
  • 501 - 1,000 Words – Pages with 501-1,000 words of content have a sufficient amount of content. This data is informational and shows the distribution of content across a site.
  • 1,001 - 2,000 Words – Pages with 1,001-2,000 words of content have a significant amount of content. This data is informational and shows the distribution of content across a site.
  • >2,000 Words – Pages with more than 2,000 words of content have a large amount of content. This data is informational and shows the distribution of content across a site.
  • Total Images – The total number of images across the site; informational, showing the distribution of images across a site.
  • Image Count Buckets (0, 1 - 5, 6 - 10, 11 - 15, 16 - 20 and >20 Images) – These counts are informational and show the distribution of images across a site.
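
The word-count bands above are a simple bucketing exercise; a sketch with invented page data, with band boundaries following the descriptions above:

```python
# Assign each page's word count to the bands used in the summary above.
def word_count_band(words: int) -> str:
    if words < 100:
        return "<100 Words (thin)"
    if words <= 250:
        return "101 - 250 Words (nearly thin)"
    if words <= 500:
        return "251 - 500 Words"
    if words <= 1000:
        return "501 - 1,000 Words"
    if words <= 2000:
        return "1,001 - 2,000 Words"
    return ">2,000 Words"

pages = {"/home": 80, "/blog/guide": 1500, "/about": 320}  # sample data
for url, words in pages.items():
    print(url, "->", word_count_band(words))
```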

Page Speed Data

  • <1 Sec Pages – Pages that fall into this category are fast loading.
  • 1 - 2 Sec Pages – Pages in this category load in under 2 seconds, which Google recommends as the maximum load time for pages.
  • 2 - 3 Sec Pages – Pages that take 2 to 3 seconds to load are slow and should be reviewed. Google places significant weight on page load times due to the impact of page speed on the user experience. Review the pages identified in this check for page speed issues.
  • 3 - 4 Sec Pages – Pages that take 3 to 4 seconds to load are very slow and should be reviewed for page speed issues.
  • 4 - 5 Sec Pages – Pages that take 4 to 5 seconds to load are very slow and should be reviewed for page speed issues.
  • >5 Sec Pages – Pages that take 5 or more seconds to load are extremely slow and should be reviewed for page speed issues.
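
As a rough illustration, a URL’s response time can be measured and banded in a few lines of Python; note that requests measures time to first response, not the full page load time these bands describe, so treat this as an approximation (the URL is a placeholder):

```python
# Measure a URL's response time and place it in the load-time bands above.
# Note: response.elapsed measures time-to-response, not full page rendering.
import requests

response = requests.get("https://example.com/", timeout=10)  # placeholder
seconds = response.elapsed.total_seconds()

bands = [(1, "< 1 Sec (fast)"), (2, "1 - 2 Secs"), (3, "2 - 3 Secs (slow)"),
         (4, "3 - 4 Secs (very slow)"), (5, "4 - 5 Secs (very slow)")]
label = next((name for limit, name in bands if seconds < limit),
             "> 5 Secs (extremely slow)")
print(f"{seconds * 1000:.0f} ms -> {label}")
```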

Meta Data

  • All Meta Description Pages – A list of all meta descriptions across the whole site.
  • Missing Meta Description Pages – Meta descriptions are an indirect ranking factor, as they have a powerful impact on Click Through Rate (CTR). They should encourage people to click on your organic listings and reflect the content the user will find on the page. If you fail to specify a meta description, Google will pull content from the page into the search results; this will often not be optimised to encourage CTR and can lead to reduced organic visibility. Ensure that meta descriptions are present and unique across all pages that you want to appear in the search results.
  • Long Meta Description Pages – Meta descriptions that exceed 160 characters are ‘too long’ and will be truncated within the search results, so the content beyond the character limit will not show. Optimise meta descriptions within the character limit.
  • Short Meta Description Pages – Short meta descriptions are often a sign of a poor meta description. Although not always the case, they often do not sell the product or encourage click-through to the page. Providing more information about your service or the page content can make the difference between someone clicking and someone going elsewhere. We recommend using the full character allowance to encourage users to click on your listing.
  • Multiple Meta Description Pages – Pages with more than one meta description. Google can only show one meta description per page, so having multiple meta descriptions on a single page is by definition an error. Review the pages identified and remove all bar one of the meta descriptions; choose the best one or create a new single meta description for each page.
  • Duplicate Meta Description Sets – Canonical pages should have unique meta descriptions, as meta descriptions are an indirect ranking factor with a powerful impact on CTR. Ensure that all meta descriptions are unique across your canonical pages.
  • All Page Title Pages – A list of all page titles across the whole site.
  • Missing Page Title Pages – Pages that do not contain a page title. Page titles are factored into Google’s ranking algorithm and are powerful on-page SEO components; they should contain the target keyword/s for the page, and each page should target a unique set of keywords. Ensure that all pages have a page title; even non-indexable landing pages used for paid advertising should typically have one.
  • Long Page Title Pages – Page titles that exceed 70 characters are ‘too long’ and will be truncated within the search results, so the content beyond the character limit will not show. Optimise page titles within the character limit.
  • Short Page Title Pages – Short page titles are often a sign of a poor title. Although not always the case, they often do not encourage click-through to the page. Providing more information, such as relevant keywords, can assist in improving click through rate. We recommend using the full character allowance to encourage users to click on your listing.
  • Duplicate Page Title Sets – Canonical pages should have unique page titles. Page titles should contain the target keyword/s for the page, and each page should target a unique set of keywords. Ensure that all page titles are unique across your canonical pages.
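
The length checks above amount to reading two tags and counting characters. A sketch over an inline HTML snippet, using the 70- and 160-character limits from the descriptions above:

```python
# Sketch of the page title / meta description length checks described above,
# using the 70- and 160-character limits from the text.
from bs4 import BeautifulSoup

html = """<html><head>
  <title>Example Page Title</title>
  <meta name="description" content="A short description.">
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text() if soup.title else ""
meta = soup.find("meta", attrs={"name": "description"})
description = meta["content"] if meta else ""

print(f"title: {len(title)} chars", "(too long)" if len(title) > 70 else "(ok)")
print(f"description: {len(description)} chars",
      "(missing)" if not description
      else "(too long)" if len(description) > 160 else "(ok)")
```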

Google Analytics Data

  • Missing GA Code Pages – Pages that do not have Google Analytics (GA) code present. If your site uses Google Tag Manager or another analytics package altogether, this is not an error. If, however, your site does use GA tracking code to track users, the pages identified in this check will not be tracking users because the code is missing. Review these pages and ensure that the lack of GA code is intentional; if it is not, add the tracking code to these pages.
  • Multiple GA Code Pages – GA code facilitates the tracking of users on a site; multiple iterations of GA code on a single page can cause tracking issues, resulting in lost, fragmented or double-counted user data. Depending on the number of iterations, it can also affect load times. Ensure that each page has only a single iteration of GA code.
  • Legacy GA Code Pages – Pages that contain legacy GA code, meaning the code is old and needs to be updated. Consult the instructions within your Google Analytics account for the latest version of the tracking code and implement it on the pages identified in this check.
  • Missing Tag Manager Code Pages – Pages that do not have Google Tag Manager code present. If your site uses GA tracking code or another analytics package altogether, this is not an error. If, however, your site does use Tag Manager to track users, the pages identified in this check will not be tracking users because the code is missing. Review these pages and ensure that the lack of Tag Manager code is intentional; if it is not, add the Tag Manager code to these pages.
  • Multiple Tag Manager Code Pages – Tag Manager code facilitates the tracking of users on a site; multiple iterations of Tag Manager on a single page can cause tracking issues, resulting in lost, fragmented or double-counted user data. Depending on the number of iterations, it can also affect load times. Ensure that each page has only a single iteration of Tag Manager code.
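
Detecting classic GA tags (the UA numbers listed in the raw scraped data earlier) is essentially a pattern match on the page source. A sketch; note that the UA-XXXXXXXX-X format belongs to Universal Analytics, while newer GA4 properties use G-XXXXXXXXXX measurement IDs instead:

```python
# Find Universal Analytics (UA) property IDs in a page's HTML source.
# GA4 uses 'G-XXXXXXXXXX' measurement IDs; adjust the pattern for those.
import re

html = """<script>
  ga('create', 'UA-12345678-1', 'auto');   // sample page source
  ga('create', 'UA-87654321-2', 'auto');   // a second tracker
</script>"""

ua_numbers = re.findall(r"UA-\d{4,10}-\d{1,4}", html)
print("first UA number:", ua_numbers[0] if ua_numbers else None)
print("second UA number:", ua_numbers[1] if len(ua_numbers) > 1 else None)
if len(ua_numbers) > 1:
    print("multiple GA trackers found; check for double counting")
```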


When you view this data online in the reporting section of our SEO tool, you will see it presented like this:

 

[Image: Web Crawler - Report data and drill-down details for SEO audits]

 

Why Use Our Web Scraper?

This scraped data helps you to perform a range of SEO functions:

  • Technical audits
  • Optimisation audits
  • Competitor analysis
  • Keyword Research (Competitive Data)

Unlike some of our competitors, we perform some analysis for you. Most notably, we tell you which pages are indexable based on a range of criteria (see above). We also tell you which URLs are canonical; this is based on whether the crawled URL exactly matches the URL specified within the canonical tag. For example (a code sketch of the comparison follows the list):

Canonical URL = https://example.com/page/

  • https://www.example.com/page/ = Not canonical (uses www)
  • http://example.com/page/ = Not canonical (not https)
  • https://example.com/another-page/ = Not canonical (is a completely different URL)
  • https://example.com/page = Not canonical (not using a trailing slash)
  • https://example.com/Page/ = Not canonical (uses a capital letter)
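
A sketch of that exact-match comparison; each example URL below corresponds to one of the mismatch reasons in the list:

```python
# Exact-match canonical check: a crawled URL is canonical only if it is
# string-identical to the URL in its canonical tag (scheme, www, case,
# trailing slash and path must all match).
def is_canonical(crawled_url: str, canonical_url: str) -> bool:
    return crawled_url == canonical_url

canonical = "https://example.com/page/"
for url in [
    "https://example.com/page/",       # canonical
    "https://www.example.com/page/",   # not canonical (uses www)
    "http://example.com/page/",        # not canonical (not https)
    "https://example.com/page",        # not canonical (no trailing slash)
    "https://example.com/Page/",       # not canonical (capital letter)
]:
    print(url, "->", is_canonical(url, canonical))
```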

 

Check out the video below to see some of the benefits that our website scraper has to offer in terms of reporting on the data that we scrape and analyse.

[Video]

SEO WEB CRAWLER - FREE 30-DAY TRIAL!

A 30-day free trial of our SEO web crawler is now available. Sign up with a valid email address and your name below to get instant access. No credit card required.