How to Crawl a Site

In this guide we explain how to crawl a site that you have previously added to Raptor’s web crawler within a project.

The two biggest factors in how long a project takes to crawl are the number of URLs we must crawl and the load times of those URLs. As a guide, we typically crawl at around 10 URLs per second for each site being crawled.
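
As a rough illustration of the 10 URLs per second figure, the sketch below estimates a crawl's duration from the URL count alone. The function names and the fixed rate are assumptions made for this example, not part of Raptor; real crawl times also depend on page load times and on any queueing.

```python
def estimate_crawl_seconds(url_count: int, urls_per_second: float = 10.0) -> float:
    """Rough estimate of crawl time; ignores slow pages and queueing."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


# A hypothetical site with 50,000 URLs at the assumed rate.
print(format_duration(estimate_crawl_seconds(50_000)))  # prints "1h 23m 20s"
```

Treat the result as a ballpark only: a site with slow-loading pages may take noticeably longer than the estimate suggests.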

 

Step 1: Login

Open a web browser and navigate to:

https://tools.raptor-dmt.com/

Once there, log in with your Raptor username and password:

 

Sign in to Raptor Web Crawler

You can click on the ‘Remember Me’ tick box to save your details for future access on that device.

Then click on ‘Sign In’.

 

Step 2: Choose a Project

Click the ‘project name’ link in the table shown in the screenshot below to view a project:

Choose a Project

 

Step 3: Click ‘Crawl’

Click the ‘Crawl’ link in the table shown in the screenshot below to crawl a single site:

Crawl All Sites in Project

Our web crawler is quick, but large projects can still take a while, so you can start a crawl and leave it to run. You don’t need to stay logged in for crawls to continue.

 

Step 4: See Crawl

You will be taken to the following page where you can see the status of the crawl and some data about the types of content & response codes we are identifying.

Crawl All Sites in Project

The charts and data are updated every second.

The process for our crawler is the following:

  1. Find XML sitemap(s)
  2. Crawl XML sitemap(s)
  3. Build list of URLs from sitemaps
  4. Crawl list of URLs
  5. Add new URLs found during crawl to the list
  6. Continue until all URLs that we can find have been crawled
  7. Analyse data in preparation for reports
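
Steps 3 to 6 above can be sketched as a simple queue-based loop. This is a minimal illustration under stated assumptions, not Raptor's actual implementation: the `crawl` function, the `fetch_links` callback and the in-memory `SITE` map are all hypothetical, and the sitemap URLs are assumed to have been collected already (steps 1 and 2).

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    sitemap_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
) -> list[str]:
    """Crawl breadth-first, starting from the URLs listed in the sitemaps.

    fetch_links(url) is a hypothetical callback that returns the URLs
    discovered on the page at `url`.
    """
    queue: deque[str] = deque()
    seen: set[str] = set()

    # Step 3: build the initial list of URLs from the sitemaps.
    for url in sitemap_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    crawled: list[str] = []
    # Steps 4 and 6: keep crawling until no uncrawled URLs remain.
    while queue:
        url = queue.popleft()
        crawled.append(url)
        # Step 5: add newly found URLs to the list.
        for found in fetch_links(url):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return crawled


# A toy site held in memory so the example runs without network access.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": [],
}

order = crawl(["/", "/about"], lambda url: SITE.get(url, []))
print(order)  # ['/', '/about', '/blog', '/blog/post-1']
```

The `seen` set is what lets the loop terminate: each URL is queued at most once, even when pages link back to each other.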

If usage is very high you may see that some sites have been queued before crawling.

 

Step 5: See All Active Crawls

You can see all active crawls by clicking ‘Active Crawls’ in the side menu:

Click Active Crawls

Once you have navigated to the ‘active crawls’ page, you will see all your current crawls and their progress:

See Active Website Crawls

 

A Bit More About Crawling, URLs & Usage

In any month you can crawl any number of URLs up to and including the limit stated for your pricing plan. Each URL crawled counts towards that limit until it is reached.
URLs are not limited to standard HTML web pages. They can also include:

  • Images
  • CSS files
  • JS files
  • PHP files
  • Video files
  • External links
  • Inaccessible Pages & Broken Links
    • 4XX error pages
    • 5XX error pages
  • Redirects
    • 301 Redirects
    • 302 Redirects
  • Canonical duplicates
    • www / non-www
    • http / https
    • with and without trailing slash
    • upper and lower case URL characters
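
To see how canonical duplicates can add to the URL count, the sketch below groups URLs that differ only in the four ways listed above. It is an illustration under assumptions, not how Raptor detects duplicates: `canonical_key` and `group_duplicates` are hypothetical helpers, the example URLs are made up, query strings are ignored, and lower-casing the path is a simplification (many servers treat paths as case-sensitive).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Collapse scheme, www, trailing-slash and case differences into one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same host
    # Lower-case the path and drop any trailing slash; keep "/" for the root.
    path = parts.path.lower().rstrip("/") or "/"
    return host + path  # the scheme (http / https) is deliberately left out


def group_duplicates(urls):
    """Group URLs that share the same canonical key."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return dict(groups)


urls = [
    "http://example.com/Page",
    "https://www.example.com/page/",
    "https://example.com/page",
    "https://example.com/other",
]
groups = group_duplicates(urls)
for key, members in groups.items():
    print(key, len(members))
# example.com/page 3
# example.com/other 1
```

In this made-up example, three of the four URLs point at what is likely the same page, yet each variant is a separate URL as far as a crawl is concerned.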

 
