How to Crawl a Site
In this guide we explain how to crawl a site that you have previously added to Raptor’s web crawler within a project.
The two biggest factors in how long a project takes to crawl are the number of URLs we must crawl and the load times of those URLs. With that in mind, we typically crawl at around 10 URLs per second for each site being crawled.
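As a rough illustration of what that rate means in practice, the sketch below estimates crawl time from a URL count. It assumes a steady 10 URLs per second and ignores load times and queueing, so treat the result as a ballpark figure rather than a guarantee. The function names are ours, for illustration only, and are not part of Raptor.

```python
def estimate_crawl_seconds(url_count: int, urls_per_second: float = 10.0) -> float:
    """Ballpark crawl time in seconds, assuming a constant crawl rate."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second


def format_duration(seconds: float) -> str:
    """Turn a number of seconds into an 'Xh Ym Zs' string."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Example: a hypothetical site with 50,000 URLs at around 10 URLs per second.
print(format_duration(estimate_crawl_seconds(50_000)))  # 1h 23m 20s
```

Sites with slow-loading pages will usually take longer than this estimate suggests.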
Step 1: Login
Open a web browser and navigate to:
Once there, log in with your Raptor Username and Password:
You can click on the ‘Remember Me’ tick box to save your details for future access on that device.
Then click on ‘Sign In’.
Step 2: Choose a Project
Click the ‘project name’ link in the table shown in the screenshot below to view a project:
Step 3: Click ‘Crawl’
Click the ‘Crawl’ link in the table shown in the screenshot below to crawl a single site:
Our web crawler is very quick, but for large projects you can set and forget. You don’t need to stay logged in for crawls to run.
Step 4: See Crawl
You will be taken to the following page where you can see the status of the crawl and some data about the types of content & response codes we are identifying.
The charts and data are updated every second.
The process for our crawler is the following:
- Find XML sitemap(s)
- Crawl XML sitemap(s)
- Build list of URLs from sitemaps
- Crawl list of URLs
- Add new URLs found during crawl to the list
- Continue until all URLs that we can find have been crawled
- Analyse data in preparation for reports
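The steps above can be sketched as a simple queue-based loop. This is a minimal illustration of the general idea, not Raptor’s actual implementation: the site and sitemap are hypothetical in-memory stand-ins (`FAKE_SITE`, `FAKE_SITEMAP`), and `fetch_links` replaces the real work of downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/blog/post-1": [],
}

# Hypothetical list of URLs built from the site's XML sitemap(s).
FAKE_SITEMAP = ["https://example.com/", "https://example.com/about"]


def fetch_links(url):
    """Stand-in for fetching a page and extracting the links on it."""
    return FAKE_SITE.get(url, [])


def crawl(sitemap_urls):
    """Crawl the sitemap URLs, then keep going until no new URLs are found."""
    start = list(dict.fromkeys(sitemap_urls))  # drop duplicates, keep order
    queue = deque(start)
    seen = set(start)
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # add new URLs found during the crawl
                seen.add(link)
                queue.append(link)
    return crawled  # ready to be analysed for reports


for url in crawl(FAKE_SITEMAP):
    print(url)
```

Here the blog pages are not in the sitemap, but they are still crawled because they are discovered through links on the home page.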
If usage is very high, you may see that some sites are queued before they are crawled.
Step 5: See All Active Crawls
You can see the active crawls by clicking the ‘active crawls’ link in the side menu:
Once you have navigated to the ‘active crawls’ page, you will see all your current crawls and their progress:
A Bit More About Crawling, URLs & Usage
You can crawl any number of URLs in a month, up to and including the limit stated for your pricing plan. Each URL crawled counts towards that limit until it is reached.
URLs are not limited to standard HTML web pages; they can also include:
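To make the arithmetic concrete, here is a small sketch of how a monthly allowance works out. The limit and usage figures are made up for illustration and do not correspond to any particular pricing plan, and the function names are hypothetical.

```python
def remaining_urls(monthly_limit: int, urls_used: int) -> int:
    """URLs still available this month (never below zero)."""
    return max(monthly_limit - urls_used, 0)


def can_crawl(monthly_limit: int, urls_used: int, urls_needed: int) -> bool:
    """True if a crawl of urls_needed URLs fits within the allowance.

    The limit is inclusive, so using exactly the remaining amount is allowed.
    """
    return urls_needed <= remaining_urls(monthly_limit, urls_used)


# Hypothetical plan: 100,000 URLs per month, 92,500 already used.
print(remaining_urls(100_000, 92_500))    # 7500
print(can_crawl(100_000, 92_500, 7_500))  # True
print(can_crawl(100_000, 92_500, 7_501))  # False
```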
- CSS files
- JS files
- PHP files
- Video files
- External links
- Inaccessible Pages & Broken Links
- 4XX error pages
- 5XX error pages
- 301 Redirects
- 302 Redirects
- Canonical duplicates
  - www / non-www
  - http / https
  - with and without trailing slash
  - upper and lower case URL characters
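Canonical duplicates are easiest to understand with an example. The sketch below groups URLs that differ only by www / non-www, http / https, a trailing slash, or letter case. It is an illustration of the idea under our own assumptions, not a description of how Raptor detects duplicates, and the function names are hypothetical. It uses only Python’s standard `urllib.parse` module.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def duplicate_key(url: str) -> str:
    """Reduce a URL to a key that is shared by its likely canonical duplicates.

    Assumption for this sketch: the scheme, a leading 'www.', a trailing slash
    and letter case are all ignored. Ports and fragments are ignored too.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.lower().rstrip("/") or "/"
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key


def group_duplicates(urls):
    """Return only the groups that contain more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[duplicate_key(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


urls = [
    "http://www.example.com/Page/",
    "https://example.com/page",
    "https://example.com/other",
]
print(group_duplicates(urls))
```

In this example the first two URLs are grouped together, and each one would count as a separate URL if crawled. Note that some servers treat paths with different letter case as different pages, so a match here is a likely duplicate rather than a certain one.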
You may also be interested in the below guides, which are also in the ‘crawling’ section of our support documentation.