Custom Web Crawling Options
At Raptor we have recently added more functionality to our web crawler, allowing users to change crawl settings and options. This lets you create customised crawls for a number of reasons, with significant benefits. First, let's cover the settings themselves.
You will see these options when you:
- Add a project
- Edit the settings for a site
- Edit the settings for a project
- Add a new site to a project
The aim of these options is to give you custom control over what does and does not get crawled during any crawl for each site. Every site can have its own crawl options, set specifically to your needs for that site.
Benefits of Our Custom Web Crawler Options
Crawl Only What You Need To
Segment Site by region
Crawl Or Exclude Sub-Domains
Choose File Types to Crawl
Set Your Maximum Directory Depth
Website Crawl Options
The options look like this throughout the software, no matter where you are setting them:
They are controlled by switches that show as green when set to 'on'. In the example above (which shows the default settings), the option 'Crawl only starting directory' is set to on. The only exception at the moment is the maximum directory depth, which is set as a number between 1 and 100.
We cover these options in more detail in the sections below.
Crawl All Sub-Domains
Switching this 'on' means that any sub-domains we find during the crawl will be included. If your site has multiple sub-domains, we will crawl them all when this option is turned on.
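As a rough illustration, this kind of sub-domain matching boils down to a hostname check. The sketch below is purely illustrative and not part of the Raptor product; the function name is hypothetical:

```python
from urllib.parse import urlparse

def is_same_or_sub_domain(url: str, root_domain: str) -> bool:
    """Return True if the URL's host is the root domain or one of its sub-domains."""
    host = urlparse(url).hostname or ""
    return host == root_domain or host.endswith("." + root_domain)

# With 'Crawl All Sub-Domains' on, a URL like the first would be kept:
print(is_same_or_sub_domain("https://uk.example.com/page", "example.com"))  # True
print(is_same_or_sub_domain("https://www.other.com/page", "example.com"))   # False
```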
Crawl Only Starting Directory
This option is set to 'on' by default and means that we will not crawl any links found during the crawl that sit outside the starting directory. For instance, if you have a multi-regional site with the URL "example.com/uk/", we will only crawl URLs that sit within the UK directory. This means that if you had directories for other countries, such as "example.com/us/", we would not crawl those while this option is turned on.
It is also worth noting that if you enter a URL containing a sub-domain, such as "uk.example.com/", we will crawl that sub-domain and anything that sits within it, but would not crawl other sub-domains.
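Conceptually, restricting a crawl to the starting directory amounts to a path-prefix check against the URL you entered. The following is a minimal sketch of that idea, not Raptor's actual implementation, and the function name is hypothetical:

```python
from urllib.parse import urlparse

def within_starting_directory(url: str, start_url: str) -> bool:
    """Keep only URLs on the same host whose path sits under the starting directory."""
    start, found = urlparse(start_url), urlparse(url)
    return found.hostname == start.hostname and found.path.startswith(start.path)

# Starting the crawl at "example.com/uk/" keeps the first URL and drops the second:
print(within_starting_directory("https://example.com/uk/contact/", "https://example.com/uk/"))  # True
print(within_starting_directory("https://example.com/us/contact/", "https://example.com/uk/"))  # False
```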
Crawl Images
This option is turned off by default but can easily be switched on by clicking the switch icon. Once switched on, it instructs our web crawler to crawl any images we identify during the crawl. This can, however, be restricted by other options. For example, if your images sit on a sub-domain that is not part of the URL entered, and you have opted not to crawl all sub-domains or to crawl only the starting directory, we will not crawl those images.
For the most part, this option simply allows you to conserve your URL usage by not crawling images, which may not be relevant to you unless you need them for an SEO audit. We find that each page of a site often has at least one image, so a site of 500 web pages might have over 1,000 URLs once you include images.
If you are looking to optimise images for any reason, such as to target keywords more effectively or improve page load times, we suggest crawling them.
Crawl CSS Files
CSS, or Cascading Style Sheets, are files that control how a site looks. Often there are only a handful of CSS files on a site, although some sites can have hundreds; either way, they are rarely used in the auditing or SEO analysis of a website. As such, this option is set to 'off' by default, meaning we will not follow and crawl any URLs that are CSS files.
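File-type options like this one come down to filtering URLs by their extension. The snippet below is a simplified illustration of that idea under assumed names (`SKIP_EXTENSIONS` and `should_crawl` are hypothetical, not Raptor settings):

```python
from urllib.parse import urlparse

# Extensions to skip when the corresponding crawl options are 'off'.
SKIP_EXTENSIONS = {".css", ".js"}

def should_crawl(url: str, skip: set = SKIP_EXTENSIONS) -> bool:
    """Return False for URLs whose path ends in a skipped file extension."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in skip)

print(should_crawl("https://example.com/styles/main.css"))  # False
print(should_crawl("https://example.com/uk/contact/"))      # True
```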
Crawl JS Files
Render JS Pages
Maximum Directory Depth
This custom crawl option determines the maximum number of directories deep that the software will crawl. We set this to a default of 10 and a maximum of 100. For example, a URL such as "example.com/uk/shoes/" is two directories deep, while "example.com/uk/shoes/mens/trainers/running/" is five directories deep.
Typically, sites have fewer than 10 directory levels, but if you want to limit your crawl to, say, the category pages of a site, which you know sit in the third directory, you can do so by setting the number to three.
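One simple way to count directory depth is to count the non-empty segments in a URL's path (this sketch treats every path segment as a directory; the function name is hypothetical):

```python
from urllib.parse import urlparse

def directory_depth(url: str) -> int:
    """Count the non-empty path segments in a URL, e.g. /uk/shoes/ -> 2."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

print(directory_depth("https://example.com/uk/shoes/"))  # 2
print(directory_depth("https://example.com/"))           # 0
```

With a maximum depth of 3, a crawler would then skip any discovered URL whose depth exceeds 3.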
Additionally, if your site has a known problem with infinite loops, this option will stop a crawl from running indefinitely and using up all the URLs in your current billing cycle.
Project Management and Crawl Settings
You can add multiple main sites within a project; this functionality allows you to add multiple sites, or multiple directories within a single TLD. The example below shows how you could structure this, with each regional version of a site added separately. Using the 'Crawl only starting directory' option, each regional site will be crawled without crawling the other regions.
You can give each site a name that matches the region or language, or, if your site is divided into sub-domains to target regions, you can set it up that way to get the same result. Equally, if you have a site with multiple unrelated product ranges segmented into different directories, you can add them separately like this.
This functionality allows for a range of different real-world applications depending on your needs.
Adding New Sites
You can always add new sites or new variations of a site to a project, whether they are competitors or main sites. When doing this you can set the crawl option for each site as you add them in, the screenshot below shows how this looks when adding either main or competitor sites:
Site-Level Crawl Options
All crawl settings are set at the 'site level', meaning that each site can have a unique set of crawl options. These can be changed at any point before a crawl is performed, by clicking the site settings icon within a project or the project settings icon from the home page of the software.
The screenshot below shows what this icon looks like throughout the software; clicking it lets you change the crawl options for a crawl. Bear in mind that if you change these options, comparing historical crawl data could produce odd results.
If you were to crawl an entire site, including all sub-domains, and compared this to a crawl of a specific directory, the data could be radically different. That may be exactly what you want to look at, but it is typically a good idea to keep settings consistent for any one site. This helps when comparing or analysing crawl data over time in any meaningful way.
How Does This Benefit You?
At Raptor we are customer-centric and are always looking for ways to improve the experience of using our SEO tools. Our primary focus is on giving our customers the functionality they need while improving the experience. We also want to ensure that our customers are not spending more money than they need to in order to use our services.
These options allow you to limit crawls or open them up as wide as possible. If keeping your usage low is important to you, these crawl options let you limit the crawl to specific areas of a site or specific types of files within it. This saves your crawl budget and allows you to allocate URLs where they are needed.
We are continually adding new crawl options to improve the functionality of our software and the user experience it provides. So check back regularly for updates and new releases!