The SANDOBA//CRAWLER is a software-as-a-service (SaaS) offering that automatically follows links on our customers' websites to continuously verify that these links still work and point to the intended content. To do this, the SANDOBA//CRAWLER checks, among other things, whether the URL behind a link can be reached without a redirect and with good performance, analyzes the content of the domain's main page or a specific subpage to classify its topic, and determines the URLs of relevant subpages such as the imprint or "About Us" page, for example to detect that the linked website has been taken over by a new operator.
In contrast to the robots of search engine providers or specialized tools for competitor analysis and search engine optimization (SEO), the SANDOBA//CRAWLER retrieves only a few pages of a website. As a rule, it requests only the main page (https://www.example.com/), the robots.txt file, XML sitemaps, and, where available, individual pages linked there that identify the operator of the website. Only the compressed HTML output of these URLs is retrieved, so the load on the website is minimal and lower than that of a single normal visitor. Like any visitor, the crawler only has access to the publicly visible information on the website.
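The following minimal sketch, written in Python with the requests library, illustrates what such a single-page check could look like: it fetches only the compressed HTML of one URL, reports rather than follows redirects, and records the status code and response time. The function name, timeout, and header values are assumptions for this illustration, not the actual implementation.

import requests

# Illustrative single-page check: fetch only the compressed HTML of one URL,
# report (but do not follow) redirects, and measure the response time.
CRAWLER_UA = "mozilla/5.0 (compatible; SandobaCrawler/1.0; +https://www.sandoba.com/en/crawler/)"

def check_url(url):
    response = requests.get(
        url,
        headers={
            "User-Agent": CRAWLER_UA,   # user agent as documented below
            "Accept-Encoding": "gzip",  # request compressed output only
        },
        allow_redirects=False,          # a redirect is a finding, not something to follow
        timeout=10,
    )
    return {
        "url": url,
        "status": response.status_code,
        "redirected": response.is_redirect,
        "location": response.headers.get("Location"),
        "elapsed_seconds": response.elapsed.total_seconds(),
    }

print(check_url("https://www.example.com/"))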
The result of a scan is cached for some time (to prevent further queries during this period), and detected errors or problems are made available to our clients in various formats so that broken links and similar issues can be corrected.
There is no fixed schedule determining when, whether, or how often a specific web page is retrieved. This depends on the crawling requests of our customers, who are free to decide on the type, scope, and frequency of retrievals under our SaaS offering. However, because the results for a web page are cached, we ensure that, as a general rule, at most one run per day takes place.
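As a minimal sketch of this behaviour, assuming Python, a cache entry with a lifetime of one day is enough to guarantee at most one crawl per page and day; the names and the in-memory dictionary are illustrative, and the real service may use a shared cache instead.

import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # at most one crawl per page and day
_cache = {}                       # url -> (timestamp, result)

def cached_scan(url, scan):
    """Return a cached result if it is younger than the TTL, otherwise re-scan."""
    now = time.time()
    entry = _cache.get(url)
    if entry and now - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]           # reuse today's result, no new request
    result = scan(url)            # e.g. the check_url() sketch above
    _cache[url] = (now, result)
    return result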
The SANDOBA//CRAWLER identifies itself with the following user agents when retrieving the URLs:
mozilla/5.0 (compatible; SandobaCrawler/1.0; +https://www.sandoba.com/en/crawler/)
mozilla/5.0 (compatible; SandobaCrawler/1.0; +https://www.sandoba.com/de/crawler/)
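If you operate a website and want to see whether the SANDOBA//CRAWLER has visited it, you can search your server's access log for the user agent token. The following Python sketch assumes an nginx-style log file at a hypothetical path.

import re

# Match the SandobaCrawler token regardless of version number.
UA_PATTERN = re.compile(r"SandobaCrawler/\d+\.\d+")

with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if UA_PATTERN.search(line):
            print(line.rstrip())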
If you do not want the SANDOBA//CRAWLER to query any further subpages of your website, you can add the following entry to your website's robots.txt:
User-agent: SandobaCrawler
Disallow: /
The next time your website, including its robots.txt file, is retrieved, this entry will be recognized and your website will be placed on a blacklist. From time to time, however, we may check whether the robots.txt still contains the entry, so that web pages can be queried again if this is desired.
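The effect of this entry can be illustrated with Python's standard robots.txt parser: once the entry above is in place, every path of the site is reported as disallowed for the SandobaCrawler user agent, and a later re-read of the same file would show when the block has been lifted. The domain below is a placeholder.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt

# With "User-agent: SandobaCrawler" / "Disallow: /" present, both calls return False.
print(parser.can_fetch("SandobaCrawler", "https://www.example.com/"))
print(parser.can_fetch("SandobaCrawler", "https://www.example.com/about/"))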
The SANDOBA//CRAWLER has only been active since 2022 and is constantly evolving. This applies to the functionality of our crawling infrastructure, the speed and frequency of queries, and the features of the associated SaaS offering. If you would like to learn more, please keep an eye on this page or send an email to crawler@sandoba.com.