A powerful C# web crawler that makes advanced crawling features easy to use.
- Crawl multiple sites concurrently
- Pause/resume live crawls
- Simplified pluggability/extensibility
- Avoid getting blocked by sites
- Automatically tune speed/concurrency
Parallel Crawler Engine
A single crawler instance can crawl one site quickly. However, if you need to crawl 10,000 sites quickly, you need the ParallelCrawlerEngine. It lets you crawl a configurable number of sites concurrently to maximize throughput.
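A rough sketch of what starting a parallel crawl might look like. The type and member names used here (CrawlConfigurationX, ParallelCrawlerEngine, MaxConcurrentSiteCrawls, AllCrawlsCompleted, StartAsync) are assumptions for illustration, not a definitive API; check the library's documentation for the exact names and signatures.

```csharp
// Illustrative sketch only -- type/member names are assumptions.
var config = new CrawlConfigurationX
{
    MaxConcurrentSiteCrawls = 3   // crawl up to 3 sites at the same time
};

var crawlEngine = new ParallelCrawlerEngine(config);

// Fires once every queued site has finished crawling.
crawlEngine.AllCrawlsCompleted += (sender, e) =>
    Console.WriteLine("All crawls completed");

await crawlEngine.StartAsync();
```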
Easy Override
Easy Override lets you plug in your own implementation of any key interface through a simple object wrapper that handles nested dependencies for you, no matter how deep.
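A minimal sketch of the idea: wrap the configuration in an override container, assign your custom implementation, and the wrapper wires up any nested dependencies. The ImplementationOverride and PageRequester names, and the MyCustomPageRequester class, are hypothetical placeholders; consult the library docs for the real override points.

```csharp
// Hypothetical sketch -- names are placeholders, not the library's actual API.
var config = new CrawlConfigurationX();
var impls = new ImplementationOverride(config);

// Swap in your own implementation of a key interface; nested
// dependencies that consume it are resolved for you.
impls.PageRequester = new MyCustomPageRequester();

var crawler = new CrawlerX(config, impls);
```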
Pause And Resume
There may be times when you need to temporarily pause a crawl, for example to clear disk space on the machine or to run a resource-intensive utility. Whatever the reason, you can confidently Pause and Resume the crawler and it will continue on like nothing happened.
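The flow described above might look roughly like this. The CrawlerX type and the CrawlAsync, Pause, and Resume method names are assumptions based on the feature description, not confirmed signatures; verify them against the library's documentation.

```csharp
// Illustrative sketch -- method names are assumptions.
var crawler = new CrawlerX();
var crawlTask = crawler.CrawlAsync(new Uri("http://example.com"));

crawler.Pause();   // no new pages are fetched while paused

// ...clear disk space or run your resource-intensive utility...

crawler.Resume();  // the crawl picks up right where it left off

await crawlTask;
```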
Auto Throttling
Most websites you crawl cannot or will not handle the load of a web crawler. Auto Throttling automatically slows down the crawl speed if the website being crawled is showing signs of stress or an unwillingness to respond to the frequency of HTTP requests.
Auto Tuning
It's difficult to predict what your machine can handle when the sites you crawl and process all require different levels of machine resources. Auto Tuning automatically monitors the host machine's resource usage and adjusts the crawl speed and concurrency to maximize throughput without overrunning it.
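A hypothetical configuration sketch showing how these two features fit together: enable each one and set the thresholds that trigger it. The property names below (AutoThrottling, AutoTuning, IsEnabled, CpuThresholdHigh) are illustrative assumptions, not the library's actual configuration keys.

```csharp
// Hypothetical configuration -- property names are illustrative only.
var config = new CrawlConfigurationX();

config.AutoThrottling.IsEnabled = true;   // slow down when the target site shows stress
config.AutoTuning.IsEnabled = true;       // adapt speed/concurrency to host resources
config.AutoTuning.CpuThresholdHigh = 85;  // e.g. back off above 85% CPU (illustrative)

var crawler = new CrawlerX(config);
```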