Web Crawler
Intended purpose: To travel the web by following links, returning either information about the pages visited or the pages themselves. Web crawlers are also known as spiders and bots. Remote site mappers and link checkers are special-purpose kinds of web crawlers. Crawlers that return HTML pages and the other files associated with web pages can be used to mirror sites for archival purposes. As they download pages, they can also make the characteristics of the pages and files available for analysis. Search engines are the most frequently encountered web crawlers.
Potential for remote management: If one's strategy for remote management includes capturing and archiving at-risk web sites, a web-capturing crawler is an essential tool.
Why do we care? Of the possible responses to web sites being at risk, capturing and preserving those sites at another location is among the most active.
Characteristics? All web crawlers are seeded with one or more URLs.
They request the file each URL points to, then parse the returned page,
looking for more links. They follow those links in some order, continuing
to request pages and extract more links. This process is fundamental to
many types of tools, including link checkers, search engines, web-capture
tools, and site-mapping and visualization tools.
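
The seed, request, parse, and follow cycle described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the method of any particular tool. The names crawl, fetch, LinkExtractor, and max_pages are made up for this example, and breadth-first order is just one of the possible "some order" choices.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order."""
    queue = deque(seeds)   # URLs waiting to be requested
    seen = set(seeds)      # URLs already queued, so none is requested twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

A call such as crawl(["https://example.org/"], max_pages=10) would return the list of pages visited. A link checker, a search indexer, or a capture tool would differ mainly in what it does with each page after fetching it.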
Web-capturing crawlers use the downloaded pages to create indexes for
searching or save the pages and associated files to disk. Other kinds of
analysis are possible, such as the kind that creates site maps.
Relationship to other categories: (should this category be used
in conjunction with another category?)
Web crawling is the fundamental process by which all remote site-investigation tools work. Some crawlers are designed to download the pages of a site, perhaps saving them in a directory structure similar to that of the remote host; these are the crawlers we have segregated into this category. If you intend to capture pages for archiving, this kind of web crawler is what you need. The capturing function works without the aid of other crawling functions, although some web-site capture tools make their native link-checking function available, too.
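
To illustrate what saving pages "in a directory structure similar to that of the remote host" might involve, here is a rough Python sketch that maps a URL to a local file path and writes the page there. The function names, the "mirror" root directory, and the choice of "index.html" for directory-style URLs are assumptions made for this example, not the behavior of any specific capture tool.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path_for(url, root="mirror"):
    """Map a URL to a relative path that mirrors the remote layout."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-style URLs get a default file name (an assumption here).
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so no URL can escape the root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    host = parts.hostname or "unknown-host"
    return str(PurePosixPath(root, host, *segments))


def save_page(url, body, root="mirror"):
    """Write the downloaded bytes to the mirrored location on disk."""
    target = Path(local_path_for(url, root))
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target
```

This sketch ignores query strings and does not handle a name being used as both a file and a directory (for example /docs and /docs/page), which a real capture tool would need to resolve.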
High or Low impact: Can be either, depending on the way the crawling function is configured. The “politeness” factor—how many times per second the crawler engine visits a site, and how many bytes the crawler engine will accept from the host server each second—is user-configurable on the web-capturing crawlers we have seen.
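
As a rough illustration of the two politeness settings mentioned above, the following Python sketch spaces out requests so that both a requests-per-second limit and a bytes-per-second limit hold on average. The names polite_delay and Throttle and their parameters are invented for this example. Real crawlers expose these settings in their own ways and may enforce them differently, for instance by slowing the download itself.

```python
import time


def polite_delay(bytes_received, max_requests_per_second, max_bytes_per_second):
    """Seconds between request starts so both limits hold on average."""
    return max(1.0 / max_requests_per_second,
               bytes_received / max_bytes_per_second)


class Throttle:
    """Delays each request to one host according to the last download."""

    def __init__(self, max_requests_per_second, max_bytes_per_second,
                 clock=time.monotonic, sleep=time.sleep):
        self._rps = max_requests_per_second
        self._bps = max_bytes_per_second
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last_start = None
        self._next_start = None

    def wait(self):
        """Call before each request; sleeps until the host may be visited."""
        if self._next_start is not None:
            remaining = self._next_start - self._clock()
            if remaining > 0:
                self._sleep(remaining)
        self._last_start = self._clock()

    def done(self, bytes_received):
        """Call after each response with the number of bytes downloaded."""
        self._next_start = self._last_start + polite_delay(
            bytes_received, self._rps, self._bps)
```

In use, a crawler would call wait() before each request and done(len(body)) after it. Lower limits make the crawl lower impact on the host at the cost of a slower capture.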
Can these tools be used on other sites or only on your own site? The purpose of this category of tools is to visit other sites.
