Virtual Remote Control: Test Site
This funded project has ended, and this web site is no longer active or being updated. It is being retained for historical purposes only.
Links to published papers describing the background of this project appear at the bottom of the VRC home page.

Test Site Introduction
What is this site and what is it for?

This site is a "control site," in the sense that it provides a stable standard of comparison for use in testing and experimentation on websites. Every aspect of the site has been documented to provide a stable, known environment for testing a variety of website analysis tools. As currently configured, the site may be appropriate for testing web crawlers, link checkers, HTML validators, site mappers and visualizers, and site managers. It may not be appropriate for testing change detection and monitoring tools whose primary purpose is to report modifications to a site's content or coding, or tools designed to measure the response time of a site.

The Cornell University Library Research Department has tested web crawlers and other tools by pointing them at various sites around the web. We got results, of course: lots and lots of pages, URLs, and images. But we were never sure that our results were complete and accurate. We also couldn't be sure the crawlers were doing what we expected them to do. We decided to create a site whose contents were completely known to us so we could use it as a control for our crawls.

The site contains HTML pages, images, multimedia files, scripts, deep directory structures, broken links and other kinds of errors, restricted pages, and more. Every page, every link, and every resource is documented in a large XML file, documentation.xml, which is available for examination. For human readers, an HTML version of the documentation is available. There are also links to machine-readable transformations of the documentation, such as a text-only list of URIs for every element.

The sites we've crawled varied greatly in size. Our site has been streamlined to the point where it has only a couple of hundred elements. Even that number is too large for us to browse by hand, looking for discrepancies between what the crawlers find and what we know is on our site.
Instead, we use the XML documentation file to help automate analysis of the tools' output. When we test a website analysis tool, we feed it this test site's URL and turn on the tool's logging feature. When the tool finishes its job, we strip the log file of any information that isn't pertinent to our test. Then, using XSL, we transform our documentation XML into the same format as the stripped-down log file. Once the two text files are in the same format, we use standard UNIX tools, such as diff, sort, and uniq, to determine the quality of the tool's results.

An Example

Here's an example using wget, an open source tool for retrieving remote files. We started our test crawl from the UNIX command line with the command:

>wget -k -o testsiteCrawl.log -r -l 100 http://prism.library.cornell.edu/control
The meaning of the options:

-k  convert links in the downloaded documents so they work for local viewing
-o testsiteCrawl.log  write wget's log messages to testsiteCrawl.log
-r  retrieve recursively, following links
-l 100  follow links to a maximum depth of 100 levels

Using a Perl script, we extracted the URLs from the wget log file:

>getPageListFromWgetLog -s testsiteCrawl.log > testsiteCrawl.filelist

We used an XSL transformation of the documentation.xml file to produce a list of files on the test site:

>java org.apache.xalan.xslt.Process -in documentation.xml -xsl url.xsl -out testsite.filelist

Comparing the two file lists

We sorted both lists:

>sort testsiteCrawl.filelist > testsiteCrawl.filelist.sorted
>sort testsite.filelist > testsite.filelist.sorted

We counted the lines in each:
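The comparison workflow described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's own scripts: the example.org URLs, the sample log lines, and the file names crawl.log, expected.list, missed.list, and unexpected.list are all hypothetical. The sketch uses grep -o (a GNU/BSD extension) in place of the Perl script, and comm, a standard UNIX tool that the text above does not name, to separate the two kinds of discrepancy.

```shell
#!/bin/sh
# Illustrative sketch: all file names and sample URLs are hypothetical.
set -eu

# Use byte-order collation so that sort and comm agree on ordering.
LC_ALL=C
export LC_ALL

# Stand-in for a crawler log (loosely modeled on wget output; real logs vary).
cat > crawl.log <<'EOF'
--12:00:01--  http://example.org/control/index.html
HTTP request sent, awaiting response... 200 OK
--12:00:02--  http://example.org/control/images/logo.gif
HTTP request sent, awaiting response... 200 OK
--12:00:03--  http://example.org/control/index.html
--12:00:04--  http://example.org/control/extra.html
EOF

# Stand-in for the URL list generated from the site documentation.
cat > expected.list <<'EOF'
http://example.org/control/index.html
http://example.org/control/images/logo.gif
http://example.org/control/restricted/secret.html
EOF

# Strip the log down to its URLs, then sort and drop duplicates.
grep -oE 'https?://[^ ]+' crawl.log | sort -u > crawled.sorted
sort -u expected.list > expected.sorted

# comm requires sorted input.
# -23 keeps lines found only in the first file: documented but never crawled.
comm -23 expected.sorted crawled.sorted > missed.list
# -13 keeps lines found only in the second file: crawled but not documented.
comm -13 expected.sorted crawled.sorted > unexpected.list

echo "Documented but not crawled:"
cat missed.list
echo "Crawled but not documented:"
cat unexpected.list
```

With this sample data, missed.list holds the restricted page that the crawler never reached, and unexpected.list holds extra.html, which the crawler found but the documentation does not list. Two empty files would indicate that the crawl and the documentation agree.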
The files we used in the example
How can I use the Documentation?
How we are building the Test Site?
© IRIS Research Department, Cornell University Library, 2003
Send questions/comments/suggestions to vrc-testsite@cornell.edu