Virtual Remote Control:
Test Site

Test Site Introduction

What is this site and what is it for?

This site is a "control site": it provides a stable standard of comparison for testing and experimentation on websites. Every aspect of the site has been documented to provide a known environment for testing a variety of website analysis tools. As currently configured, the site may be appropriate for testing web crawlers, link checkers, HTML validators, site mappers and visualizers, and site managers. It may not be appropriate for testing change detection and monitoring tools whose primary purpose is to report modifications to a site's content or coding, or tools designed to measure the response time of a site.

The Cornell University Library Research Department has tested web crawlers and other tools by pointing them at various sites around the web. We got results, of course: lots and lots of pages, URLs, and images. But we were never sure that our results were complete and accurate, nor that the crawlers were doing what we expected them to do. We decided to create a site whose contents were completely known to us so we could use it as a control for our crawls.

The site contains HTML pages, images, multimedia files, scripts, deep directory structures, broken links and other kinds of errors, restricted pages, and more. Every page, link, and resource is documented in a single XML file, documentation.xml, which is available for examination. For human readers, an HTML version of the documentation is available. There are also links to machine-readable transformations of the documentation, such as a text-only list of URIs for every element.

The sites we've crawled varied greatly in size. Our site has been streamlined to the point where it has only a couple of hundred elements. Even that number is too large for us to browse by hand, looking for discrepancies between what the crawlers find and what we know is on our site. Instead, we use the XML documentation file to help automate analysis of the tools' output.

When we test a website analysis tool, we feed it this test site's URL and turn on the tool's logging feature. When the tool finishes its job, we strip the log file of any information that isn't pertinent to our test. Then, using XSL, we transform our documentation XML into the same format as the stripped-down log file. Once the two text files are in the same format, we use standard UNIX tools, such as diff, sort, and uniq, to determine the quality of the tool's results.
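
The comparison step described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not the scripts used for the tests on this site: the function name compare_lists is hypothetical, and it assumes both inputs are plain text files with one URL per line.

```shell
# compare_lists EXPECTED ACTUAL   (hypothetical helper, for illustration only)
# Sorts both URL lists and prints a diff of the sorted copies.
# Lines marked "<" appear only in EXPECTED (missed by the tool);
# lines marked ">" appear only in ACTUAL (extra hits reported by the tool).
# Returns diff's exit status: 0 if the lists match, 1 if they differ.
compare_lists() {
    expected_sorted=$(mktemp) || return 2
    actual_sorted=$(mktemp) || return 2
    sort "$1" > "$expected_sorted"
    sort "$2" > "$actual_sorted"
    diff "$expected_sorted" "$actual_sorted"
    status=$?
    rm -f "$expected_sorted" "$actual_sorted"
    return "$status"
}
```

It would be called with the list derived from the documentation first and the list derived from the tool's log second, for example: compare_lists testsite.filelist testsiteCrawl.filelist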

An Example

Here's an example using the open source, remote file capture tool, wget:

We started our test crawl from the Unix command line with the command:
>wget -k -o testsiteCrawl.log -r -l 100 http://prism.library.cornell.edu/control
The meaning of the options:
-k: converts links for local viewing, turning links to downloaded files into relative ones
-o testsiteCrawl.log: writes the progress of the crawl to the file testsiteCrawl.log
-r: turns on the recursive crawling feature of wget
-l 100: specifies recursive retrieval to a maximum depth of 100

Using a Perl script, we extracted the URLs from the wget log file:
>getPageListFromWgetLog -s testsiteCrawl.log > testsiteCrawl.filelist
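
The Perl script itself is linked further down rather than reproduced here. As a rough stand-in, the extraction might be sketched in shell as follows. The function name extract_urls is hypothetical, and the sketch assumes that each request in the wget log is announced by a line of the form "--<timestamp>--  <url>"; logs from other wget versions or configurations may need a different pattern.

```shell
# extract_urls LOGFILE   (hypothetical helper, not the original Perl script)
# Prints the URL from every request line in a wget log.
# Assumption: request lines look like "--<timestamp>--  <url>";
# all other log lines are ignored.
extract_urls() {
    sed -n 's|^--.*--  *\(https\{0,1\}://[^ ]*\).*$|\1|p' "$1"
}
```

For example: extract_urls testsiteCrawl.log > testsiteCrawl.filelist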

We used an XSL transformation of the documentation.xml file to produce a list of files on the test site:
>java org.apache.xalan.xslt.Process -in documentation.xml -xsl url.xsl -out testsite.filelist

Comparing the two file lists

We sorted both lists:
>sort testsiteCrawl.filelist > testsiteCrawl.filelist.sorted
>sort testsite.filelist > testsite.filelist.sorted

We counted the lines in each:
>cat testsiteCrawl.filelist | wc -l (136 lines)
>cat testsite.filelist | wc -l (135 lines)


We compared the two to find the extra line:
>cmp testsite.filelist.sorted testsiteCrawl.filelist.sorted
testsite.filelist.sorted testsiteCrawl.filelist.sorted differ: char 6185, line 53

Line 53 of the file list generated from the site documentation is:
http://prism.library.cornell.edu/control/authBasic/authTest/authTestPage2.html
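
One way to inspect the lines around a difference reported by cmp is to print a small window of each sorted file. The helper below is a sketch only: show_region is a hypothetical name, and the default of three lines of context is an arbitrary choice.

```shell
# show_region FILE LINE [CONTEXT]   (hypothetical helper, for illustration)
# Prints LINE and up to CONTEXT lines on either side of it (default 3).
show_region() {
    context=${3:-3}
    start=$(( $2 - context ))
    # Clamp the window so it never starts before the first line.
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$(( $2 + context ))
    sed -n "${start},${end}p" "$1"
}
```

For example, show_region testsite.filelist.sorted 53 and show_region testsiteCrawl.filelist.sorted 53 print the same region of both lists for side-by-side reading.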

Looking at this region of the file, we see that the crawler didn't return
http://prism.library.cornell.edu/control/authBasic/authTest/authTestPage2.html or http://prism.library.cornell.edu/control/authBasic/authTest/index.html. Instead, it hit the .../control/authBasic/index.html file twice. When we look at that page in a web browser, we see that it is the entry point to a password-protected section of the site. When we alter the lines around line 53 in the test site file list to match the crawl file list and compare them again, we find no further discrepancies.
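
Missing and repeated URLs of this kind can also be listed directly, without editing either file by hand. The following is a sketch using the standard comm and uniq utilities, not the procedure we followed above. The function names are hypothetical, and both functions assume their inputs have already been sorted with sort.

```shell
# missing_urls EXPECTED_SORTED ACTUAL_SORTED   (hypothetical helper)
# Prints URLs present in the first sorted list but absent from the second,
# i.e. documented files that the tool did not report.
missing_urls() {
    comm -23 "$1" "$2"
}

# duplicate_urls SORTED_LIST   (hypothetical helper)
# Prints each URL that occurs more than once in a sorted list,
# i.e. pages that the tool reported hitting repeatedly.
duplicate_urls() {
    uniq -d "$1"
}
```

For example: missing_urls testsite.filelist.sorted testsiteCrawl.filelist.sorted lists documented files the crawl did not return, and duplicate_urls testsiteCrawl.filelist.sorted lists pages the crawl reported more than once.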

Our conclusion is that the configuration we used with wget captures all the files on the test site except for those available through a form. Furthermore, it doesn't report extra hits on any pages except for the password-prompting page.

The files we used in the example

The wget log
The Perl script we used to extract captured file names from the wget log
The XML file: a list of all the files on the site at the time we created this example crawl (delivered as an HTML file)
The XSL file we used to transform the documentation

How can I use the Documentation?

To find a specific feature, open the HTML version of the documentation file. Search (Control-F in most browsers) for the name of the feature you're looking for. Click on the link to the page exemplifying the feature.

How are we building the Test Site?
We are planning to include an example or two of every type of element found on websites. As we discover more, we will add more. Whenever we add something new, we also add it to the documentation file.
Refer to the Protocol for Creating XHTML Files.


© IRIS Research Department, Cornell University Library, 2003
Send questions/comments/suggestions to vrc-testsite@cornell.edu