VRC Monitoring Levels
All six VRC stages may require or respond to information gathered at any of the context levels of the Web Resource Context Layer model (see Figure).
Data gathering first examines one or more elements at each level of context individually (for example, the HTTP header fields returned per page, or site size measured as the number of pages per site) to develop simple risk declarations: potential indicators of risk that are validated and refined through iterative data gathering.
A single indicator at one level of context may identify preservation risks, but the comprehensive and evolving set of potential risks will be construed from ongoing analysis of combinations of indicators at one or more of the levels. Risk scenarios can be constructed based on this analysis. For example, a Web site that otherwise rates high on indicators of sound site management practice may be at risk because it uses an outdated version of a Web server that may not be secure.
Figure: VRC Monitoring Levels. Level 1: a Web page as a stand-alone object, ignoring its hyperlinks.
Page-level Monitoring
Based on our experience crawling various collections, we defined a page
profile that a crawler can populate to create the baseline needed at the
evaluation stage. Monitoring against that baseline then enables the detection
of risk-relevant change over time. A set of questions
should be addressed when extracting data from the temporal iterations of page
profiles.
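As an illustration of what such a page profile might look like, the following is a minimal sketch using only the Python standard library. The class `PageProfile`, its field names, and the helpers `build_profile` and `diff_profiles` are hypothetical and chosen for this example; they are not the project's actual profile definition.

```python
import hashlib
from dataclasses import dataclass, fields
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


@dataclass
class PageProfile:
    """Hypothetical page profile; field names are illustrative assumptions."""
    url: str
    content_hash: str = ""
    text_chars: int = 0
    internal_links: int = 0
    external_links: int = 0
    forms: int = 0
    scripts: int = 0


class _ProfileParser(HTMLParser):
    """Counts links, forms, scripts, and visible text while parsing HTML."""

    def __init__(self, profile):
        super().__init__()
        self.profile = profile
        self._host = urlparse(profile.url).netloc
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urlparse(urljoin(self.profile.url, href))
                if target.netloc == self._host:
                    self.profile.internal_links += 1
                else:
                    self.profile.external_links += 1
        elif tag == "form":
            self.profile.forms += 1
        elif tag == "script":
            self.profile.scripts += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Script and style bodies are not counted as page text.
        if not self._skip:
            self.profile.text_chars += len(data.strip())


def build_profile(url, html):
    """Populate a profile from one crawled copy of a page."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    profile = PageProfile(url=url, content_hash=digest)
    parser = _ProfileParser(profile)
    parser.feed(html)
    parser.close()
    return profile


def diff_profiles(old, new):
    """Return {field: (old_value, new_value)} for fields that changed."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    }
```

The first profile built for a page would serve as the baseline; calling `diff_profiles` on profiles from later crawls gives one simple view of change between temporal iterations.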
Evaluating pages within a single site, across a range of sites, and over time provides a wealth of information. Important page-level risk indicators include:
- integrity: HTML formatting and document structure, presence of broken links, 404 incidence
- compliance: conformance to specifications such as those for markup languages, data formats, and metadata standards
- change management: from incremental to radical change; too much versus too little change
- link structure: number and nature of links within and out of the site's server
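Two of the indicators above, 404 incidence and change management, lend themselves to simple numeric measures. The sketch below is one possible formulation. The function names and the `low` and `high` thresholds are assumptions made for illustration, not values derived from the project's data.

```python
def broken_link_rate(statuses):
    """Fraction of checked links whose HTTP status was 404."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status == 404) / len(statuses)


def change_rate(hashes):
    """Fraction of consecutive crawl pairs in which the content hash changed."""
    if len(hashes) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(hashes, hashes[1:]) if a != b)
    return changes / (len(hashes) - 1)


def flag_change(rate, low=0.05, high=0.8):
    """Label a change rate; the thresholds are placeholder assumptions."""
    if rate < low:
        return "too little change"
    if rate > high:
        return "too much change"
    return "expected change"


# Five crawls of one page, with content changing twice.
print(flag_change(change_rate(["a", "a", "b", "b", "c"])))
```

In practice the thresholds would presumably depend on the page type, since a slowly changing informational page and a frequently updated directional page warrant different expectations.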
The page-level information could also serve as a basis for devising the risk–response pairs envisioned at the start of our project and for creating a basic typology of Web pages. We are defining relationships between the structural characteristics of a page (lots of text, lots of links, etc.) and the typical roles of pages (informational, directional, portal access), and then linking those roles to the probable rate of change. Confidence in this method can only be built through the evaluation of many examples. The table below illustrates the typology we imagine. Our results suggest that these types are likely, but we cannot yet substantiate these examples. We may define between 6 and 10 page types.
Basic Typology of Web Pages

| Page Type | Characteristics | Rate of Change |
|---|---|---|
| Informational | mostly text | moderate to slow |
| Directional | mostly links: internal (most navigational), external (most informational) | moderate; perhaps slower, often higher |
| Portal Access (deep Web) | dynamic indicators (forms, scripts, etc.) | page itself: often slow; underlying: high |
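To make the typology concrete, here is a rough sketch of a rule-based classifier over the three types in the table. The function `classify_page`, its parameters, and the `chars_per_link` threshold are assumptions for illustration only; the source states that the typology is not yet substantiated, so any real rules would need to be derived from many evaluated examples.

```python
def classify_page(text_chars, links, dynamic_indicators, chars_per_link=50):
    """Assign one of the three illustrative page types.

    text_chars          -- amount of visible text on the page
    links               -- total number of hyperlinks (internal + external)
    dynamic_indicators  -- count of forms, scripts, and similar features
    chars_per_link      -- placeholder threshold (an assumption)
    """
    if dynamic_indicators > 0:
        return "Portal Access"
    if links > 0 and text_chars / links < chars_per_link:
        return "Directional"
    return "Informational"


print(classify_page(text_chars=5000, links=10, dynamic_indicators=0))
print(classify_page(text_chars=300, links=40, dynamic_indicators=0))
```

Treating any dynamic indicator as decisive is a deliberate simplification; many ordinary pages contain scripts, so a usable rule would likely need to weigh forms and scripts differently.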
Site-level Monitoring
Using the page types, the next step is to construct models of Web sites
that reflect the basic structure of the sites—the size, the complexity,
the distribution of page types within the site and across the directories
of the site. It should be possible to determine rates of change at or below
the site level and to schedule crawls accordingly, i.e., to crawl some
directories more frequently than others based upon probable rates of change
and the potential value of the content as suggested by
page type. We have seen the potential in combining the functionality
of different types of tools to visually display the types of pages, the
rates of change over time, and the structure of the sites in ways that will
make the information more meaningful when assessing risk.
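A site model of this kind can be sketched as a tally of page types per directory, from which a crawl interval is chosen. In the example below, `directory_profile`, `crawl_interval`, and the day counts in `CRAWL_INTERVAL_DAYS` are hypothetical names and placeholder values, not figures from the project.

```python
import posixpath
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Placeholder crawl intervals per page type, in days (assumptions).
CRAWL_INTERVAL_DAYS = {
    "Informational": 30,
    "Directional": 14,
    "Portal Access": 7,
}
DEFAULT_INTERVAL_DAYS = 30


def directory_profile(pages):
    """Map each directory to a Counter of the page types found in it.

    pages -- iterable of (url, page_type) pairs
    """
    distribution = defaultdict(Counter)
    for url, page_type in pages:
        directory = posixpath.dirname(urlparse(url).path) or "/"
        distribution[directory][page_type] += 1
    return dict(distribution)


def crawl_interval(type_counts):
    """Use the shortest interval among the page types present."""
    return min(
        CRAWL_INTERVAL_DAYS.get(page_type, DEFAULT_INTERVAL_DAYS)
        for page_type in type_counts
    )


pages = [
    ("http://example.org/index.html", "Directional"),
    ("http://example.org/news/a.html", "Informational"),
    ("http://example.org/news/b.html", "Informational"),
]
for directory, counts in directory_profile(pages).items():
    print(directory, dict(counts), crawl_interval(counts))
```

Taking the shortest interval is a conservative choice: a directory is crawled as often as its fastest-changing page type requires.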
Server-level Monitoring
The environment in which a Web site runs can impact the site's reliability
and risk profile (e.g., Web server software and hardware vulnerabilities
to data loss and network disruption). Though such details are generally
hidden from remote probing, some characteristics that would enrich a site's
risk profile over time are available. Anything that can be learned is useful,
since problems affecting high-level components like networks and servers
can be catastrophic and may pose a threat to multiple sites.
Software utilities can determine whether a machine is reachable on the network and whether the Web server is responsive. Frequent outages and lengthy downtime may be indicative of poor maintenance. In some cases, specific information about the server software in use is available. Server vulnerabilities put site content at risk of deletion or unauthorized modification. (See, for example, our study on the uptake of a key server software security patch by over 100 test sites.)
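The checks described above can be approximated with the Python standard library. This is a sketch under stated assumptions: the helper names are hypothetical, the `Server` response header is optional and may be absent or deliberately altered, and the minimum version passed to `is_outdated` is whatever the monitoring party chooses.

```python
import http.client
import socket


def is_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def server_banner(host, port=80, timeout=5.0):
    """Return the Server header from a HEAD request, or None."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().getheader("Server")
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def parse_banner(banner):
    """Split a banner such as 'Apache/1.3.26 (Unix)' into name and version.

    The version is a tuple of ints, or () if none could be parsed.
    """
    tokens = banner.split()
    if not tokens:
        return "", ()
    name, _, version = tokens[0].partition("/")
    parts = tuple(int(p) for p in version.split(".") if p.isdigit())
    return name, parts


def is_outdated(banner, minimum):
    """True if the banner reports a version lower than `minimum`."""
    _, version = parse_banner(banner)
    return bool(version) and version < minimum
```

For example, `is_outdated("Apache/1.3.26 (Unix)", (1, 3, 27))` evaluates to `True`. Recording `is_reachable` results over repeated probes would give a simple outage history for a site's risk profile.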
Administrative Level
Work on the external administrative and environmental layers requires agreements
with potential monitoring organizations and test sites. For example, monitoring
at the page level will allow us to identify indicators of the deep Web and
track those indicators over time, but monitoring deep Web content presumes
explicit permission as well as appropriate techniques.
Furthermore, it is important to distinguish between Web site management (Figure Layers 3 and 4) and physical site management (Figure Layer 5). The former entails managing site content and operations within a server’s software environment, primarily dealing with issues such as use of good design principles, adherence to standards, and software maintenance. Physical site management pertains to managing a server within a physical setting. It encompasses issues such as the degree to which hardware is secured from malicious acts (theft or vandalism); the use and maintenance of temperature, humidity, flood, and fire controls and alarms; the regularity of data backup; and redundancy measures. A site may have superb Web site management but sloppy physical site management, and therefore be extremely vulnerable.

