Data Extraction
VRC data extraction/analysis questions
November 3, 2003 (revised February 2004)
Page-level
- What are the distinct attributes of individual pages?
- What are the most common combinations of attributes on pages (types of pages)?
- Does the frequency of page types change by top domain? Other?
- HTTP Headers:
  - What is the frequency of headers for test sites?
  - Does the frequency of headers change by top domain? Other?
  - Do headers change within sites? Number? Type? Format?
- Meta tags:
  - What tags are used? Frequency?
  - Are they well-formatted? Do they use current standards?
- Dynamic indicators:
  - What indicators of dynamic behavior are used?
  - What is the frequency of use?
  - Does type or frequency change by top domain? Other?
- MIME Types:
  - What MIME types occur?
  - With what frequency?
  - Are there issues with MIME types?
- HTML or other markup language (ML):
  - What ML is used? Is it the current version?
  - What is the frequency of non-HTML MLs?
  - Does the frequency change by top domain? Other?
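
The header and MIME type frequency questions above could be answered with a simple tally over crawl records. The following is a minimal sketch only, assuming each record is a (URL, response headers) pair; the record layout, the function names, and the sample data are hypothetical and not part of the VRC tools. "Top domain" is taken here to mean the last label of the host name.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def top_domain(url):
    """Return the last label of the host name, e.g. 'edu' for www.example.edu."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def header_frequency_by_top_domain(pages):
    """Count, per top domain, how many pages sent each HTTP header name."""
    counts = defaultdict(Counter)
    for url, headers in pages:
        # Header names are case-insensitive, so normalize before counting.
        names = {name.lower() for name in headers}
        counts[top_domain(url)].update(names)
    return counts


def mime_type_frequency(pages):
    """Count MIME types, ignoring parameters such as '; charset=...'."""
    counts = Counter()
    for _url, headers in pages:
        lowered = {name.lower(): value for name, value in headers.items()}
        value = lowered.get("content-type", "")
        mime = value.split(";", 1)[0].strip().lower() or "(missing)"
        counts[mime] += 1
    return counts


# Hypothetical sample records: (URL, response headers).
SAMPLE_PAGES = [
    ("http://www.example.edu/",
     {"Content-Type": "text/html; charset=ISO-8859-1", "Server": "Apache"}),
    ("http://www.example.edu/a.pdf",
     {"Content-Type": "application/pdf",
      "Last-Modified": "Mon, 03 Nov 2003 12:00:00 GMT"}),
    ("http://www.example.org/", {"content-type": "text/html"}),
]

if __name__ == "__main__":
    for domain, counter in header_frequency_by_top_domain(SAMPLE_PAGES).items():
        print(domain, dict(counter))
    print(dict(mime_type_frequency(SAMPLE_PAGES)))
```

Pages with a missing or empty Content-Type header are counted under "(missing)" rather than dropped, since their frequency is itself one of the possible MIME type issues.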
Pages with links
- What is the frequency of links within sites (elinks, ilinks)?
- What is the number of links per page (min, max, mean)?
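
One way to count links per page is sketched below. It assumes that "ilinks" means links whose target is on the same host as the page and "elinks" means links to any other host; that reading, the class and function names, and the sample pages are assumptions made for illustration.

```python
from html.parser import HTMLParser
from statistics import mean
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_links(page_url, html):
    """Return (ilinks, elinks) for one page, comparing host names."""
    parser = LinkCollector()
    parser.feed(html)
    site_host = urlsplit(page_url).hostname
    ilinks = elinks = 0
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        target = urlsplit(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if target.hostname == site_host:
            ilinks += 1
        else:
            elinks += 1
    return ilinks, elinks


def link_summary(pages):
    """Return (min, max, mean) of the total number of links per page."""
    totals = [sum(count_links(url, html)) for url, html in pages]
    return min(totals), max(totals), mean(totals)


# Hypothetical sample pages: (URL, HTML source).
SAMPLE_PAGES = [
    ("http://www.example.edu/index.html",
     '<a href="about.html">About</a> '
     '<a href="http://www.example.org/">Other</a> '
     '<a href="mailto:x@example.edu">Mail</a>'),
    ("http://www.example.edu/about.html", '<a href="/">Home</a>'),
]

if __name__ == "__main__":
    print(link_summary(SAMPLE_PAGES))
```

Comparing full host names treats www.example.edu and library.example.edu as different sites; a study that defines a site more broadly would need a different comparison.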
Site level
- What is the frequency of page types per site?
- Does the frequency of types change by top domain? Other?
- What is the location of the library gateway?
- What access points to the deep web are there (see: dynamic indicators)?
- Site Profile: number of pages; size - total, average page size; web server software
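
The site profile item could be assembled from per-page crawl records along the following lines. This is a sketch under stated assumptions: each record is a hypothetical (URL, size in bytes, Server header value) triple, a site is identified by its host name, and the names used are illustrative only.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class SiteProfile:
    """Running totals for one site (identified here by host name)."""
    pages: int = 0
    total_bytes: int = 0
    servers: Counter = field(default_factory=Counter)

    @property
    def average_page_bytes(self):
        return self.total_bytes / self.pages if self.pages else 0.0


def build_site_profiles(records):
    """Aggregate (url, size_bytes, server) records into one profile per host."""
    profiles = defaultdict(SiteProfile)
    for url, size_bytes, server in records:
        host = urlsplit(url).hostname or "(unknown)"
        profile = profiles[host]
        profile.pages += 1
        profile.total_bytes += size_bytes
        if server:  # the Server header is optional and may be absent
            profile.servers[server] += 1
    return profiles


# Hypothetical sample records: (URL, size in bytes, Server header or None).
SAMPLE_RECORDS = [
    ("http://www.example.edu/", 4000, "Apache/1.3.27"),
    ("http://www.example.edu/about.html", 2000, "Apache/1.3.27"),
    ("http://www.example.org/", 1000, None),
]

if __name__ == "__main__":
    for host, profile in build_site_profiles(SAMPLE_RECORDS).items():
        print(host, profile.pages, profile.total_bytes,
              profile.average_page_bytes, dict(profile.servers))
```

Keeping the server strings in a counter rather than a single field allows for sites whose pages report more than one server software or version.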
Site on a server
- What web server software/version is in use, by top domain?
- What are the characteristics of the Web server: cost, reliability, etc.?