Archiving of Internet Content

1. Introduction

Earlier, non-Internet-based digital media have conservation issues of their own, and that kind of preservation is the explicit archiving task we are most accustomed to. The ephemeral nature of Web content, however, poses a new dilemma: how do we preserve the information on something like the Web, where an immense quantity of material is here today and gone tomorrow, left in the hands of individual providers who may erase it at any time? Web archiving combines old problems with difficulties all its own. The information explosion of the Web is comparable to the explosions of publication in previous eras, but the sheer quantity of material vastly exceeds anything that has come before.

These questions are raised by several recent developments on the Web. The Internet Archive, a project launched by a group of computer scientists, seeks to preserve “snapshots” of the Web at various points in its history. Another group, spawned from the Computer Science and Telecommunications Board of the National Research Council’s Committee on 21st Century Systems, has taken on a more academic preservation role with its Digital Documents in Science and Engineering project. These and similar projects at least have the effect of posing the question of Web content preservation: by turning attention to the “vanishing” nature of Web-based information, they make a strong claim that it is worth saving.

The World Wide Web is a valuable cultural artifact; a feat of modern society equal to the construction of the great Library of Alexandria. Much like that ancient repository of knowledge, the Web offers both knowledge and nonsense, occasionally at the same time. And it, too, is fragile. Nevertheless, it is something very new under the sun. What is to become of its contents? What will be left for those who seek to understand the dawn of the information age?

2. Importance of Archiving Internet Content

The Internet is a global information resource. It has revolutionized access to information and carries many forms of data, yet information stored on the Web today can change or disappear without leaving any record; people or systems looking for that information are often left with a “File Not Found” error. The average lifespan of a web page is estimated at between 44 and 75 days; data in databases goes out of date or is purged after a set period, or a site administrator may take a site down and replace it with a new page. Internet content is vulnerable and in danger of being lost, particularly in volatile areas such as news and current affairs, commercial and financial data, and other fast-moving sectors of society. This often happens because material is “published and purged”: it is never stored in any physical form and is available to the public only for a limited time. Material published in scholarly and scientific journals is also at risk of being removed, accidentally or deliberately, and the absence of “wayback” functionality for dynamic pages means it may not be possible to browse archived content as it was originally generated or presented. Archiving can help preserve this information.
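
To make the “File Not Found” problem concrete, the short sketch below checks a list of URLs and records which ones still resolve and which now return errors such as 404. It is written in Python using only the standard library; the function name and return convention are choices made for this illustration.

```python
import urllib.error
import urllib.request

def check_links(urls):
    """Report which URLs still resolve and which have gone missing."""
    results = {}
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                results[url] = response.status      # e.g. 200 if still reachable
        except urllib.error.HTTPError as err:
            results[url] = err.code                 # e.g. 404 "File Not Found"
        except urllib.error.URLError:
            results[url] = None                     # host gone or unreachable
    return results
```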

A photograph from the U.S. Army Corps of Engineers Digital Visual Library, titled “Contraband Found on Porters After Severe Punishment.”

3. Challenges in Archiving Internet Content

In recent years, several organizations have been involved in archiving internet content. For example, the Internet Archive has been archiving web content for nearly ten years and has amassed a collection of hundreds of millions of resources. Part of the success of the Internet Archive is due to the fact that much of the early web was static: content was served as HTML files, images, and video, and was relatively easy to capture. The web of today is vastly different. Dynamic, database-backed web sites are prevalent, as seen in the rise in popularity of sites built on content management systems such as PHP-Nuke, PostNuke, and Microsoft SharePoint. One study found that in 2003, of 100,000 popular news sites on the web, only 10% of the content was static, the rest being generated from a database when the page was viewed. Archiving a web site that uses a content management system is difficult enough, but some web sites that change the state of their pages and the content within them in response to user interaction are nearly impossible to capture or recreate with any level of success. An example is a travel site that returns flight availability and prices based on user input: the pages that are displayed are often not stored at all, and when they are, it is in a temporary location from which they are deleted after a period of time. Such a site is, in a sense, transient; without the same user interaction, the content that was there at a particular point in time may no longer be there the next time the page is viewed, so what the archivist has collected is an inaccurate representation. Drop-down menus and forms raise similar problems: if the content is generated from an external source, there is no guarantee that it will be available at a later date. With interactive technologies constantly evolving and becoming more complex, the problem of accurately capturing such sites is only going to get harder.
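
One rough heuristic for spotting such transient pages is to fetch the same URL more than once and compare the results. The sketch below illustrates that idea in Python using only the standard library; the function name and the delay between fetches are arbitrary choices for this example, and a hash mismatch is only a hint, not proof, that a page is generated dynamically.

```python
import hashlib
import time
import urllib.request

def looks_dynamic(url, delay_seconds=5):
    """Fetch the same URL twice, a few seconds apart, and compare content
    hashes; a mismatch suggests the page is generated on the fly rather
    than served from a static file."""
    def fingerprint():
        with urllib.request.urlopen(url, timeout=10) as response:
            return hashlib.sha256(response.read()).hexdigest()

    first = fingerprint()
    time.sleep(delay_seconds)
    second = fingerprint()
    return first != second
```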

4. Methods and Technologies for Archiving Internet Content

There are various methods and technologies used for archiving internet content, including harvesting technologies, caching technologies, search engines and, more recently, taking ‘snapshots’ of web pages. We will look first at harvesting technologies and then at caching.

Harvesting technologies can be classified into site-directed and whole-site archiving methods. Site-directed methods resemble the way a web browser operates, where the user clicks on links to discover new pages. A simple form of automated site-directed harvesting is already built into many web browsers through the File -> Save As… dialog, which allows the user to save a web page along with all of its dependencies. This approach has evolved into more sophisticated techniques based on web spiders or robots, which systematically explore and retrieve content from web sites. Pagefinder and WebCrawler are examples of early web spider programs, which start from an initial list of seed URLs and follow hyperlinks to new pages. The Internet Archive’s Alexa and Heritrix tools are capable of systematic, whole-site archiving: Alexa is a remote service that provides archived data from the Internet Archive’s collection, while Heritrix is an open-source, archival-quality web crawler designed to copy all resources of interest onto local disk.
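
The following is a minimal sketch of the seed-and-follow harvesting described above, written in Python with only the standard library. It is not the algorithm used by WebCrawler, Alexa, or Heritrix; the function names, the page limit, and the absence of politeness rules (robots.txt, crawl delays) are simplifications made for illustration.

```python
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl from a list of seed URLs; returns {url: html}."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                          # skip unreachable pages
        archive[url] = html                   # copy the resource to local storage
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)        # follow hyperlinks to new pages
    return archive
```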

Caching technology has been used in archiving web data almost since the beginning of the web. The idea is that frequently accessed objects can be served from a local cache instead of going to the original server each time. This saves bandwidth and reduces server load. When a web resource is updated, there is a chance that the copy in the cache is stale. A Last-Modified date is used to determine if the cached object is still up to date. However, this method is not foolproof and the cache may serve expired content. Web sites can specify a Time to Live value for cached resources, which is an indication of how long an object is considered to be fresh. When the TTL expires, the cached copy is considered stale and a fresh copy is retrieved from the web. This method has an obvious shortcoming for archiving, namely that content expires from the cache and is lost.
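
A minimal sketch of the TTL idea is shown below, using a simple in-memory cache keyed by URL; the class and method names are invented for this example. It also makes the archival shortcoming visible: once an entry expires it is simply discarded.

```python
import time

class TTLCache:
    """Minimal in-memory cache whose entries expire after a Time to Live."""

    def __init__(self, default_ttl=300):
        self.default_ttl = default_ttl        # seconds an entry is considered fresh
        self._store = {}                      # url -> (content, expiry timestamp)

    def put(self, url, content, ttl=None):
        expiry = time.time() + (ttl if ttl is not None else self.default_ttl)
        self._store[url] = (content, expiry)

    def get(self, url):
        """Return cached content while it is fresh; None once it is stale or missing."""
        entry = self._store.get(url)
        if entry is None:
            return None
        content, expiry = entry
        if time.time() > expiry:
            del self._store[url]              # the stale copy is discarded, i.e. lost to an archivist
            return None
        return content
```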

5. Future of Archiving Internet Content

The upcoming Semantic Web will present both opportunities and difficulties for archivists. For those unfamiliar with it, the Semantic Web is the concept of extending the current World Wide Web with semantic content that allows documents to be interpreted unambiguously by machines. Semantic mark-up of web content has appeared in various guises in recent years, such as the push for XML and, more recently, XHTML. An example of future semantic content is the use of machine-understandable ontologies to describe archived web content. This will be of value to archivists, who will have far richer contextual metadata about the meaning of web documents, enabling more intelligent harvesting and improved indexing for later retrieval. At the same time, the web will become more difficult to archive: more of the meaning of a page will be carried in associated metadata rather than in its visible content, and content is likely to be generated dynamically from back-end databases using ontology-specified information. An interdisciplinary effort involving researchers in web archiving and the semantic web will be required to ensure that an enriched semantic web can be captured and preserved for future generations.
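
As an illustration of the kind of machine-readable description an archive might attach to a captured page, the sketch below uses the rdflib library (assumed to be installed) together with Dublin Core terms; the ontology namespace, class name, and property values are hypothetical.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

# Hypothetical ontology for describing archived web content.
ARCHIVE = Namespace("http://example.org/web-archive#")

g = Graph()
snapshot = URIRef("http://example.org/web-archive/snapshot/42")        # illustrative identifier

g.add((snapshot, RDF.type, ARCHIVE.ArchivedPage))
g.add((snapshot, DC.source, URIRef("http://example.com/some-page")))   # the live URL that was captured
g.add((snapshot, DC.date, Literal("2005-06-01")))                      # capture date
g.add((snapshot, DC.description, Literal("Snapshot taken by a harvesting crawler")))

# Serialise the metadata as Turtle so that other tools (and machines) can consume it.
print(g.serialize(format="turtle"))
```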

Another key issue posed by archiving is how to capture the dynamic nature of the Internet. A recent paper outlined the problems of archiving online documents: “the essence” of each page is bound to change while the URL remains static; many pages are “structured” dynamically and are actually generated on the fly from a back-end database; and multiple documents may be created from a single source. The authors concluded that, for an archive to be considered meaningful, a method for capturing the changes to dynamic documents must be developed, and new “versions” of documents created from similar or single sources must also be captured. It is interesting to consider the implications of capturing multiple versions of documents released at the same URL, or of capturing the rewriting of the history of documents online, which has drastic consequences if content is censored or altered for political reasons.
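
One way to picture capturing such changes is a store that keeps every distinct version observed at a URL together with its capture time. The sketch below is a minimal Python illustration of that idea; the class and method names are invented for this example, and it does not describe any existing archiving system.

```python
import hashlib
import time
from collections import defaultdict

class VersionedArchive:
    """Keeps every distinct version of a document seen at a URL, by capture time."""

    def __init__(self):
        self._versions = defaultdict(list)    # url -> list of (timestamp, digest, content)

    def capture(self, url, content):
        """Record the content only if it differs from the last captured version."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        history = self._versions[url]
        if history and history[-1][1] == digest:
            return False                      # unchanged since the last capture
        history.append((time.time(), digest, content))
        return True                           # a new version was recorded

    def as_of(self, url, timestamp):
        """Return the version that was current at the given time, if any."""
        current = None
        for captured_at, _digest, content in self._versions[url]:
            if captured_at <= timestamp:
                current = content
        return current
```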
