Web Archiving

Archiving used to be a profession restricted to information professionals. Tools such as data mining, web crawlers, and other information retrieval techniques were knowledge exclusive to the professional archiving community. In recent years, however, with the explosion of Web 2.0 technologies and the internet boom, social media has sprung up all over the world, and power has shifted from the professionals to the users. Users are generating and sharing their ideas, information, knowledge, content and lives through social media pathways such as Twitter, Facebook, blogs, Pinterest, personal websites, Instagram and many more. The big three, Facebook, Twitter and YouTube, have found themselves enablers of change, from political revolutions to WikiLeaks to bringing attention to other national concerns. With history being written on seemingly intangible interfaces, it makes sense that recording that history would prove difficult. Archiving as a profession is concerned with preserving the human record for research and educational use. Yet with users holding the majority of control over this new human record, it only makes sense that these users seek to preserve their online social capital and history. Whether for personal use or business, users have become, with the help of freely accessible web archiving tools, their own web archivists.

YouTube was created in 2005, and while it was originally geared towards home videos, it has grown over the years to house professional content. As of today YouTube boasts users watching 4 billion hours of video each month and uploading 72 hours of video every minute. YouTube also played an important role in the Arab Spring, which started in 2010 and is still ongoing. In the world of music, YouTube has helped create stars such as Justin Bieber and Psy. Whether one prefers to watch cute kitten videos or Psy music videos, there is no denying that YouTube is a giant power player in the current age of social media. Since 2003, the United Nations Educational, Scientific and Cultural Organization, commonly referred to as UNESCO, has sought to preserve digital heritage, a category that now includes seemingly unpreservable content such as YouTube videos. These videos are part of the human record and show, in their own unique way, the cultural heritage of the internet age. YouTube is already its own unregulated archive in a sense, but it has the power to remove and delete videos at will and without notice. Another issue that arises from YouTube being an archive within itself is the role of corporations contributing to YouTube and, in a sense, profiting from the labor of users. One archiving project undertaken to record YouTube videos is YouTomb, which came out of the MIT Free Culture research project and documents videos taken down from YouTube for alleged copyright infringement. YouTomb constantly monitors the most popular videos on YouTube for copyright-related takedowns. Any information available in the metadata is retained, including who issued the complaint and how long the video was up before takedown. The goal of the project is to identify how YouTube recognizes potential copyright violations, as well as to aggregate mistakes made by the algorithm.
Currently YouTomb is monitoring 440,036 videos, and has identified 9,760 videos taken down for alleged copyright violation and 212,711 videos taken down for other, unknown reasons. The archiving tool Archive-It informs users about the various types of crawl configurations that can be used to archive YouTube content. Robots.txt is a file created by the owners of a website that limits the site content that can be crawled. YouTube blocks some of the content on its site using robots.txt, so it is important that the crawler be configured to ignore robots.txt. The most effective way to archive these videos is to use the “Crawl One Page Only” seed type, and Archive-It provides information on configuring a seed this way. Use the “watch” page as the seed in order to capture a specific video. If multiple videos are to be archived, add each “watch” page as a separate seed URL. The captured content will include the video posted on the “watch” page, the page itself, and any comments and other information on the page. Some or all of the “related” videos on the page may also be captured.
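To make the seed and robots.txt ideas a little more concrete, here is a rough sketch in Python. It is not Archive-It's own code or configuration format; it only shows how one "watch" page seed per video could be built and how to check whether a robots.txt rule would block a given URL for a crawler that obeys it. The video IDs, the helper names `build_watch_seeds` and `blocked_by_robots`, and the sample robots.txt rules are all made up for illustration and are not YouTube's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only
# (NOT a copy of YouTube's real robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /comment
Disallow: /get_video
"""


def build_watch_seeds(video_ids):
    """Return one 'watch' page seed URL per video ID."""
    return [f"https://www.youtube.com/watch?v={vid}" for vid in video_ids]


def blocked_by_robots(robots_txt, url, user_agent="*"):
    """Return True if a crawler that obeys robots_txt may not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Placeholder IDs; swap in the IDs of the videos to be archived.
seeds = build_watch_seeds(["VIDEO_ID_1", "VIDEO_ID_2"])

for seed in seeds:
    status = "blocked" if blocked_by_robots(SAMPLE_ROBOTS_TXT, seed) else "allowed"
    print(seed, status)
```

A check like this only shows which URLs a rule-following crawler would skip. Whether the crawler actually ignores robots.txt is a setting in the archiving tool itself.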




MLIS students who are interested in web archiving should also check out Archive-It, a useful tool for beginner web archivists.

