Web Archiving

Archiving used to be a profession restricted to information professionals. Tools such as data mining, web crawlers, and other information retrieval systems were knowledge exclusive to the professional archiving community. In recent years, however, with the explosion of Web 2.0 technologies and the internet boom, social media has sprung up all over the world, and power has shifted from the professionals to the users. Users are generating and sharing their ideas, information, knowledge, content and lives through social media pathways such as Twitter, Facebook, blogs, Pinterest, personal websites, Instagram and many more. The big three, Facebook, Twitter and YouTube, have found themselves enablers of change, from political revolutions to WikiLeaks to bringing attention to other national concerns. With history being written on seemingly intangible interfaces, it makes sense that recording that history would prove difficult. Archiving as a profession is concerned with preserving the human record for research and educational use. Yet with users holding the majority of control over this new human record, it only makes sense that these users seek to preserve their online social capital and history. Whether for personal use or business, users have become, with the help of freely accessible web archiving tools, their own web archivists.

YouTube was created in 2005, and while it was originally geared towards home videos, it has grown over the years to house professional content. As of today YouTube boasts users watching 4 billion hours' worth of video each month and uploading 72 hours' worth of video every minute. YouTube also played an important role in the Arab Spring, which started in 2010 and is still ongoing, and it has helped create music stars such as Justin Bieber and Psy. Whether one prefers to watch cute kitten videos or Psy music videos, there is no denying that YouTube is a giant power player in the current age of social media. Since 2003, the United Nations Educational, Scientific and Cultural Organization, commonly referred to as UNESCO, has sought to preserve this kind of seemingly un-preservable digital content. These videos are a part of the human record and show, in their own unique way, the cultural heritage of the internet age. YouTube is already its own unregulated archive in a sense, but it has the power to remove and delete videos at will and without notice. Another issue that arises from YouTube being an archive in itself is the role of corporations that contribute to YouTube and, in a sense, profit from the labor of users. One archiving project undertaken to record YouTube videos is YouTomb, created by the MIT Free Culture research project, which documents videos taken down from YouTube for copyright infringement. YouTomb constantly monitors the most popular videos on YouTube for copyright-related takedowns. Any information available in the metadata is retained, including who issued the complaint and how long the video was up before takedown. The goal of the project is to identify how YouTube recognizes potential copyright violations as well as to aggregate mistakes made by the algorithm.
Currently YouTomb is monitoring 440,036 videos and has identified 9,760 videos taken down for alleged copyright violation and 212,711 videos taken down for other, unknown reasons.

The archiving tool Archive-It informs users about the various crawl configurations that can be used to archive YouTube content. Robots.txt is a file created by the owners of a seed web site that limits which site content can be crawled. YouTube blocks some of the content on its site using robots.txt, so for these captures it is important that the crawler ignores robots.txt. The most effective way to archive these videos is to use the “Crawl One Page Only” seed type; Archive-It provides information on configuring a seed this way. Use the “watch” page as the seed in order to capture a specific video. If multiple videos are to be archived, add each “watch” page as a separate seed URL. The captured content will include the video posted on the “watch” page, the page itself, and any comments and other information on the page. Some or all of the “related” videos on the page may also be captured.
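To make the robots.txt point a little more concrete, here is a minimal Python sketch of how a crawler decides what it may fetch, and of building one seed URL per “watch” page. It is only an illustration: the rules in EXAMPLE_ROBOTS_TXT are made up and are not YouTube's actual robots.txt, the video IDs are placeholders, and the helper names and the "example-archiver" user agent are hypothetical. This is not how Archive-It itself is configured, which happens through its own interface.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT YouTube's real robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /comment
Disallow: /results
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def watch_page_seeds(video_ids):
    """Return one seed URL per video, each pointing at its "watch" page."""
    return [f"https://www.youtube.com/watch?v={vid}" for vid in video_ids]


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# Placeholder IDs; substitute the IDs of the videos to be archived.
seeds = watch_page_seeds(["VIDEO_ID_1", "VIDEO_ID_2"])

# A URL that the example rules above disallow.
blocked_url = "https://www.youtube.com/results?search_query=archive"

for url in seeds + [blocked_url]:
    allowed = parser.can_fetch("example-archiver", url)
    print(url, "allowed" if allowed else "blocked by robots.txt")
```

Under these made-up rules the two “watch” pages are reported as allowed and the search results page as blocked. A crawler that obeys robots.txt would skip anything reported as blocked, which is why the setting matters when parts of a page sit behind such rules.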




MLIS students who are interested in web archiving should also check out Archive-It, a useful tool for beginner web archivists.


Bibliotheca Alexandrina

The Bibliotheca Alexandrina Internet Archive is a backup system for the Internet Archive (IA), which was originally created in San Francisco. The IA holds records of web pages from 1996 to the present, while the Bibliotheca Alexandrina (BA) Internet Archive holds records of web pages from 1996 to 2007. The BA also holds information from the Middle East and Africa.

The IA harvests its information with Heritrix, an open-source, extensible, web-scale, archival-quality web crawler project. Heritrix only collects material available over HTTP/HTTPS, DNS, and FTP, and it is free for others to download and use. The other technology that the IA uses is the Wayback Machine, which allows users to search through archived web sites. Storing the Archive’s collections involves parsing, indexing, and physically encoding the data. With the Internet collections growing at exponential rates, this task poses an ongoing challenge. The IA stores its growing body of information on hardware that consists of PCs with clusters of IDE hard drives. Data is stored on DLT tape and hard drives in various appropriate formats, depending on the collection. Web data is received and stored in an archive format of 100-megabyte ARC files, each made up of many individual files. A great fear is that the stored data could somehow be destroyed. One of the ways the IA guards against accidents that could erase it is by keeping multiple copies of the same information. Since technology is changing so quickly, the IA is also developing emulators so that future researchers will be able to access and use the information stored in the archives.

The Bibliotheca Alexandrina Internet Archive mirrors the policies, collection methods and goals that the IA states for its main US site. At the BA Internet Archive, however, the information is stored using the PetaBox, a new machine designed to safely store and process one million gigabytes of data. The machine features low power consumption, support for multiple operating systems, easy maintenance and software to automate mirroring.

The archive at the Bibliotheca Alexandrina includes 70 billion web pages covering the period 1996–2007, 2,000 hours of Egyptian and US television broadcasts, 1,000 archival films and 25,000 digitized books acquired through the Open Content Alliance consortium. It is capable of storing 3.7 petabytes of data on 1,636 computers. The BA is one of the leading libraries and archives outside of the US and is effectively collecting content from the Middle East and Africa. As of last year the Library of Alexandria had reached 10 petabytes of information, and the founders of the Archive hope to continue documenting cultural content through its archiving process.
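For a rough sense of scale, the capacity figures quoted in this post can be turned into per-machine numbers. The short Python sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 petabyte = 1,000 terabytes = 1,000,000,000 megabytes), which the sources do not specify, and it assumes the data is spread evenly across the computers, which is almost certainly a simplification.

```python
# Assumption: decimal storage units, not binary ones.
TB_PER_PB = 1_000
MB_PER_PB = 1_000_000_000

capacity_pb = 3.7   # stated storage capacity of the BA archive
computers = 1636    # stated number of computers
arc_file_mb = 100   # stated size of one ARC file

# Average share per computer, assuming an even spread (a simplification).
tb_per_computer = capacity_pb * TB_PER_PB / computers

# How many 100-megabyte ARC files fit into one petabyte.
arc_files_per_pb = MB_PER_PB // arc_file_mb

print(f"{tb_per_computer:.2f} TB per computer")
print(f"{arc_files_per_pb:,} ARC files per petabyte")
```

Under these assumptions each computer holds a little over 2 terabytes on average, and a single petabyte corresponds to ten million ARC files.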

Bibliotheca Alexandrina http://www.bibalex.org/Home/Default_EN.aspx




MLIS Degree: Is It Worth It?

Some of you may have had the fortune to read the 2012 Forbes article on the value of obtaining an MLIS degree in comparison to other options. Perhaps you are a liberal arts soul like myself and saw librarianship as a way to have a stable (ish) job after graduation while doing something you enjoy. Everyone knows that this is not the job market our parents and grandparents had the pleasure of living in. Good jobs with a nice growth rate are hard to come by, and succeeding nowadays requires sacrifice coupled with continual hard work. So along came this article, published by Forbes in 2012, which can be accessed here: http://www.forbes.com/sites/jacquelynsmith/2012/06/08/the-best-and-worst-masters-degrees-for-jobs-2/

“Library and information science degree-holders bring in $57,600 mid-career, on average. Common jobs for them are school librarian, library director and reference librarian, and there are expected to be just 8.5% more of them by 2020. The low pay rank and estimated growth rank make library and information science the worst master’s degree for jobs right now.”

Lovely. It should be made into a motivational meme. Or did other MLIS students also feel kicked in the gut after reading this article, as I did? Many responses from other librarians started showing up on the internet. However, I feel that there is no yellow brick road to this idea of success and “making it,” whatever that may constitute. Instead, scholars of any field need to work harder to be successful, volunteer, take opportunities and not be afraid to fail. An MLIS provides its students with skills that are applicable beyond the field of library science, as displayed in this post: http://infospace.ischool.syr.edu/2011/12/23/61-non-librarian-jobs-for-librarians/ Among the jobs listed were Creative Project Manager, Director of Community Service, Web Analytics Manager, Information Resources Specialist, Technical Information Specialist, Documentation Specialist, Geographic Information System Map Specialist, and Digital Reference Librarian, just to name a few.

Throughout my educational career I have seen students filled with the desire for a very lucrative future career. Yet what I think librarians have, though it may appear to others a poor substitute for salary, is the satisfaction of helping others. As the Annoyed Librarian said, “We librarians aren’t in it for the money. We’re in it for the relaxation and the goodwill.” So, fellow MLIS students, have you encountered similar dissuasion? How did you respond?