Archiving a digital history

Katie Jacobs Bohn
September 18, 2015

UNIVERSITY PARK, Pa. -- Historians have pieced together history for years by trawling through attics, basements and other spaces for old journals, newspapers, letters and scrapbooks. These artifacts have given us important insights into the past, painting rich pictures of culture and reminding us of events we may have forgotten.

But now, Facebook has replaced scrapbooks, blogs have replaced diaries and many news outlets exist solely online. (And written letters have all but disappeared.)

While there are perks to these digital media — no ink to smudge on your hands, for example — they also create problems for today’s archivists and tomorrow’s historians. While common wisdom suggests that once something is on the Internet it’s there forever, many experts agree that the average Web page exists for mere months instead of years. Archivists need a way to preserve these digital artifacts so future historians have access to them.

Ben Goldman, Penn State’s digital records archivist and Sally W. Kalin Early Career Librarian for Technological Innovations, is trying to do just that. Goldman uses Archive-It — a service from the nonprofit organization the Internet Archive that allows users to make copies of Web pages and arrange them in collections — to digitally preserve cultural heritage, including the University’s academic and administrative documentation published on the Web.

“Web archiving is important because so much of Penn State’s media is ‘born digital,’ or in other words, there’s never a physical copy. A lot of the photos on the University’s Flickr page, for example, or its news articles,” said Goldman. “But we still need a way to keep and preserve this material so it’s not lost forever.”

The Archive-It system uses the Wayback Machine, a robot that uses two methods to make copies of Web pages. The first is through automatic “crawling”: twisting and turning through the Internet like Pac-Man, eating up and making copies of every Web page it encounters. The second is a manual feature — anyone can go to the Wayback Machine’s website and plug in any URL to make a copy and save it to the archive.

Once a copy of a Web page is created, it exists in perpetuity, even if the page is edited or deleted in the future. Users can see the history of a Web page by plugging in the desired URL and then browsing by date. If you’ve ever wondered what Penn State’s home page looked like in 1997, this is how you find out.
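For readers who want to try this, here is a minimal sketch in Python of how those lookups are commonly addressed. It only builds the Web addresses and does not contact the Internet Archive. The URL patterns reflect the Wayback Machine's public conventions as generally documented, the helper names are invented for this illustration, and the home page address and date are assumptions chosen to match the 1997 example.

```python
from urllib.parse import urlencode

# Public Wayback Machine endpoints (assumed stable; check the Internet
# Archive's own documentation before relying on them).
WAYBACK_BASE = "https://web.archive.org"
AVAILABILITY_API = "https://archive.org/wayback/available"


def snapshot_url(page_url: str, timestamp: str) -> str:
    """Address that resolves to the capture closest to `timestamp` (YYYYMMDD)."""
    return f"{WAYBACK_BASE}/web/{timestamp}/{page_url}"


def save_url(page_url: str) -> str:
    """Address that asks the Wayback Machine to capture `page_url` now."""
    return f"{WAYBACK_BASE}/save/{page_url}"


def availability_query(page_url: str, timestamp: str) -> str:
    """Query that reports the closest archived capture as JSON."""
    params = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_API}?{params}"


if __name__ == "__main__":
    # Hypothetical example: Penn State's home page around the start of 1997.
    home_page = "http://www.psu.edu/"
    print(snapshot_url(home_page, "19970101"))
    print(save_url(home_page))
    print(availability_query(home_page, "19970101"))
```

Pasting the first printed address into a browser should lead to the capture nearest the requested date, if one exists.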

With the Wayback Machine gathering, copying and archiving so many Web pages every day, its database has become gargantuan. (The entire Internet Archive currently has 436 billion pages saved.) It’s also unwieldy, with no way to search it beyond viewing each website’s own history.

So, much as archivists and preservationists create collections of physical documents, such as Penn State’s collection “Perry Family Diaries,” which chronicles Pennsylvania pioneer life, Goldman uses the Archive-It service to create and organize collections of digital media surrounding specific themes.

“Part of my job is maintenance, where I periodically crawl sections of Penn State’s website to make sure they’re preserved. Department websites, for example, or The Arboretum at Penn State’s page,” said Goldman. “But a large part of my job is curatorial, checking to make sure the Web crawler got everything it was supposed to and organizing specific content into collections.”

One such collection is the Pennsylvania Shale Energy Web Archive, which is documenting the social, economic and environmental impact of natural gas development in Pennsylvania.

“We’re trying to capture all sides of the shale energy topic,” said Goldman. “We want to preserve all sorts of content surrounding the natural gas effort, from the government websites to media coverage to the local communities using Facebook to talk about it.”

Helping him is Linda Musser, head of Penn State’s Earth and Mineral Sciences Library. She says that while she’s used digital Web archives in the past, this is the first time she’s been the one doing the archiving.

“It’s nice to be in the driver’s seat for this project. Instead of relying on what others have deemed worthy of preserving, we’re now the ones saying, ‘Yes, this is important. This is what we want to save,’” said Musser. “That is, after all, one of the things librarians are here to do.”

The collection is a work in progress. Musser says she’s always on the lookout for content that could be added or that they’ve missed. When she finds something, she passes it on to Goldman, who captures and categorizes it within the collection.

Goldman says that while collections like this one are interesting now, they will prove to be even more important in the future. Musser agrees.

“This has been unfolding over many years, and it’s impacting society on many levels. And a lot of it’s happening in Pennsylvania,” said Musser. “We’re trying to create a really broad overview of what’s happening so future generations can look back and get a sense of how it unfolded.”

To date, Goldman has preserved about 6 million files from 280 websites for the Pennsylvania Shale Energy Web Archive, adding up to about one terabyte of data. But this is just a drop in the bucket of the 436 billion total pages saved in the Internet Archive. That’s a huge database. For security, the entire archive is “mirrored,” meaning a full copy is stored at the Bibliotheca Alexandrina in Egypt.
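To put those numbers in proportion, the rough arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: it assumes one terabyte means 10^12 bytes, and it compares the collection's files with the Internet Archive's pages even though the two are not counted the same way. The function names are invented for this illustration.

```python
# Figures quoted in the article (all approximate).
COLLECTION_FILES = 6_000_000          # files in the shale energy collection
COLLECTION_SITES = 280                # websites crawled for the collection
COLLECTION_BYTES = 10**12             # "about one terabyte" (assumed decimal)
ARCHIVE_PAGES = 436_000_000_000       # pages saved in the Internet Archive


def average_file_size_kb() -> float:
    """Average size of one preserved file, in kilobytes (1 KB = 1,000 bytes)."""
    return COLLECTION_BYTES / COLLECTION_FILES / 1000


def files_per_site() -> float:
    """Average number of preserved files per crawled website."""
    return COLLECTION_FILES / COLLECTION_SITES


def share_of_archive_percent() -> float:
    """Collection files as a percentage of all pages in the Internet Archive."""
    return COLLECTION_FILES / ARCHIVE_PAGES * 100


if __name__ == "__main__":
    print(f"Average file size: about {average_file_size_kb():.0f} KB")
    print(f"Files per website: about {files_per_site():,.0f}")
    print(f"Share of the archive: about {share_of_archive_percent():.4f}%")
```

By this estimate the collection averages roughly 167 KB per file and about 21,000 files per website, and it amounts to a little over one thousandth of one percent of the archive's total.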

But preservation requires more than backup. Technology is constantly evolving, and it’s hard to know what digital archiving will look like 50 years from now, let alone hundreds.

“In the right environment, paper will last hundreds of years, but digital information has a lot of dependencies. To be able to access digital files in the future, you may need a certain kind of hardware and operating system, a compatible version of the software to open the file, not to mention electricity,” said Goldman. “A lot of digital preservation work involves mitigating the risks associated with these dependencies. For example, trying to use open file formats so you don’t need specific software programs that may no longer be around to access them.”

So the technique isn’t perfect. But then again, neither is paper. When correctly preserved, paper artifacts can last hundreds of years. But improper handling and storage can get in the way of preservation perfection, causing deterioration and rot.

Whether an archivist is preserving a newspaper or a news blog, there are difficulties with trying to grip the slippery, ephemeral nature of culture and history. But the Internet Archive and Archive-It are making it easier for Goldman and other University archivists to preserve Penn State’s legacy, one link at a time.


Last Updated September 18, 2015