Preservation strategies in academic and research libraries are not new concepts. However, with an increasing amount of digital content, organizations are having to cope with a new set of preservation issues.

Digital preservation is in its infancy worldwide and presents some difficult technological issues. Since the creation of digital media, over 200 different storage types have been invented, ranging from magnetic tape to CD-ROM. Each of these media presents its own preservation issues and requires its own technology, which in many cases is no longer manufactured. In addition, there are thousands of different formats in which data can be stored on each medium, and each format may also require a specific piece of software to interpret the data's meaning.

So what is a library to do in order to protect digital content? There are no clear standards in the area of digital preservation and with most institutions lacking resources already, how are they to tackle these issues?

This blog documents the joint investigation into the preservation of digital assets such as ejournals and eprints at Queensland University of Technology and Simon Fraser University.

Tuesday, August 29, 2006

Howard Besser's take on preservation strategies

Once again I have a colleague to thank for passing on this reference... thanks Mark.

Besser has synthesised the issues surrounding digital longevity into five general areas. These are, in my opinion, a great guide for someone with any level of digital preservation experience.

The Viewing Problem - All digital formats require computer technology to view them. By nature, technology (software/hardware/formats) moves at such a rapid pace that, odds are, it won't be around when you want to view your data. This is of course unless you're viewing data right after you "preserve it", in which case it's not really preserved now, is it?

The Scrambling Problem - Data is often compressed or "scrambled" to assist in its storage and/or protect its intellectual content. These compression and encryption algorithms are often developed by private organisations who will one day cease to support them. If this happens you're stuck between a rock and a hard place. If you don't want to get into legal trouble you are no longer able to read your data; and if you go ahead and "do the unwrapping yourself" it's quite possible you're breaking copyright law.

The Inter-Relation Problem - Digital information is often linked to other items. This is much more evident in the digital world than the physical. If these links aren't maintained the information is either incomplete, incorrect, or just plain doesn't make any sense. Unfortunately, due to the diversity of digital linkages and the relatively recent identification of these issues, they're often overlooked. A simple example of this is links on web pages which have died, never to be resolved again. Frustrating!

The Custodial Problem - Who is the custodian of a digital document? Is it a librarian's job? What if someone changes the content without telling the custodian? After all, digital content is dynamic and easily changed. So does the document's custodian have to undertake version control? And then is it really preservation?

The Translation Problem - If we need software to interpret data (due to formats etc.), and software changes from version to version, will the data be translated differently in subsequent versions? Even if the software claims it will read older files faithfully, it might sometimes lose formatting or a font. This is particularly dangerous where the changes are subtle or so small that no one notices them. Or does it really matter at all?
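
To make the Inter-Relation Problem a bit more concrete, here is a rough sketch (mine, not Besser's) of how you might spot dead links inside a small archived collection of web pages. It assumes the pages are already loaded into a dictionary keyed by file name; the names LinkCollector, dangling_links and archive are made up for illustration, and it only checks links between pages in the collection, not links out to the live web.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def dangling_links(pages):
    """Return {page name: [link targets missing from the collection]}.

    `pages` maps a (hypothetical) file name to its HTML text.
    Links with a URL scheme (http:, mailto:, ...) and bare
    fragments (#top) are skipped, since they do not point at
    another file in the collection.
    """
    report = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        missing = []
        for link in collector.links:
            target = link.split("#", 1)[0]  # drop any #fragment
            if not target or ":" in target:
                continue
            if target not in pages:
                missing.append(target)
        if missing:
            report[name] = missing
    return report


# A made-up two-page archive in which data.html was never preserved.
archive = {
    "index.html": '<a href="paper.html">Paper</a> <a href="data.html#t1">Data</a>',
    "paper.html": '<a href="index.html">Home</a> <a href="http://example.org/">Ext</a>',
}
print(dangling_links(archive))
```

Running a check like this when material is first deposited is far cheaper than discovering years later that half of a record has gone missing.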


Besser, Howard (2000) 'Digital Longevity', in Handbook for Digital Projects: A Management Tool for Preservation and Access
Available at http://www.nedcc.org/digital/ix.htm

Media Obsolescence

The 1960 U.S. Census was stored on a now obsolete computer tape. Only one machine in the U.S. can read those tapes, and that machine is in the Smithsonian Institution.

Magnetic tape, on which most of the world's computer backups are stored, can degrade within a decade.

About 20 percent of the data collected for NASA's 1976 Viking Mars landing is completely unreadable and lost forever. With over 1000 people working on the landing alone, can you imagine how much they spent to get that 20 percent of data in the first place?

Digital Preservation Management

This is a great site which was brought to my attention by a colleague. In 2004 it was the winner of the Society of American Archivists' Preservation Publication Award.
The site is a sort of bird's-eye, step-by-step view of digital preservation and illustrates some of the difficulties present in the area. The authors have included timelines referring to different types of media and how long each lasted. It also includes a quiz on digital media which is likely to illustrate how little librarians know about digital preservation.
The link below will bring you to a page with two tutorials. Hit the "digital preservation" tute and away you go.
http://www.library.cornell.edu/iris/tutorial/dpm/

Thursday, August 24, 2006

Fragile Media: the myth of the 100 year CD-Rom

This article is a great example of how easily we can lose digital content through poor selection of storage media. The progression of technology often pushes institutions to move from one storage medium to another through forced obsolescence. However, it is interesting to note that new and "improved" media do not always deliver the same level of functionality. Ironically, many libraries that initially began migrating data to digital repositories under the banner of "economy" have found that it is too costly to keep up with technology and are using microfilm and acid-free paper to back up digital content. The "myth of the 100 year CD-Rom" is a good example of how we've been sold a potentially disastrous product for digital preservation on many levels, whether it's personal photos or important research data. When you consider the preservation value of a CD-ROM with an undetermined shelf life beginning at only two years, all of a sudden keeping paper records doesn't seem too bad.

Malda, R. The Myth of the 100 Year CD-Rom
http://www.rense.com/general52/themythofthe100year.htm

Wednesday, August 02, 2006

Threats to digital preservation

Massive storage failures
Basically, no matter how much money you spend on the system housing your data, there are still many ways in which it can fall over and create opportunities for data to be lost. The cause may be anything from hardware/software failure to an act of war. The longer you try to store data, the more likely this will occur.

Mistaken erasure
Sometimes people accidentally delete things and if it's the only copy, then it's gone. On the other hand sometimes people think that they no longer need a piece of data and delete it on purpose only to find that it was in fact useful. The longer you try to store data the more likely this will occur.
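
The "longer you store it, the more likely it is" point can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic under assumptions of my own: a constant, independent chance of a loss event each year, and (for the second function) copies that fail independently and are never repaired or replaced. Real systems violate both assumptions, and the function names and the 1% figure are made up for illustration.

```python
def loss_probability(annual_risk, years):
    """Chance of at least one loss event over `years`.

    Assumes the same independent `annual_risk` (0..1) every year,
    so the chance of surviving all the years is (1 - risk) ** years.
    """
    if not 0.0 <= annual_risk <= 1.0:
        raise ValueError("annual_risk must be between 0 and 1")
    return 1.0 - (1.0 - annual_risk) ** years


def loss_probability_with_copies(annual_risk, years, copies):
    """Chance that every one of `copies` independent copies is lost.

    Assumes the copies fail independently and are never repaired,
    so the individual loss probabilities simply multiply.
    """
    return loss_probability(annual_risk, years) ** copies


# Illustrative figure only: a 1% chance of a loss event per year.
for years in (1, 10, 50, 100):
    single = loss_probability(0.01, years)
    triple = loss_probability_with_copies(0.01, years, copies=3)
    print(f"{years:>3} years: one copy {single:.1%}, three copies {triple:.1%}")
```

Even a small yearly risk adds up over preservation timescales, which is why "the only copy" is such a dangerous phrase.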

Bit rot
No affordable digital storage is completely reliable over a long period of time. For example, some CDs have recently been shown to have a life span of only 2 years, which could cause significant problems for anyone relying on them. Other media such as magnetic tape also suffer various types of bit rot. The worst thing about this threat is that it often goes undetected until it's too late to recover the material. You would very nearly have to employ someone to check all your media all the time to minimise data losses, which would make most of these media too expensive to seriously consider in a preservation project. Bit rot is inevitable with any storage medium over a period of time.
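
Some of that checking can be automated. One common approach (not something described in the HP paper, just a sketch of the general idea) is to record a checksum for every file when it is stored and recompute it later: a checksum that no longer matches means the bits have changed. The function names below are my own, and the example assumes the files sit in an ordinary directory tree.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare files under root with a manifest made earlier.

    Returns (missing, changed): relative paths that have vanished,
    and relative paths whose contents no longer match.
    """
    root = Path(root)
    missing, changed = [], []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file():
            missing.append(rel)
        elif sha256_of(path) != expected:
            changed.append(rel)
    return missing, changed
```

A check like this only detects damage. Repairing it still depends on having another good copy somewhere, and the manifest itself needs to be stored safely.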

Outdated media
Over time all kinds of digital media become outdated. Technology is driven by innovation, which unfortunately leads to very short periods of relevancy before redundancy. Data stored on redundant media becomes effectively useless if the appropriate hardware is not available to read it. This is a particularly difficult issue to manage where data is stored over long periods of time. Ideally, long term data storage should be technology independent; however, this is not practical. A Cornell University website (mentioned above in another post) has actually documented the lifespan of various storage media, with floppy disks lasting a whopping five years.


Outdated formats, applications and systems
As hardware becomes redundant, so do file formats and the software which interprets them. A good example of this is WordPerfect; try to find a computer today which can read a WordPerfect document properly. Fortunately, system and format redundancy does not usually happen at quite as rapid a pace as hardware.

This is a difficult problem for long term storage and there are two common, but awkward, solutions. The first is to preserve a copy of the appropriate software and make it available wherever that data is stored. This becomes increasingly unmanageable as the number of systems required increases. The second is to migrate data to an acceptable format; for example, all text files might be migrated to PDF, thus only requiring copies of Adobe Acrobat to be preserved. However, during the migration process it is possible to lose data. It is also a costly process in terms of work hours and expertise.
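
Whichever route you take, it helps to know which formats you are actually holding. As a rough illustration (my own sketch, not a method from the paper), many formats can be recognised from the first few bytes of the file. The short signature table below covers only a handful of well-known formats, and the function names are made up; a real project would use a proper format registry rather than a hand-written list.

```python
from collections import Counter
from pathlib import Path

# Leading bytes ("magic numbers") of a few well-known formats.
# Deliberately tiny: anything not listed is reported as "unknown".
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xffWPC", "WordPerfect"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(header):
    """Guess a format label from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def inventory(root):
    """Count files under root by guessed format."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            with open(path, "rb") as handle:
                counts[identify(handle.read(16))] += 1
    return counts
```

The counts give a first estimate of how much material sits in each format, and so how big a migration (or how many preserved software packages) you would be signing up for.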

Loss of context
Some data can be related, and this relationship can be vital to data interpretation. A good example of this might be the Rosetta Stone, discovered in Rashid, Egypt. The stone is engraved with the same text in three scripts (hieroglyphic, Demotic and Ancient Greek), and without the "key" to what the hieroglyphic symbols meant no one was able to read them. It took a French scholar, Jean-François Champollion, fourteen years to decipher the inscription. Can you imagine if you had to take that amount of time to decipher each document on your PC because someone had forgotten to preserve the relationship between that document and its key? It would be like trying to assemble Ikea furniture without instructions, a complete waste of time. Unfortunately, if this relationship is not identified and preserved when information is first stored it is unlikely to ever be recovered. The longer the data is kept without this relationship, the less likely it is to ever be resolved.

Intentional attacks
Unfortunately, in the world we live in there are some people who intentionally destroy or damage digital assets for a variety of reasons. As much of the information is currently located in open access repositories accessible via the internet, it is also vulnerable to attack. This is a threat to both long and short term storage.

Lack of resources
Many institutions simply do not have the resources, usually financial, to consider digital preservation. These strategies are often overlooked as low priority and are likely to remain so until a major data loss scares people into action.

Organisational failure
This is a massive threat to long term digital storage of any kind. Technology is dynamic not only in its innovations but also in its market, with competition killing off vendors who at one point seemed to be very strong tech players. For this reason it would be folly to rely too heavily on any one vendor/system/sponsoring organisation, because they change and often change quickly. Digital assets which need to be preserved long term must be protected from the failure of any one organisation. Unfortunately this is easily said but hard to plan for in such a dynamic environment.

Why Traditional Storage Systems Don't Help Us Save Stuff Forever
Baker, M., Keeton, K., Martin, S. June 27 2005
HP Laboratories Palo Alto