University of California, Berkeley [Also incorporated into the Online Archive of California]


URL: http://sunsite.berkeley.edu/ead

Encoding Procedure:  

Since 1995 the UC Berkeley Library has used a wide variety of techniques to encode our legacy finding aids in SGML, reflecting the wide variety of formats these documents were in. Because we began our retrospective conversion with The Bancroft Library's electronic finding aids--authored originally in WordPerfect--we started with WordPerfect macros of varying sophistication. The lead programmer provided intensive training in the WordPerfect macro language through a series of seminars. The original WordPerfect macro manual used within the unit (which is now somewhat out of date) can be found here.

Since the beginning of the project we have used stepwise refinement to encode legacy finding aids, a practice we continue to this day. Stepwise refinement begins the encoding process by adding "coarse" markup, essentially fitting the legacy information into a broad hierarchical structure consisting of little more than component information. Then a variety of techniques are employed to add markup of increasingly fine granularity, e.g., first adding the unittitle information, then encoding unitdates, etc. Most of these subsequent passes were also performed using WordPerfect macros, but as the project progressed the perl programming language was employed.
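The two-pass idea described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself used WordPerfect macros and perl), and the function names, the flat one-line-per-component input, and the single <c01> level are all assumptions, not the project's actual code.

```python
from xml.sax.saxutils import escape

def coarse_pass(lines):
    """Pass 1 (assumed form): wrap each non-blank legacy line in a bare component."""
    return [f"<c01>{escape(line.strip())}</c01>" for line in lines if line.strip()]

def unittitle_pass(components):
    """Pass 2 (assumed form): refine each bare component with <did><unittitle>."""
    refined = []
    for component in components:
        # Strip the coarse wrapper, then re-wrap the text with finer markup.
        text = component[len("<c01>"):-len("</c01>")]
        refined.append(f"<c01><did><unittitle>{text}</unittitle></did></c01>")
    return refined

# Hypothetical legacy container-list lines, one component per line.
legacy = ["Correspondence, 1901-1910", "", "Diaries, 1895"]
refined = unittitle_pass(coarse_pass(legacy))
for line in refined:
    print(line)
```

Each pass leaves the file in a valid intermediate state, so later passes (unitdates, names, and so on) can be added or rerun without redoing the earlier ones.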

Today, every member of the Digital Publishing Group has completed five-week classes in perl programming through the University Extension program, and perl has become part of our markup lives. We have created a small toolkit of simple perl programs, which is available at: http://sunsite2..edu/oac/toolkit. The kit is composed of several small scripts useful for stepwise refinement, including scripts to recognize and encode unitdates, persnames, and corpnames within unittitles. The toolkit also includes a preconfigured parser (nsgmls) used to validate every finding aid before it is submitted for publication on the OAC.
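As a rough illustration of what a unitdate-recognition pass might look like, here is a sketch in Python rather than perl. It is not taken from the toolkit: the function name and the rule it applies (a four-digit year or year range at the very end of a unittitle is treated as the date) are assumptions chosen to keep the example small.

```python
import re

UNITTITLE = re.compile(r"<unittitle>(.*?)</unittitle>", re.DOTALL)
# Assumed rule: a year or year range (e.g. 1895 or 1901-1910) ending the title.
TRAILING_DATE = re.compile(r"\b(\d{4}(?:\s*-\s*\d{4})?)\s*$")

def encode_unitdates(sgml):
    """Wrap a trailing year or year range inside each <unittitle> in <unitdate>."""
    def refine(match):
        title = match.group(1)
        if "<unitdate>" in title:
            return match.group(0)  # already encoded; keep the pass idempotent
        tagged = TRAILING_DATE.sub(r"<unitdate>\1</unitdate>", title, count=1)
        return f"<unittitle>{tagged}</unittitle>"
    return UNITTITLE.sub(refine, sgml)

print(encode_unitdates("<unittitle>Correspondence, 1901-1910</unittitle>"))
```

Titles without a trailing date pass through unchanged, and running the pass twice does not double-wrap a date, which matters when refinement passes are rerun on partly encoded files.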

Before long we found that we could more efficiently encode a finding aid's "front matter"--that is, all of the information not occurring within the dsc--through a standard web template. This proved faster than trying to create macros or specialized programs to accommodate the wide variety of layouts in the finding aids produced by the eight contributing repositories at UC Berkeley. The templates can be seen in action at: http://sunsite..EDU/FindingAids/uc-ead/templates and the cgi script we use is available for anybody else to use as part of the toolkit.

Curiously, we have found that using commercial SGML editors such as AdeptEdit, Author/Editor, or XMetaL is not an efficient way to convert legacy information into EAD. Although each member of the Digital Publishing Group has a copy of XMetaL installed, we find it useful solely as a reference tool, particularly while bringing new encoders up to speed in EAD. It is far faster to programmatically convert text to EAD in broad strokes than to apply the copy and paste method required when using these editors. XMetaL may have a role in the authoring of new finding aids, but much customization--mainly in the form of targeted dialog boxes and refinement macros--needs to be done before finding aid authors can consider it a viable replacement for their trusted word processing program.

After we completed conversion of all of our word processing files for legacy information held by Berkeley and by many of the affiliates of the Online Archive of California, a process funded by a variety of grants, we turned our attention to the legacy finding aids available only on paper. These we contracted out to a conversion vendor, Apex Data Services, which keyed the data and generated EAD. This EAD was then further refined in house when the data was returned. Our experience with employing an outside vendor for the process was fairly good, far better than our earlier experience using scanning and OCR in-house. Most finding aids required very little editing and correction, but a few of the more complex ones required a great deal of time to bring up to local standards.

We are investigating a variety of options for incorporating EAD directly into the authoring process, including a complete suite of MS Word templates and macros, dubbed "EAD Stylus" and available as part of the toolkit. Another option is to more fully integrate EAD into the Generic Digital Projects Database, developed initially for UC Berkeley's role in the Making of America II project. The Generic Database was designed to accommodate the workflow and data entry for Berkeley's variety of digitization projects including images, electronic text, sound files, moving pictures, etc. As it was intended to accommodate hierarchical description and produce arbitrarily generic output, it was easily adapted to EAD.

Relational databases have taken on a larger role at Berkeley in recent years. We can now easily import EAD-encoded finding aids into any arbitrary relational database--for enriching the data, adding item-level information for digitized surrogates, collection management, etc.--and export them back out to EAD or serve them out on the web. A tutorial and several sample programs written in perl are available at: http://sunsite..edu/ead/eaddb .
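To make the import step concrete, here is a minimal sketch in Python using the standard sqlite3 and xml.etree.ElementTree modules. It is not the perl code from the tutorial: the table layout, the function name, and the sample finding aid are hypothetical, and it assumes the finding aid is available as well-formed XML rather than SGML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified finding aid used only for this example.
SAMPLE = """<ead><archdesc level="collection"><dsc type="combined">
<c01 level="series"><did><unittitle>Correspondence</unittitle></did>
<c02 level="file"><did><unittitle>Letters, 1901</unittitle></did></c02>
</c01>
</dsc></archdesc></ead>"""

def is_component(tag):
    """True for numbered component tags such as c01 or c02."""
    return len(tag) == 3 and tag[0] == "c" and tag[1:].isdigit()

def load_components(conn, root):
    """Store each component as a row that points at its parent component."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS component ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "tag TEXT, level TEXT, unittitle TEXT)"
    )

    def walk(element, parent_id):
        for child in element:
            if is_component(child.tag):
                cursor = conn.execute(
                    "INSERT INTO component (parent_id, tag, level, unittitle) "
                    "VALUES (?, ?, ?, ?)",
                    (parent_id, child.tag, child.get("level"),
                     child.findtext("did/unittitle", default="")),
                )
                walk(child, cursor.lastrowid)
            else:
                walk(child, parent_id)  # descend through dsc, did, etc.

    walk(root, None)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_components(conn, ET.fromstring(SAMPLE))
for row in conn.execute("SELECT id, parent_id, tag, unittitle FROM component ORDER BY id"):
    print(row)
```

Keeping a parent_id on every row preserves the hierarchy, so the nested components can in principle be rebuilt when the data is exported back out to EAD.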

Now that conversion of our legacy finding aids is complete, we are involved more and more in digitizing surrogates of the archival materials themselves: selected photographs, books, diaries, and letters, represented both as images or sequences of images and as searchable electronic text encoded in TEI. We are committed to using the emerging METS standard for encapsulating single and multipart digital objects in XML "wrappers." More information on these efforts is available on our Making of America II website.

Since the earliest days of the project, Berkeley has realized the importance of developing and adhering to consortial standards. The EAD encoding standard allows a surprisingly divergent, and often distressing, variety of encoding methodologies. In 1996 four institutions, UC Berkeley, Stanford University, Duke, and the University of Virginia, met to develop a uniform encoding standard for EAD finding aids. This standard, the American Heritage Retrospective Conversion Guidelines, was adopted and later extended and refined by the UC EAD consortial project, which later grew into the Online Archive of California. Recently, the Online Archive of California has developed a standard for the encoding of new finding aids, the Best Practices Guidelines for the Encoding of New Finding Aids (BPG), which builds upon the guidelines laid out in the Retrospective Conversion Guidelines. Although intended for new finding aids, the BPG provides guidelines that are beneficial to all finding aids. Although we foresee difficulties applying the full BPG to our "legacy" EAD documents, we are upgrading them programmatically to a subset of the BPG. This involves, most importantly, stripping out the old-style <drow>/<dentry> tabular markup employed in the early days of EAD at Berkeley, and combining the separate Series Description and Container List into a single <dsc> of type "combined".

Finally, UC Berkeley has no plans at the present time to begin encoding finding aids in XML. First, all of our current tools handle both XML and SGML, so there is no reason for us to switch. Second, the XML standard lacks the robust entity management mechanism present in the SGML standard. We have found this entity management to be crucial, especially when interchanging finding aids with other institutions and consortia (hard-coding a specific path or URL in every entity declaration is onerous). If valuable new authoring or publishing tools that require XML become available, or if stronger entity management is included in a future version of the XML standard, we would like to switch over.

All of our raw SGML files may be accessed in the SGML section of the Online Archive of California: http://www.oac.cdlib.org/sgml

Delivery Mechanism:  

We are currently serving out our finding aids via The California Digital Library's Online Archive of California. The OAC uses the DynaWeb server software from the Enigma Corporation.

Contact:  

Lynne Grigsby-Standfill, Head
Digital Publishing Group
UC Library
lgrigsby@library..edu

Alvin Pollock, Lead Programmer
Digital Publishing Group
apollock@library..edu

RLG Member:  

Yes.

Last updated:  November 2001

Update information:
If any information concerning the above EAD implementation is incorrect or out of date, download the XML source file for this entry, make the required changes, and mail it back to levjen@umd.edu. Updated entries may only be submitted by the contact listed above.