This page is meant to address some basic practical questions about implementing EAD in XML. It assumes a general level of knowledge of EAD and SGML, such as might be gained from an EAD workshop or by marking up finding aids in EAD. While this page focuses on XML exclusively, it does present information also found in the EAD Working Group's EAD Application Guidelines and The EAD Cookbook by Michael Fox. I've tried to reference these works below, and you may want to check them for alternate presentations and, at several points, more complete information.
Both SGML and XML can get quite complex in their details, and solving practical problems often involves details. In answering specific questions, I've tried to concentrate on what needs to be done, but I've also felt it important to explain why it needs to be done. My hope is that this knowledge can help demystify XML (and SGML) and lead to greater understanding and comfort with these technologies.
If you have questions, comments, or suggestions regarding this page, please feel free to contact me.
David Ruddy
Digital Library and Information Technologies
Cornell University Library
dwr4@cornell.edu.
Last Updated: 28 July 2000
What is XML?
XML (Extensible Markup Language) is a restricted subset of SGML.
SGML is a complex and very powerful document description, publication, exchange, and storage standard. Because of its complexity, however, several of its features are costly to implement in software, with the result being that relatively few SGML software packages exist and many of the existing ones are expensive. XML was born out of a desire to preserve the core strengths of SGML while making it easier to build compliant software tools, at the same time creating a document exchange standard for the internet--more useful and less constraining than HTML but lighter weight than SGML. XML designers removed the more complex (and thus under implemented and under utilized) features from the SGML specification, and simplified many others. In the end, XML will impose few restrictions on the majority of SGML markup schemes (though many of these tag sets will need to be reformulated).
The fact that XML is a subset of SGML has an important consequence. In general, valid XML documents are by definition valid SGML (of course, the converse is not necessarily true). If you alter your EAD SGML documents to make them XML compliant, they do not cease being SGML.
Why would I want to create EAD finding aids in XML rather than SGML, or convert my existing SGML finding aids to XML?
You may not want to. This depends a lot on your current system and environment. If you're beginning to use EAD and have no process or system in place, and thus no prior investment in authoring or delivering SGML encoded guides, then working in XML is probably a good choice. Why? XML is the future. Before long, XML will offer more varied, plentiful, and affordable software solutions. SGML software will likely remain static--either difficult to use or relatively expensive. Furthermore, XML was designed for the Web, and it opens up several options for publishing encoded documents there. These options are bound to increase in the future.
On the other hand, if your current authoring, conversion, and delivery system is tied to SGML, and especially if that system represents a large investment, then you will want to consider how easily that system could be adjusted to handle XML data. If that cost is significant, you might want to hold off on XML until there are other compelling reasons to make a system change.
To put this another way, it's not the conversion of your existing documents from SGML into XML that will represent significant costs. This can be accomplished relatively easily. It's the adjustments needed in the surrounding production procedures and delivery system that could prove expensive. Remember also, since moving your data from SGML to XML is easy, it is not as if you are creating a larger and larger future cost for yourself by continuing to work in SGML. When the time comes that you want to change to an XML system, converting the data will not be difficult.
What's the difference between an EAD encoded document in SGML and one in XML?
There is very little difference. This is because the designers of EAD, anticipating XML, created EAD so that it is nearly XML compliant as it is.
How do I set things up to alter and validate EAD documents in XML--that is, how do I begin working with EAD in XML rather than in SGML?
This is not difficult. In describing what needs to be done, it is useful to think of the changes needed as falling into one of two areas of attention. One has to do with the EAD Document Type Definition (DTD) and the markup declarations that begin each document, and the other has to do with the actual text of your finding aid with all its descriptive markup.
When you open an EAD document, you can see the Document Type (DOCTYPE) Declaration at the top. This markup declaration is called the document prolog. Everything else in the document is an "instance" of the document type--in this case, the document type is "ead" and the instance of it is everything between (and including) the start and close <ead> tags. That's your encoded finding aid.
So we've got two problems: converting our document instances to XML, and altering the DTD and the DOCTYPE Declaration so that they conform to the XML standard. The next two questions address these two problems.
How do I convert my EAD SGML document instance (the finding aid and its descriptive markup) so that it's XML compliant?
This is relatively easy. There are just a few things that are allowed in EAD SGML markup that are not allowed in XML. Valid XML markup needs to comply with the following requirements:
Empty tags must end with "/>". There are only seven empty tags in EAD (two of which are in the tabular tag set). You can find these by searching the EAD DTD for the word "EMPTY." The most commonly used empty tags in EAD will be <lb>, <ptr>, and <extptr>. These and any other empty tags in your markup need to be altered as follows if you want your documents to be valid XML:
<lb> changed to <lb/>
<ptr> changed to <ptr/>
<extptr> changed to <extptr/>
Attribute values must be quoted. Most SGML authoring software puts quotes around all attribute values automatically, so depending on how you generated your SGML, this may not be a problem at all. For example, <container type=box> in XML must be <container type="box">.
If you ensure that these conditions are met, your EAD document instances will now be XML compliant. Of course, you don't need to worry about any of these conditions if you are using an XML authoring tool, such as XMetaL, to create your documents--the software takes care of all of that for you.
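As a rough illustration of the requirements above, here is a small fragment in both forms. The element content (box number, title text) is invented for the example, and the surrounding finding aid is omitted.

```xml
<!-- SGML form: accepted by an SGML parser, but not well-formed XML
     (unquoted attribute value, empty tag without the closing slash):

     <container type=box>12</container>
     <unittitle>Correspondence,<lb>incoming</unittitle>
-->

<!-- XML form of the same markup -->
<c01 level="series">
  <did>
    <container type="box">12</container>
    <unittitle>Correspondence,<lb/>incoming</unittitle>
  </did>
</c01>
```

The XML form remains acceptable to an SGML system, which is why making these changes does not lock you out of existing SGML tools.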
For additional information on moving your EAD guides from SGML to XML, see The Guidelines, pp. 134-35. The Cookbook, section 5, describes two pieces of software that help automate the above changes.
How do I alter the EAD DTD and the Document Type (DOCTYPE) Declaration so that I can use XML software with my EAD document?
Open the EAD DTD file (ead.dtd) with a text editor. Go to the section titled "SGML EADNOTAT AND EADCHARS INCLUSION/EXCLUSION." At the end of this section, you will see the line:
<!ENTITY % sgml 'INCLUDE' >
You need to change the single word "INCLUDE" to "IGNORE".
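For reference, after that one-word change the line in ead.dtd would read as follows (a sketch of the edited line only; the rest of the DTD is left as distributed):

```xml
<!ENTITY % sgml 'IGNORE' >
```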
<!DOCTYPE ead PUBLIC "-//Society of American Archivists//DTD ead.dtd
(Encoded Archival Description (EAD) Version 1.0)//EN" [
<!ENTITY RMCaddress PUBLIC "-//Cornell University::Cornell
University Library::Division of Rare and Manuscript Collections//TEXT
(RMC contact information)//EN">
]>
The Declaration begins with a "declaration" of a document type--in this case "ead"--and the identification of the rules that govern that document type. The rules are the Document Type Definition (DTD), and they are identified here by means of a PUBLIC identifier. This means that there must be an entity management system in place (using catalog files) that maps the public identifier to a specific system file (the EAD file ead.dtd). Rather than depending on an entity management system, one could use a SYSTEM identifier instead of the PUBLIC identifier, or in addition to it. A system identifier would give the system location (path) to the named resource.
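As a minimal sketch of the SYSTEM-only alternative just described, a DOCTYPE Declaration that bypasses catalog lookup might look like this. The relative path is an assumption about the local directory layout; yours will differ.

```xml
<!DOCTYPE ead SYSTEM "../dtds/ead.dtd">
```

With this form the parser goes straight to the named file, so no public-identifier mapping is needed, at the cost of tying the document to one file location.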
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../styles/style-1.xsl"?>
<!DOCTYPE ead PUBLIC "-//Society of American Archivists//DTD ead.dtd
(Encoded Archival Description (EAD) Version 1.0)//EN" "../dtds/ead.dtd" [
<!ENTITY % eadnotat PUBLIC "-//Society of American Archivists//DTD eadnotat.ent
(EAD Notation Declarations)//EN" "../dtds/eadnotat.ent">
<!ENTITY RMCaddress PUBLIC "-//Cornell University::Cornell
University Library::Division of Rare and Manuscript Collections//TEXT (RMC contact information)//EN"
"../entities/RMCaddr.xml">
%eadnotat;
]>
A number of new things are going on here. The first line is the XML Declaration, and though not required in its default form (as we see it here), it should be included in every XML document. You may see slight variations on how this line looks, but at a minimum, use <?xml version="1.0"?>.
This is followed by another line that identifies a specific XSL stylesheet associated with this particular XML document. You may or may not have a stylesheet. See the question on XSL below.
The DOCTYPE declaration (beginning on the third line) is much the same as in the SGML version, but notice that it is now followed by a quoted system identifier (../dtds/ead.dtd), which is the path and filename of the resource, in this case the primary EAD DTD file. In XML, DOCTYPE and ENTITY declarations must include system identifiers. Because these system identifiers are unique to each environment, yours will likely be different.
The parameter entity "eadnotat" is now declared in the declaration subset instead of in the DTD itself. Note the required system identifier in the declaration pointing to the file eadnotat.ent. Also, it's not enough merely to declare the entity in order to use it; one must also reference it. This is what the line %eadnotat; does.
It is not necessary to declare and reference the eadnotat entity unless you will be referencing "non-SGML" data in your document. Non-SGML data is data that will not be parsed--typically non-textual data such as graphics (see The Guidelines, pp. 183-85, for more discussion). Non-SGML data is declared in the DOCTYPE Declaration subset with entity declarations such as:
<!ENTITY logo SYSTEM "our-logo.gif" NDATA gif>
The Guidelines (pp. 134-35), however, suggest adding the declaration of eadnotat and its reference to every declaration subset, regardless of whether or not you use it in any particular document. The more standard you make your DOCTYPE Declaration, the easier it will be to manage your files.
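To show how the pieces might fit together, here is an excerpt (not a complete document) that declares and references eadnotat, declares a graphic as non-SGML data, and then points to it from the finding aid. It assumes that eadnotat.ent declares a notation named "gif" and that <extptr> takes an entityref attribute naming the declared entity; check both against your copy of the DTD. The file name and paths are illustrative only.

```xml
<?xml version="1.0"?>
<!DOCTYPE ead SYSTEM "../dtds/ead.dtd" [
<!-- declare, then reference, the notation declarations -->
<!ENTITY % eadnotat PUBLIC "-//Society of American Archivists//DTD eadnotat.ent
(EAD Notation Declarations)//EN" "../dtds/eadnotat.ent">
%eadnotat;
<!-- unparsed (non-SGML) entity: a graphic file in gif notation -->
<!ENTITY logo SYSTEM "our-logo.gif" NDATA gif>
]>

<!-- later, inside the finding aid (surrounding markup omitted) -->
<extptr entityref="logo"/>
```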
The general entity declaration for "RMCaddress" looks just as it did in SGML, but notice that it now has a system identifier, as do all the declarations.
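To summarize the important changes to the DOCTYPE Declaration in an XML environment: begin the document with the XML Declaration; give the DOCTYPE declaration and every ENTITY declaration a system identifier; and, if you reference non-SGML data (or simply want a standard declaration), declare and reference the eadnotat entity in the declaration subset.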
For another discussion of these changes, see The Guidelines, section 4.3.2.1, and The Cookbook, section 5.
What options exist for publishing XML encoded EAD finding aids on the web?
If we think about delivering a single XML document over the web, there are two basic approaches: one is to convert the EAD XML guide to HTML and send this to the user; the other is to send out XML directly and hope that the user's browser can understand and present XML.
See The Guidelines, section 5.3.2.3, for more on HTML delivery of EAD guides.
What is XSL? What are XSL stylesheets?
XSL is an abbreviation for "Extensible Stylesheet Language." If we think of XML as focused on describing the structure of a document, XSL is concerned with presenting, or rendering, that document in some specific format. XSL has been divided into two sections with distinct functions. The first deals with formatting semantics--how to indicate precise presentational instructions like font family and size, line spacing, indentations, etc. (this XSL formatting specification is still in draft form). Yet presenting a document that contains no information about how it should be presented requires a transformation of sorts, and the second part of XSL has to do with transformative functions. This transformative language is called XSLT, and version 1.0 of it has been formalized in a W3C recommendation.
It is possible to use XSLT independently of the XSL formatting semantics, and in fact this is quite useful in XML-to-HTML conversions. If we want to display our EAD document on the web, we will want to transform the content of an EAD series level <unittitle> tag to something that looks like a heading, maybe using an <H2> heading in HTML. XSLT provides a mechanism for doing this sort of thing. Alternatively, if we're sending our EAD document to a PostScript printer, we will need to transform that element into some sort of PostScript command. The idea is that the same underlying data can be rendered in a variety of ways, depending on variables such as the output format desired (e.g., web vs. print) or the audience requesting the file (e.g., staff vs. public view).
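The <unittitle>-to-heading transformation mentioned above could be sketched in XSLT 1.0 roughly as follows. This is only an illustration: the match pattern assumes that series are encoded as <c01 level="series"> with the title inside <did>, which is one common convention but not the only one, and a real stylesheet would need many more templates.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- A series-level unittitle is rendered as an HTML h2 heading -->
  <xsl:template match="c01[@level='series']/did/unittitle">
    <h2>
      <!-- process the title's text and any child elements -->
      <xsl:apply-templates/>
    </h2>
  </xsl:template>

</xsl:stylesheet>
```

A second stylesheet with a different template for the same element could produce a different rendering from the same EAD file, which is the point of keeping presentation separate from the encoded data.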
XSL stylesheets are XML documents that contain these transformative instructions. XSL stylesheets are prepared for a single type of XML document (like EAD documents), and they are typically geared toward a particular implementation of these documents. That is, stylesheets may expect specific markup conventions, although some may be deliberately built to accommodate a wide range of tag usage. Stylesheets are attached to XML documents by means of an instruction directly following the XML Declaration at the top of the document. For an example of such an instruction, see the question above on altering the DOCTYPE Declaration for XML use.
The Cookbook includes several XSL stylesheets that produce HTML output files that vary in their look and feel.
For more on XSL, see The Guidelines, sections 5.3.3.2 and 5.3.3.6.
How do I work with special characters in XML?
Special characters are those characters not easily available on a standard English-language keyboard, those beyond the 128 characters of the US-ASCII character code set. These include characters with diacritics (á, ü, etc.), and special symbols such as the copyright sign (©). Special characters are handled differently in XML than they are in SGML.
The character code sets in use in the mid-1980s, when SGML was finalized, were relatively small, and the SGML Declaration provided a mechanism for selecting different document character sets. But it was obviously useful to have an unambiguous method of indicating a special character that would be independent of these different character code sets. The ISO provided this method by developing SGML character entity sets that standardized special character names. This allows us to use character entity reference names such as "&uuml;" or "&copy;", knowing that they will mean the same thing across every SGML implementation. To ensure this uniform understanding, we do need to reference a whole bunch of special character entity sets, with names like "ISOlat1" or "ISOpub". In EAD, the DTD handles this referencing automatically through an entity called "eadchars", and the process is largely invisible.
Of course, getting these special character names to display properly is a local character set problem, and so the task of the SGML character entity sets is to map the standard character names to something locally useful. A number of SGML software applications, such as the SGML browser Panorama, use bracketed ISO character names, and so the mapping useful for these browsers is from "&uuml;" to "[uuml]". Other display software might require something else, but virtually any mapping is possible. The important point is that in SGML the mapping is determined and managed at a local level.
XML version 1.0 has a fixed document character set: Unicode. Unicode is a 16-bit character code set that allows for 65,536 assigned characters (and even more with special tricks). The position of every character is indicated by a number, or code value, from 0 to 65,535. Hexadecimal, an alphanumeric, base-16 notation, makes it easier to write large numbers, and Unicode conventionally uses hexadecimal when referring to code values, although in XML you can use either decimal or hexadecimal. The copyright symbol, which is in the 169th position, has the code value "00A9" (hexadecimal for 169), and "u" umlaut has "00FC" (hexadecimal for 252).
In XML, these characters are accessed by using numeric character references, which for hexadecimal begin with the delimiter "&#x" (the decimal start delimiter is "&#"), and end with a semicolon (;). So in our markup we can write "&#x00A9;" to indicate ©, or "&#x00FC;" for ü. All XML compliant systems will understand what these characters are, without the need for referencing additional character entity sets. That's why when we switch the EAD DTD to XML mode (see the above change to the EAD DTD), we are removing the automatic declaration of the entity "eadchars", which manages the declaration and referencing of all the various ISO character entity sets. In XML, we don't need these anymore, as Unicode provides a cross-platform, software independent method of indicating special characters.
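As a small illustration, the fragment below writes the copyright sign and "u" umlaut as numeric character references, once in hexadecimal and once in decimal. The sentence content and the choice of <bioghist> as the enclosing element are invented for the example.

```xml
<bioghist>
  <!-- hexadecimal references: 00A9 = copyright sign, 00FC = u umlaut -->
  <p>Copyright &#x00A9; 2000. Papers acquired in M&#x00FC;nchen.</p>
  <!-- the same characters as decimal references: 169 and 252 -->
  <p>Copyright &#169; 2000. Papers acquired in M&#252;nchen.</p>
</bioghist>
```

Both paragraphs denote exactly the same characters to an XML parser; the choice between hexadecimal and decimal is only a matter of convenience.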
XML thus makes indicating special characters quite easy. The only problem may be that it's a burden to remember numeric codes such as 00A9 and 00FC, but most XML authoring tools will use lookup tables that make remembering numeric codes unnecessary. Another potential problem is SGML legacy data with ISO character names like "&copy;" and "&uuml;". These entity reference names will not be recognized in an XML environment and will produce validation errors. They need to be converted to numeric Unicode references. This may not be a serious task, depending on the number and variety of special characters used. If for whatever reason you want to retain entity reference names, this is possible in XML. See the following question on how this can be done.
How do I continue using character entity names (such as &copy; or &aacute;) in XML?
It is possible to use character entity names for special characters in XML, but we need to configure our local system in particular ways. By doing this, however, you reduce the portability of documents containing these names, as anyone you are exchanging your data with will need to have set up their system similarly. It could be argued that using these character entity names defeats the purpose of Unicode. On the other hand, there may be reasons in the near term for retaining these ISO special character names in some local systems.
If you would like to use ISO character names, it can be accomplished through a mechanism similar to what we were employing in SGML (see the question above). But now we must map our ISO character names, such as "&uuml;" (which XML display software by itself knows nothing about), to their Unicode equivalents, in this case "&#x00FC;" (which all XML software understands). This mapping is accomplished in the same way, by referencing a bunch of character entity sets, but for a number of reasons the old sets used in SGML systems won't work. What you need are character entity sets that have been "XML-ized". For a set of these, see those produced by Rick Jelliffe and available through the XML Cover Pages.
Once you have these character entity sets, you need to declare them. One approach would be to declare and reference in the DOCTYPE Declaration subset just the character sets you need. A more complete solution is to declare only a single entity in the subset, and then let that entity declare and reference all the various character entity sets you want. This is essentially what is happening in EAD SGML; it's just that the declaration of "eadchars" is occurring within the DTD and is thus automatic. In XML, the declaration of eadchars has been turned off. We can't in fact use eadchars, but must create a new version of it for use in XML (for an example, see the one in use at Cornell). We then declare and reference this entity file by including something resembling the following lines in our DOCTYPE Declaration subset:
<!ENTITY % xmlchars PUBLIC "-//Cornell University::Cornell
University Library//DTD xmlchars.ent (Special Characters)//EN//XML"
"../entities/xmlchars.ent">
%xmlchars;
This goes right alongside the entity "eadnotat" declaration. See the example of the XML DOCTYPE Declaration above.
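To give a sense of what such a wrapper file and the sets behind it might contain, here is a hypothetical sketch. It is not the actual Cornell file: the public identifiers follow the pattern commonly used for XML-ized ISO sets, and the file names (iso-lat1.ent, iso-num.ent) are assumptions to be replaced with whatever your copies are called.

```xml
<!-- xmlchars.ent (hypothetical contents): declares and references
     each XML-ized ISO character entity set we want available -->
<!ENTITY % ISOlat1 PUBLIC
  "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML" "iso-lat1.ent">
%ISOlat1;
<!ENTITY % ISOnum PUBLIC
  "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML" "iso-num.ent">
%ISOnum;

<!-- Inside the sets themselves, each ISO name is mapped to its
     Unicode code value with a declaration of this general form: -->
<!ENTITY uuml "&#x00FC;">
<!ENTITY copy "&#x00A9;">
```

If you only need a handful of characters, declarations like the last two can also be placed directly in the DOCTYPE Declaration subset, which avoids the extra files but has to be repeated in every document.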
What sort of tools are available for working with EAD in XML?
There are a number of software tools available for authoring, validating, and transforming XML documents. A primary consideration is cost. Generally, free XML software is a bit more difficult to use, and you may need different packages to accomplish everything. Commercial products provide more functionality and ease of use but at a cost.
XML can be authored directly in a text editor, freeware or shareware. Some of these may give more XML support than others. On commercial XML authoring tools (XMetaL and WordPerfect 9), see The Cookbook, section 4.
The Cookbook also describes the use of the freeware product XT and the commercial XMetaL for transforming XML files into HTML (section 7).
Where can I find additional general information about XML and XSL?
The most complete sites, which include both general and detailed information, are The XML Cover Pages and the W3C pages having to do with XML, XSL, etc. The XML FAQ is also useful. Links to all of these sites are included in the Help Pages -- Readings on SGML/XML.