XMARC (Version 0.1)
XML Mapping for MARC Data
geoff at minaret.biz
Copyright © 2001 Minaret Corp. - All Rights Reserved.
Draft version 0.1
November 28, 2001
This document describes a mapping between the U.S. MARC communications format and XML that handles any type of MARC data without any knowledge of its content or meaning. In effect, this is an XML communications format for MARC data. Its role is similar to that of the USMARC Specifications for Record Structure, Character Sets, Tapes, published by the Network Development and MARC Standards Office of the Library of Congress. In other words, the mapping described here does not require any knowledge of how fields are used in a MARC record; it is concerned only with converting data between the MARC communications format and XML.
The goal of this mapping is to produce an XML document type that is as simple as possible to understand and implement, and one which never has to be updated. This document also proposes the use of non-validating XML parsers and the avoidance of an XML DTD (Document Type Definition) to further simplify matters. A well-formed XML document is all that is required. For those who would prefer to use them, two DTDs have been created for this mapping, one for single records and one for sets of records. However, the DTDs are written to accommodate any MARC field and subfield that may be encountered, not to perform exacting validation of a MARC record. Validation and all of its associated complexity are left to other mappings and software systems.
To create a mapping that is easy to implement and understand, and which will handle any type of U.S. MARC record.
This is accomplished by using existing MARC field and subfield names as the basis for their corresponding XML element names. The result is easy to understand for anyone conversant in MARC terminology and extremely easy for system designers and programmers to implement. No sophisticated field-to-field mapping is necessary. In addition, no understanding of the content of the MARC record is necessary to perform the conversion. It does not matter whether the data is bibliographic, authority, community information, etc.
One of the most important considerations in this design is the ease with which it can be implemented. This should translate into a low implementation cost, especially compared to other, more intensive mappings. These factors will increase the likelihood that software will be implemented to utilize and create MARC records using XML tools.
Fixed Length Fields.
None of the MARC control fields (the leader and the 00x fields) are subdivided into smaller pieces; they are kept intact. This approach has the significant advantage that it does not require any knowledge of the structure or meaning of individual character positions within these fields. Any implementation that requires particular character positions from a fixed field can extract that data very easily.
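As a sketch of how an application might pull particular character positions out of an intact control field, the following Python fragment slices the text of an f008 element. The sample value and the helper name `positions` are assumptions made for illustration. The offsets used (00-05 for the date entered on file, 35-37 for the language code) come from the MARC 21 bibliographic 008 field, not from this mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical 40-character 008 value, assembled piecewise so the offsets are easy to check.
F008 = "011128" + "s" + "2001" + "    " + "xxu" + " " * 17 + "eng" + " " + "d"
RECORD = f"<xmarc><f008>{F008}</f008></xmarc>"


def positions(field_text, start, end):
    """Return character positions start..end (zero-based, inclusive) of a control field."""
    return field_text[start:end + 1]


record = ET.fromstring(RECORD)
f008 = record.findtext("f008")           # the whole field, kept as one chunk of text
date_entered = positions(f008, 0, 5)     # 008/00-05
language = positions(f008, 35, 37)       # 008/35-37
```

Because the field is stored whole, the XMARC layer needs no knowledge of these offsets; only the application that cares about them does.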
No XML attributes.
There are no XML attributes used in this specification. Attributes have limitations in XML and XML namespaces that make them less desirable than XML elements. These include the lack of support in XML parsers for unexpanded entity references in attributes (should an implementor want to store them unexpanded) and the unfortunate fact that attributes cannot belong to a default namespace.
Easy mapping in both directions.
With a few simple rules any MARC field and subfield can be mapped to an XML equivalent. The rules cover all existing and future MARC field and subfield names including local fields and subfields.
Easy to remember XMARC field and subfield names.
Anyone conversant in MARC tag numbers will find it extremely easy to use XMARC field names. With the exception of the leader from the MARC record (which is called "leader" in XMARC), all XMARC field names start with the letter "f" followed by the three-digit MARC tag number. For example, the name of the MARC title field (245) is "f245" in XMARC.
The indicators are called "i1" and "i2" in XMARC and the subfields start with the letter "s" and are followed by the MARC subfield name. So subfield "a" in MARC is called "sa" in XMARC. All you need to remember is:
Field names start with an "f", indicators start with an "i" and subfields start with an "s". Each is followed by the original MARC field, indicator or subfield code.
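The naming rule is simple enough to express in a few lines. The following Python sketch is one possible reading of it; the function names are hypothetical, and the subfield helper deliberately covers only the codes "a" through "z" and "0" through "9", leaving local subfield codes aside.

```python
def field_name(tag):
    """Map a three-digit MARC tag such as '245' to its XMARC element name."""
    if len(tag) != 3 or not (tag.isascii() and tag.isdigit()):
        raise ValueError(f"not a three-digit MARC tag: {tag!r}")
    return "f" + tag


def indicator_name(position):
    """Map indicator position 1 or 2 to 'i1' or 'i2'."""
    if position not in (1, 2):
        raise ValueError(f"MARC has only two indicators, got {position!r}")
    return f"i{position}"


def subfield_name(code):
    """Map a subfield code 'a'-'z' or '0'-'9' to its XMARC element name."""
    if len(code) == 1 and code.isascii() and (code.islower() or code.isdigit()):
        return "s" + code
    raise ValueError(f"not handled by this sketch: {code!r}")
```

For example, `field_name("245")` gives "f245", `indicator_name(1)` gives "i1", and `subfield_name("a")` gives "sa".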
No loss of MARC information.
Records can be converted back and forth between MARC and XML with no loss of MARC data. Records can be read from a MARC file, converted into XML, manipulated in XML and written back out in MARC without any unintentional data loss. However, any XML tags added to an XMARC document that are not defined in the XMARC mapping will be ignored when the data is exported as a MARC record.
XMARC records may contain local data.
An XMARC record may contain non-XMARC elements and attributes that will be ignored when exporting to MARC.
Parsers do not need to be validating.
Because the XMARC mapping is envisioned first and foremost as a communications format, there is little need for XMARC documents to be constantly validated. XMARC documents that have been generated from MARC data should be automatically valid. XMARC documents that have been created from scratch or have been modified should be validated by the application that has made these changes.
XML parsers will run much faster when XMARC documents do not begin with a document type declaration. In addition, DTD validation can only provide very basic forms of record checking and will not do anything to enforce AACR2 cataloging rules or even MARC communications format requirements.
It is envisioned that any software that converts XMARC data into the MARC communications format will play a significant role in ensuring the correctness of the resulting file.
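To illustrate what "well-formed is all that is required" can look like in practice, here is a minimal Python sketch that reads an XMARC record with the standard library's non-validating parser. The sample record content and the helper name `is_well_formed` are assumptions for illustration; no DOCTYPE is present and none is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical XMARC record with one control field and one data field.
SAMPLE = """<xmarc>
  <f001>12345</f001>
  <f245>
    <i1>1</i1>
    <i2>0</i2>
    <sa>XML mapping for MARC data /</sa>
    <sc>Minaret Corp.</sc>
  </f245>
</xmarc>"""


def is_well_formed(text):
    """Return True if text parses as XML; no DTD validation is attempted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


record = ET.fromstring(SAMPLE)
title = record.findtext("f245/sa")   # text of subfield a in field 245
```

A document with mismatched tags, such as `<xmarc><f001>1</xmarc>`, is rejected by `is_well_formed`, while any checking of MARC-level correctness remains the job of the converting application.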
Document Type Definitions.
The following document type definitions have been created for XMARC:
xmarcset (xmarcset.dtd) for when there is more than one XMARC record in a file. This DTD defines a single element (xmarcset) that contains an xmarc element for each MARC record in the file.
xmarc (xmarc.dtd) for cases where there is only one XMARC record in a file. This DTD defines a single element (xmarc) that contains all of the field and subfield definitions for this mapping.
xmarc core (xmarc.ent) is an entity declaration file that contains all of the actual declarations for the xmarc data element. This file is included by the two DTDs listed above. Because every possible MARC field that may be encountered is defined (1000 fields) along with every possible subfield (59 subfields) for each field, you may not want to include a document type declaration (DOCTYPE) in all of your XMARC documents.
Not yet documented.
Character sets and character conversion.
Not yet documented.
All letters in XMARC names are lower case.
The document-level element in an XMARC file with one record will be called "xmarc".
The document-level element in an XMARC file with multiple MARC records will be called "xmarcset", which may contain one or more xmarc elements, one for each MARC record.
Each xmarc element may contain a leader element and one or more field elements. The leader element contains the leader information from the MARC record.
The MARC control fields (001 through 009) are represented by XMARC elements f001 through f009. Each of these XMARC elements holds the entire contents of its corresponding MARC field as a single chunk of data. These elements have no sub-elements.
The MARC data fields (010 through 999) are represented by XMARC elements f010 through f999. Each of these XMARC elements may contain one or two indicator elements and one or more subfield elements.
The two MARC indicators are called "i1" and "i2" in XMARC.
MARC subfields "a" through "z" and "0" through "9" are called "sa" through "sz" and "s0" through "s9", respectively, in XMARC.
The following subfield names have been set aside for local use by the MARC specification:
! " # $ % & ' ( ) * + , - . / : ; < = > ?
As these are almost all illegal XML name characters, these subfield names are to be mapped as follows:
All other subfield names not explicitly listed here are illegal (per the MARC specification) and should result in an error by any implementation of XMARC.
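The three classes of subfield code described above (the standard codes, the codes set aside for local use, and everything else) can be told apart with a small lookup. The Python sketch below assumes hypothetical names throughout and only classifies a code; it does not attempt to produce the XMARC name for a local subfield.

```python
import string

# "a"-"z" and "0"-"9", which map to "sa"-"sz" and "s0"-"s9".
STANDARD_CODES = set(string.ascii_lowercase + string.digits)
# The 21 characters listed above as set aside for local use.
LOCAL_CODES = set("!\"#$%&'()*+,-./:;<=>?")


def classify_subfield_code(code):
    """Return 'standard' or 'local'; raise ValueError for any other code."""
    if code in STANDARD_CODES:
        return "standard"
    if code in LOCAL_CODES:
        return "local"
    raise ValueError(f"illegal MARC subfield code: {code!r}")
```

Raising an exception for anything outside the two sets is one way to satisfy the requirement that illegal subfield names result in an error.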
Creating a MARC Export File from XMARC Records
This is offered simply as a guideline for how a MARC export file might be created from an input file of XMARC records. It does not cover the details of how to generate a proper MARC record.
In general, only those elements that have been defined as an XMARC element are exported. This permits XMARC records to include non-MARC data that is stripped out during the creation of the MARC export file. It also handles the case of a set of XMARC records using a document level xmarcset element as well as a single record using a document level xmarc element.
A parser will scan an XMARC input file for any occurrences of xmarc elements. Each xmarc element and its descendants are exported as a MARC record. All other elements are ignored.
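One way to perform this scan, sketched here in Python with hypothetical sample data, is to iterate over every xmarc element in the tree. The same call then works whether the document-level element is xmarcset or a single xmarc, because `Element.iter` also visits the root itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: a record set containing a non-XMARC element, and a single record.
SET_DOC = (
    "<xmarcset>"
    "<xmarc><f001>1</f001></xmarc>"
    "<comment>local data, not exported</comment>"
    "<xmarc><f001>2</f001></xmarc>"
    "</xmarcset>"
)
SINGLE_DOC = "<xmarc><f001>3</f001></xmarc>"


def find_records(root):
    """Return every xmarc element at or below root, in document order."""
    return list(root.iter("xmarc"))


set_ids = [r.findtext("f001") for r in find_records(ET.fromstring(SET_DOC))]
single_ids = [r.findtext("f001") for r in find_records(ET.fromstring(SINGLE_DOC))]
```

The `comment` element in the record set is simply never visited as a record, which matches the rule that all other elements are ignored.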
Within the xmarc element, only those child elements with the following names are exported: leader (parts of which must be re-calculated as part of the export) and fddd (where ddd is a three digit decimal number).
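The name test for exportable children can be written as a single pattern. The following Python sketch uses hypothetical names and sample data; it selects elements only and does not recalculate the leader.

```python
import re
import xml.etree.ElementTree as ET

# "f" followed by exactly three decimal digits.
FIELD_NAME = re.compile(r"f[0-9]{3}")


def exportable_children(record):
    """Return the children of an xmarc element that take part in the export."""
    return [
        child for child in record
        if child.tag == "leader" or FIELD_NAME.fullmatch(child.tag)
    ]


record = ET.fromstring(
    "<xmarc>"
    "<leader>00000nam a2200000 a 4500</leader>"   # hypothetical leader value
    "<f001>12345</f001>"
    "<localnote>kept in XML only</localnote>"
    "<f245><sa>Title</sa></f245>"
    "<f24x>not a valid field name</f24x>"
    "</xmarc>"
)
kept = [child.tag for child in exportable_children(record)]
```

Using `fullmatch` rather than `match` keeps names such as "f2450" from slipping through.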
For fields f001 through f009, all text data from each field is extracted (stripping any supplemental markup) and exported. Care must be taken with these fields in particular to ensure that XML white space processing does not alter their values. It is very important that the length of these control fields does not change inadvertently and that the data is not shifted to the left because of whitespace removal.
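The whitespace hazard is easy to demonstrate. In this Python sketch the field content and the helper name `control_field_text` are made up for illustration; the point is only that the text is concatenated exactly as stored and never trimmed.

```python
import xml.etree.ElementTree as ET


def control_field_text(element):
    """Concatenate all text inside a control field, dropping markup but keeping every blank."""
    return "".join(element.itertext())


# Hypothetical local control field whose value begins with two significant blanks
# and which has picked up some supplemental markup along the way.
field = ET.fromstring("<f009>  AB<em>  </em>12</f009>")

exported = control_field_text(field)   # blanks preserved, markup gone
trimmed = exported.strip()             # what a careless export would produce
```

Here `exported` is eight characters long while `trimmed` is only six, so every character position after the leading blanks would be shifted left by two in the careless version.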
For fields f010 through f999, only child elements with the following names are exported as part of that field: i1, i2, sa through sz, and s0 through s30 (in record order, of course).
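A possible filter for the children of a data field is sketched below in Python. The pattern is a direct transcription of the names listed above (i1, i2, sa through sz, s0 through s30); the function name and the sample field are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# i1, i2, sa-sz, and s0-s30 with no leading zeros.
DATA_CHILD = re.compile(r"i[12]|s[a-z]|s[0-9]|s[12][0-9]|s30")


def exportable_parts(field):
    """Return the names of the indicator and subfield children of a data field, in record order."""
    return [child.tag for child in field if DATA_CHILD.fullmatch(child.tag)]


field = ET.fromstring(
    "<f245>"
    "<i1>1</i1><i2>0</i2>"
    "<sa>Title</sa>"
    "<note>local element, skipped</note>"
    "<s9>x</s9><s30>y</s30><s31>z</s31><sA>w</sA>"
    "</f245>"
)
parts = exportable_parts(field)
```

Iterating over the element yields children in document order, so record order is preserved without any extra sorting.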
When exporting the content of the XMARC indicator and subfield elements, all of their text is exported without any additional XML markup that may exist in the record. For example, if subfield sa looked like this:
<sa>A link to <a href="...">some place</a> on the web.</sa>
Assuming that the HTML anchor element had been added during XMARC processing, this field would be exported without the element markup but still include its content, thus:
<sa>A link to some place on the web.</sa>
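The example above can be reproduced with a few lines of Python. This is only a sketch of the markup-stripping step, with a hypothetical function name, and it says nothing about how the resulting text is written into a MARC record.

```python
import xml.etree.ElementTree as ET


def exported_text(element):
    """Return all text inside an element, with any nested markup removed."""
    return "".join(element.itertext())


subfield = ET.fromstring('<sa>A link to <a href="...">some place</a> on the web.</sa>')
text = exported_text(subfield)
```

With the subfield shown above, `text` is "A link to some place on the web.", the anchor element having been dropped and its content kept.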