Physical Description information in Archivists’ Toolkit, ArchivesSpace, ICA-AtoM, and EAD3

We’ve been taking a closer look at how we create and encode Physical Description information. Our deployment of the Archivists’ Toolkit does not employ the multiple <extent> plugin, so we’re a little limited as to how we enter Physical Description information. We’ve been using the Toolkit “Container Summary” field for recording information about the containers (e.g., boxes and folders). The Toolkit exports this information as a simple <extent> note that appears immediately after the regular <extent> note (which provides information about the archival materials, not the containers).
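To make the two notes concrete, here is a minimal Python sketch of how they could be told apart in exported EAD. It assumes position is the only cue (the first <extent> describes the materials, the second is the Container Summary), leaves out the EAD namespace for brevity, and uses a made-up sample. split_extents is a hypothetical helper, not something the Toolkit provides.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the export described above (namespace omitted).
SAMPLE = """
<physdesc>
  <extent>2 m of textual records</extent>
  <extent>6 boxes</extent>
</physdesc>
"""

def split_extents(physdesc):
    """Return (materials_extent, container_summary) based on position alone."""
    texts = [(e.text or "").strip() for e in physdesc.findall("extent")]
    materials = texts[0] if len(texts) > 0 else None
    containers = texts[1] if len(texts) > 1 else None
    return materials, containers

materials, containers = split_extents(ET.fromstring(SAMPLE))
print(materials)
print(containers)
```

Relying on position is fragile, which is exactly why a more explicit marker in the export would be welcome.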

We’re watching the developments with ArchivesSpace, the open-source platform that will supersede the Toolkit later this year.  The ArchivesSpace team is developing a migration tool to migrate data from the Toolkit MySQL database (as well as the database that powers the Archon platform) into the ArchivesSpace MySQL database so users don’t have to import/export EAD or CSV accession records. This got me wondering about the destination of the Container Summary field in ArchivesSpace (see this ArchivesSpace user group thread for more info). We want to be able to distinguish between the two <extent> notes when we’re working with exported EAD and it appears the only way to do that will be to tweak the export routine in the Ruby source code.

At the same time, as we export our finding aids for publishing in our ICA-AtoM database, we’re noticing that AtoM properly handles the multiple <extent> notes nested within the first <physdesc> element. But we’re also noticing issues with how AtoM handles multiple Physical Description elements. Specifically, it does not import repeating <physdesc> elements and it does not import the <physfacet> element (see this AtoM user group thread for more info). We’re currently looking at whether we should revise our XSLT to merge multiple <physdesc> notes into one note or tweak the AtoM import code so it accepts repeating <physdesc> elements.
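The merge option can be sketched in a few lines. This is a rough Python illustration of the idea rather than our XSLT: it assumes namespace-free EAD, folds every later <physdesc> inside a <did> into the first one, and keeps any bare text by appending it after a semicolon. merge_physdesc is a hypothetical name and the sample data is made up.

```python
import xml.etree.ElementTree as ET

def merge_physdesc(did):
    """Fold every <physdesc> after the first into the first one, in place."""
    descs = did.findall("physdesc")
    if len(descs) < 2:
        return did
    first = descs[0]
    for extra in descs[1:]:
        text = (extra.text or "").strip()
        if text:
            # Keep bare text by attaching it after the content merged so far.
            if len(first):
                last = first[-1]
                last.tail = (last.tail or "").rstrip() + "; " + text
            else:
                parts = [(first.text or "").strip(), text]
                first.text = "; ".join(p for p in parts if p)
        for child in list(extra):
            first.append(child)  # child elements move across with their tails
        did.remove(extra)
    return did

did = ET.fromstring(
    "<did>"
    "<physdesc><extent>2 cm of textual records</extent></physdesc>"
    "<physdesc><extent>3 photographs</extent>"
    "<physfacet>black and white</physfacet></physdesc>"
    "</did>"
)
merge_physdesc(did)
print(ET.tostring(did, encoding="unicode"))
```

Note that this only addresses the repeating <physdesc> problem. Any <physfacet> content carried into the merged note would still need separate handling.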

All of this prompted me to make a post to the Encoded Archival Description list to see how other folks are encoding Physical Description information (see thread #7 in the list archives). This generated some helpful discussion but also brought up the pending revisions to Encoded Archival Description (EAD) and its complete overhaul of the <physdesc> elements (see Mike Rush’s response for more info).

The changes will really help us handle the complex physical description information that’s required for audiovisual material, but it will be some time before ArchivesSpace or ICA-AtoM supports EAD3. This is probably a good thing, as it will allow us more time to think about how the revisions to EAD will affect the display of the descriptive data. For example, under the Rules for Archival Description (RAD), the explanatory text allowed under <physdescstructured><descriptivenote> would not typically be nested within the data included in <physdescstructured>. One of RAD’s many idiosyncrasies is that it asks for descriptive notes relating to the physical description to be included in the “Notes Area,” which comes after the Physical Description Area and Archival Description Area (see Rule 1.8B9). I don’t think that rule is used all that often because, in practice, most people just include the note in the Physical Description Area, but it is there nonetheless. This will only become important if anyone attempts to design a stylesheet to display EAD3 data according to the structure prescribed in RAD; they would *technically* need to move the data contained in <physdescstructured><descriptivenote> and place it alongside other elements that appear outside of <did> (e.g., <originalsloc>, <otherfindaid>, etc.).

RAD clearly needs some revisions, but that is another story! The moving targets are making it a little difficult to develop procedures for creating and entering physical description information – do we use granular EAD elements or lump everything together into one <physdesc> note? Our database holds a bit of both, so we will likely need to account for various EAD scenarios when we work on our XSLT and AtoM database.

physfacet notes become physdesc notes during import

We’re currently in the middle of a project to convert several legacy MS Access databases into EAD so the data can be imported into the Toolkit.  First, a historical side note:

Our old procedures for creating finding aids looked something like:

  • Copy an existing MS Access database used to create finding aids (these are simple databases with two tables: one for series-level descriptions and one for file-level descriptions)
  • Change available series titles (these are pulled from a drop-down list)
  • Enter data
  • Create queries to pull all file-level descriptions for each series (these queries could be copied from database to database and edited as needed)
  • Create reports that merge the query results into basic EAD tags (these reports could be copied from database to database and edited as needed)
  • Save each report as a text file
  • Merge each report into a fonds-level EAD template (the fonds-level description must be prepared and pasted into the template before merging the Access reports)
  • Convert EAD into HTML using XSL transformation in Oxygen or XMetal
  • Mount HTML on Archives website

This worked well enough for a number of years, but you can see all the problems (duplicate data, onerous time commitment, inconsistent encoding practices, etc.). We’ve already migrated all our legacy EAD into the Toolkit, but there were a handful of large databases that never had EAD exported.   The data just resides in the MS Access database.

Migrating the legacy EAD alerted us to the many inconsistencies in our encoding practices, including:

  • Various “container types” and “extent types”
  • Dates in the title field
  • Incorrect encoding of subject headings

Rather than try to dust off the old procedures (which caused numerous problems when the EAD was imported into the Toolkit), we decided to hire a casual data technician to clean the data the right way.  It has been going really well, but the MS database for the imX Communications fonds (with 10,000+ records) threw a few curveballs.

The database has numerous fields for “Extent” and “Other Physical Details” information (mostly used for film and other a/v material). We thought: since we have someone who can properly parse this data, why not encode the different fields in proper EAD elements? By this logic, information that falls under the RAD “Other Physical Details” notes should be wrapped in <physfacet> tags (see this post about crosswalks for more info).

Unfortunately, it’s not so simple.  The import maps for the Archivists’ Toolkit do not allow for this level of specificity.  Data wrapped in <physfacet> tags are imported, but they are stored as Physical Description Notes, not “Other Physical Details” notes.  This means they could be imported as <physfacet> but exported as <physdesc>.

We are now looking into whether it would work to import these “Other Physical Details” notes with some kind of prefix that could be used in a post-import MySQL query that would change the notesEtcTypeID and chop off the prefix.  Stay tuned…
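As a thought experiment, the prefix idea might look like the Python sketch below. Everything here is hypothetical: the “[OPD] ” marker is invented for illustration, and the function only shows the classify-and-strip logic that a post-import query would need to reproduce against the real notes table.

```python
PREFIX = "[OPD] "  # hypothetical marker added to the note text before import

def classify_note(text, prefix=PREFIX):
    """Return (note_type, cleaned_text) for one imported note."""
    if text.startswith(prefix):
        # Marked notes are re-typed and the marker is chopped off.
        return "Other Physical Details", text[len(prefix):]
    return "Physical Description", text

print(classify_note("[OPD] 16 mm, colour, optical sound"))
print(classify_note("3 film reels"))
```

The approach only works if the prefix survives the import unchanged and never occurs naturally at the start of a note.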


Merging extent types

As we march along with our migration from MS Access to the Archivists’  Toolkit, we’ve been noticing that our list of extent types has become unwieldy.  Our legacy finding aids contained extent types like:

  • cm of textual record; 4 maps
  • 10 centimeters
  • 34 centimetres
  • 2 metres; 12 blueprints

When we migrated our legacy EAD, these terms were imported to AT and added to the list of available extent types. Additional terms were added to the list during routine accessioning and processing. Over time, the list grew to include redundant or incorrect extent types. Vague terms like “boxes” and “item” began appearing. A repository profile report confirmed the authority control problem by showing the total volume of each measurement divided between the various incarnations of the extent type (e.g., cm, centimetre, centimetres, etc.).
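To illustrate what the report was showing, here is a small Python sketch that folds spelling variants into one canonical unit and totals them. The variant map and the sample rows are made up for illustration; they are not taken from our repository profile report.

```python
from collections import defaultdict

# Hypothetical variant -> canonical unit map; extend as new variants turn up.
CANONICAL = {
    "cm": "cm", "centimetre": "cm", "centimetres": "cm",
    "centimeter": "cm", "centimeters": "cm",
    "m": "m", "metre": "m", "metres": "m", "meter": "m", "meters": "m",
}

def total_by_unit(rows):
    """Sum (extent_type, amount) rows under one canonical unit per variant."""
    totals = defaultdict(float)
    for extent_type, amount in rows:
        # Unknown terms (e.g. "boxes") are kept as-is so they stay visible.
        key = CANONICAL.get(extent_type.strip().lower(), extent_type)
        totals[key] += amount
    return dict(totals)

rows = [("cm", 10), ("centimeters", 10), ("centimetres", 34),
        ("metres", 2), ("boxes", 3)]
print(total_by_unit(rows))
```

Terms the map does not recognise are left untouched, so vague types such as “boxes” still show up and can be reviewed by hand.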

At the same time, our project to migrate AT finding aids to ICA-AtoM has helped to clarify some of our description practices. Seeing how physical description information is passed through our legacy finding aids into the AT and then into ICA-AtoM has done much to inform our approach to physical description.

So, I set out to fix the problem by creating a set of new, RAD-compliant extent types:

  • cm of textual records
  • cm of textual records and other material
  • cm of graphic material
  • cm of multiple media
  • m of textual records
  • m of textual records and other material
  • m of graphic material
  • m of multiple media

These extent types would address a related issue: how to enter the “specific material designation” in accordance with RAD (see Rules 1.5B1, 1.5B3, and 3.5B1).  All I had to do was merge the old, incorrect measurements into the new, correct measurements.  That is where things went terribly wrong.

The merger of the various “metre” extent types worked well, but the merger of the various “centimetre” extent types partly failed. Some of them worked, but I found that I had both “cm of textual records” and “cm of textual record.” I incorrectly thought the term should be “cm of textual record” (in fact, the phrase in Rule 3.5B1 is “textual records”), so I tried to merge the two terms and it did not work. It would have updated a substantial number of records (ca. 5,000). The merger finished, but I could still see “cm of textual records” in the main list view.

What was really troubling is that different terms would appear inside the open resource record.  Some records showed the correct term (see screenshot below), but others had blank extent types and others showed completely different terms (e.g., volumes, boxes, etc.).

Screenshot of AT resource record with incorrect extent type

So, there’s the problem.  It was happening everywhere – resource records, accession records, deaccession records, etc.  It was a little nerve wracking!  Were incorrect terms overwriting the extent types that failed to merge?  Did we save the incorrect or blank extent types when we opened and saved resource and accession records?

The answer is no. The incorrect and blank terms were strange, but they were not overwriting the extent types in records where the merger failed. Phew!

Here is the solution provided by our systems developer, who helped troubleshoot and fix the problem:

Step 1:

Back up your database with mysqldump.

Step 2:

-- all the accessionId where the extentType is 'cm of textual records'

SELECT accessionId FROM Accessions WHERE extentType='cm of textual records';

Step 3:

Copy the IDs into a program that supports regular-expression search and replace, such as TextPad, and use a regular expression to turn each ID into an update statement.

TextPad Example:

  1. Paste the IDs into TextPad, one ID per line
  2. Press F8 or go to Search->Replace
  3. The following values should be entered
    1. Find what: .*
    2. Replace with: UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='&';
    3. Conditions: Check the ‘Regular Expression’ checkbox
  4. Click Replace All

Step 4:

You should now have a list of update statements similar to…

UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='5';

UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='11';

UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='15';

UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='20';

UPDATE `ATK`.`Accessions` SET `extentType`='cm of textual record' WHERE `accessionId`='21';
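If you don’t have TextPad handy, the same list could be generated with a few lines of Python. This is only a sketch of the search-and-replace step: build_updates is a hypothetical helper, the schema, table, and column names are simply the ones used in the statements above, and the IDs are assumed to be plain integers pasted one per line.

```python
def build_updates(id_text, new_value, schema="ATK", table="Accessions",
                  id_column="accessionId"):
    """Turn whitespace-separated IDs into one UPDATE statement per ID."""
    safe_value = new_value.replace("'", "''")  # double any single quotes
    statements = []
    for token in id_text.split():
        record_id = int(token)  # raises ValueError on anything non-numeric
        statements.append(
            f"UPDATE `{schema}`.`{table}` SET `extentType`='{safe_value}' "
            f"WHERE `{id_column}`='{record_id}';"
        )
    return statements

pasted_ids = "5\n11\n15\n20\n21"
for statement in build_updates(pasted_ids, "cm of textual record"):
    print(statement)
```

Converting each ID with int() is a cheap guard against stray text ending up inside the generated SQL.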

Step 5:

Run your update statements in MySQL

Step 6:

Repeat Steps 2-5 for all affected tables.

In our case, we modified Accessions, Deaccessions, Resources, and ResourcesComponents. Of course, we’ll have to change the extent type back to “cm of textual records,” but at least we’ve identified a potentially recurring problem and how to solve it.
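In hindsight, the ID round trip might not be strictly necessary: one UPDATE per table that filters on the old extent type should have the same effect. The Python sketch below tries the idea against an in-memory SQLite database standing in for MySQL. Only the four table names and the extentType column come from the steps above; the id column, the sample rows, and the rename_extent_type helper are assumptions for illustration. As with the steps above, back up first.

```python
import sqlite3

TABLES = ["Accessions", "Deaccessions", "Resources", "ResourcesComponents"]

def rename_extent_type(conn, old, new, tables=TABLES):
    """Replace one extent type with another in every table; return row counts."""
    changed = {}
    for table in tables:
        # Table names come from the fixed list above, values are parameters.
        cur = conn.execute(
            f"UPDATE {table} SET extentType = ? WHERE extentType = ?",
            (new, old),
        )
        changed[table] = cur.rowcount
    conn.commit()
    return changed

# Stand-in database with a simplified, assumed layout.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, extentType TEXT)")
conn.execute(
    "INSERT INTO Accessions (extentType) "
    "VALUES ('cm of textual records'), ('m of textual records')"
)
print(rename_extent_type(conn, "cm of textual records", "cm of textual record"))
```

The per-table counts give a quick sanity check against the number of records you expected to change.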

Some observations on AT for RAD finding aids

A few more general observations about the AT based on our migration project:

  • AT does not allow for multiple titles, extent notes, and other kinds of information. It would be really nice if we could, for example, add two extent notes to one file-level description. We have to find clever ways to deal with files that have 2 cm of textual records and 3 photographs. The same goes for dates and titles. Brigham Young University has created a couple of plug-ins that provide this kind of functionality, but when I installed them, the new data entry fields covered over the existing extent and date information. Apparently ArchivesSpace will be providing something to this effect.
  • AT is not suitable for RAD item-level description. It doesn’t have a space for statements of responsibility, edition statements, publisher/manufacturer information, and the publisher’s series area.   Some of this information could be merged into another data field, or provided in a general note, but it wouldn’t export proper EAD.
  • A lot of EAD fields don’t export to the proper MARCXML field. Many of the most important fields crosswalk correctly, but the GMD, parallel title, other title information, statement of responsibility, material specific details, and other key information export to an incorrect MARC field.  This may not matter – we’ve apparently been able to import an AT generated MARCXML file with no problem, but we’re only doing fonds-level MARC records so the more complicated data just isn’t there.

We’re working on expanding this RAD crosswalk to include AT data entry fields and the actual EAD and MARCXML exported from the program.  So far, it’s been a big help identifying places where the program is not suitable for RAD finding aids (e.g., statements of responsibility).  But we’re not worried about what we’ve found so far because most of the issues deal with file-level description or obscure data.  I’ll post the crosswalk when it’s finished.