We’re currently in the middle of a project to convert several legacy MS Access databases into EAD so the data can be imported into the Toolkit. First, a historical side note:
Our old procedures for creating finding aids looked something like:
- Copy an existing MS Access database used to create finding aids (these are simple databases with two tables: one for series-level descriptions and one for file-level descriptions)
- Change available series titles (these are pulled from a drop-down list)
- Enter data
- Create queries to pull all file-level descriptions for each series (these queries could be copied from database to database and edited as needed)
- Create reports that merge the query results into basic EAD tags (these reports could be copied from database to database and edited as needed)
- Save each report as a text file
- Merge each report into a fonds-level EAD template (the fonds-level description must be prepared and pasted into the template before merging the Access reports)
- Convert EAD into HTML using XSL transformation in Oxygen or XMetal
- Mount HTML on Archives website
This worked well enough for a number of years, but the problems are easy to see (duplicate data, onerous time commitment, inconsistent encoding practices, etc.). We’ve already migrated all our legacy EAD into the Toolkit, but there were a handful of large databases that never had EAD exported. Their data still resides only in the MS Access databases.
Migrating the legacy EAD alerted us to the many inconsistencies in our encoding practices, including:
- Various “container types” and “extent types”
- Dates in the title field
- Incorrect encoding of subject headings
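As an illustration of the kind of cleanup involved, here is a minimal Python sketch for the "dates in the title field" problem. It assumes the date appears at the end of the title as a four-digit year or year range (e.g. "Correspondence, 1985-1990"); that pattern and the function name are hypothetical, not a description of our actual data or tooling.

```python
import re

# Assumed pattern (hypothetical): the title ends with a year or a year range,
# separated from the rest of the title by a comma and/or whitespace.
TRAILING_DATE = re.compile(
    r"^(?P<title>.*?)[,\s]+(?P<date>\d{4}(?:\s*-\s*\d{4})?)\s*$"
)

def split_title_and_date(raw_title):
    """Return (title, date); date is None when no trailing date is found."""
    match = TRAILING_DATE.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    # Normalize "1985 - 1990" to "1985-1990".
    date = re.sub(r"\s*-\s*", "-", match.group("date"))
    return match.group("title").strip(), date

print(split_title_and_date("Correspondence, 1985-1990"))
print(split_title_and_date("Meeting minutes"))
```

A rule like this will misfire on titles that legitimately end in a number, so the results would still need a human review before anything is written back.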
Rather than try to dust off the old procedures (which caused numerous problems when the EAD was imported into the Toolkit), we decided to hire a casual data technician to clean the data the right way. It has been going really well, but the MS Access database for the imX Communications fonds (with 10,000+ records) threw a few curveballs.
The database has numerous fields for “Extent” and “Other Physical Details” information (mostly used for film and other a/v material). We thought: since we have someone who can properly parse this data, why not encode the different fields in proper EAD elements? By this logic, information that falls under the RAD “Other Physical Details” notes should be wrapped in <physfacet> tags (see this post about crosswalks for more info).
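To make the idea concrete, here is a rough Python sketch of wrapping one record's extent and "Other Physical Details" values in EAD elements. The function name and the sample values are made up for illustration; only the <physdesc>, <extent>, and <physfacet> element names come from EAD itself.

```python
import xml.etree.ElementTree as ET

def build_physdesc(extent, other_physical_details):
    """Wrap an extent statement and any physical details in a <physdesc>."""
    physdesc = ET.Element("physdesc")
    if extent:
        ET.SubElement(physdesc, "extent").text = extent
    for detail in other_physical_details:
        if detail:  # skip empty database fields
            ET.SubElement(physdesc, "physfacet").text = detail
    return physdesc

# Hypothetical record; the real Access field names and values differ.
record = {"extent": "1 film reel", "details": ["16 mm", "col."]}
xml_text = ET.tostring(
    build_physdesc(record["extent"], record["details"]), encoding="unicode"
)
print(xml_text)
```

Building the elements with an XML library rather than string concatenation means characters such as "&" in the source data are escaped automatically.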
Unfortunately, it’s not so simple. The import maps for the Archivists’ Toolkit do not allow for this level of specificity. Data wrapped in <physfacet> tags are imported, but they are stored as Physical Description Notes, not “Other Physical Details” notes. This means they could be imported as <physfacet> but exported as <physdesc>.
We are now looking into whether we could import these “Other Physical Details” notes with some kind of prefix, then run a post-import MySQL query that finds the prefixed notes, changes their notesEtcTypeID, and chops off the prefix. Stay tuned…
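For the curious, here is a sketch of what that post-import fix might look like. It is untested against the Toolkit: it runs on an in-memory SQLite table so the logic can be tried safely, and the table name, the noteId and noteContent columns, the "OPD::" prefix, and the type ID values are all assumptions. Only notesEtcTypeID comes from the actual database.

```python
import sqlite3

PREFIX = "OPD::"   # hypothetical marker added to the notes before import
NEW_TYPE_ID = 2    # hypothetical ID for the "Other Physical Details" note type

# Stand-in for the Toolkit database; the real schema will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes ("
    "noteId INTEGER PRIMARY KEY, notesEtcTypeID INTEGER, noteContent TEXT)"
)
conn.executemany(
    "INSERT INTO notes (notesEtcTypeID, noteContent) VALUES (?, ?)",
    [(1, "OPD::16 mm, col., sd."), (1, "Some general physical description")],
)

def retype_prefixed_notes(conn, prefix, new_type_id):
    """Re-type notes that start with the prefix and strip the prefix off."""
    cur = conn.execute(
        "UPDATE notes "
        "SET notesEtcTypeID = ?, noteContent = SUBSTR(noteContent, ?) "
        "WHERE SUBSTR(noteContent, 1, ?) = ?",
        (new_type_id, len(prefix) + 1, len(prefix), prefix),
    )
    conn.commit()
    return cur.rowcount  # number of notes changed

changed = retype_prefixed_notes(conn, PREFIX, NEW_TYPE_ID)
rows = conn.execute(
    "SELECT notesEtcTypeID, noteContent FROM notes ORDER BY noteId"
).fetchall()
print(changed, rows)
```

The same UPDATE statement should carry over to MySQL with little change, since SUBSTR is available there as well, but the placeholder syntax depends on the driver, and we would run it on a backup copy first.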