Preserving Master Files in the IR


 

 Local Guidelines for uploading master files

 


 

General

 

Background

Our cultural heritage materials are scanned at high resolution, resulting in very large master files. Past practice for long-term storage has been to back up master files to an independent, secure file server outside the IR. Access to this server is heavily restricted, transferring files is cumbersome, and it can be difficult to track down a particular file because the relationship between the master file and the IR record is not always obvious. Furthermore, we receive a number of patron requests for copies of master-quality files, which makes prompt retrieval of these masters a priority for supporting user needs.

 

To better manage our master files, we are now transferring them to the IR environment for long-term management. Under this new storage procedure, access to masters is limited to library staff, and a custom ingest script has been developed to add master files to existing records (in lieu of re-ingesting the entire collection). Although this adds a very large amount of data to the IR environment, we do not anticipate any impact on IR performance, since the masters are hidden from the public and the search function searches metadata and extracted text, not the actual files.

 

Benefits

 

For more information, please see Presentation slides: Digital Preservation: Tales from the Precipice between theory and practice 

 

Recommended Master File Formats

Preference is for sustainable file formats that retain the greatest quality (note 1). It is also recommended to embed key metadata within the digital file itself. Please see more on embedding photometadata.

 

Media: Preferred formats

Audio:

- WAV and/or Broadcast Wave, 24-bit depth, 96 kHz sampling rate

Video (note 2):

- Motion JPEG 2000 (ISO/IEC 15444-3) (*.mj2), or

- AVI (uncompressed, motion JPEG) (*.avi), or

- QuickTime Movie (uncompressed, motion JPEG) (*.mov)

4:4:4 data sampling method

Images / Text (note 3):

- TIFF (uncompressed)

- JPEG2000 (lossless) (*.jp2)

24-bit RGB color depth, 300 ppi

Note: resolution rates may vary depending on physical source (note 1)

 

 

Note 1: Sustainability of Digital Formats: Planning for Library of Congress Collections. Caroline R. Arms and Carl Fleischhauer. http://www.digitalpreservation.gov/formats/

 

Note 2: Video or moving image "master" formats may vary greatly depending on a number of factors, including the quality of the source files, the sustainability of the source formats (e.g. codec), and an evaluation of the content for preservation (is access quality sufficient?). Each collection will need to be assessed in order to recommend a "master" version. For more information, please see Audio-Video processing workflow.

 

Note 3: Text-based materials are typically scanned as TIFF images (which become the "master" quality files); access versions are typically OCR’d PDFs converted from those TIFF files.

 

 

Master Ingest procedures

 

Batch Ingest (programming)

Batch ingests are performed by system admins, but require some initial data gathering and arrangement by collection curators.

 

 

Pre-steps

 

General Steps

 

  1. Arrange digital files:

    1. Master files may be organized into object subfolders (e.g. many master files per item, such as the pages of a book) or may be stored within one main directory (e.g. many single TIFF files, such as individual photographs).

    2. Batch ingest requires files to be ingested to a single collection. Therefore, arrange files into separate top-level groups, one per collection.

 

  2. Prepare a mapfile. A mapfile should contain two columns of data, matching the digital ID to the URI (handle).

 

               Example Mapfile Contents:

 

               

 

  3. Create a Lighthouse ticket to request a batch ingest from the DSpace programmers. Describe the arrangement of files (e.g. are files arranged in subfolders?), and provide the mapfile or its location on the server and the collection URL where these files should be added.

 

 

Mapfiles can be prepared in a number of ways. The examples below provide different methods for identifying master files and their corresponding digital object IDs. This information can later be used to validate that the master ingest was successful.

 

 

If the collection was created following current file management practices, then the data for a mapfile already exists as part of the item metadata record (Method 1). This may not be the case for legacy collections or collections created by vendors; please see Methods 2 or 3 for such cases. Methods 2 and 3 also produce file-level data that can help in troubleshooting cases where the total number of files ingested does not match the total number of master files on the server.

 

METHOD 1: IR metadata

 

  1. Export metadata from DSpace
  2. Extract two columns to a separate worksheet: dc.identifier.uri and dc.identifier.digital
  3. Convert dc.identifier.uri data to handle ID only (e.g. http://hdl.handle.net/1911/75412 to 1911/75412).
    1. TIP: Use Excel's Find and Replace command to replace "http://hdl.handle.net/" with nothing for the entire column.
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save to a *.txt file
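
Steps 2, 3 and 5 above can also be scripted. The following is a minimal sketch rather than the ingest team's actual tooling. It assumes the exported metadata is a CSV whose header row contains exactly dc.identifier.uri and dc.identifier.digital (an export may label these columns differently, so adjust as needed), and it assumes a tab-separated mapfile with the digital ID in the first column. The function name build_mapfile_lines and the digital ID wrc00001 are hypothetical.

```python
import csv
import io

HANDLE_PREFIX = "http://hdl.handle.net/"


def build_mapfile_lines(csv_text):
    """Return 'digital_id<TAB>handle' lines built from exported metadata CSV text."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = row["dc.identifier.uri"].strip()
        digital_id = row["dc.identifier.digital"].strip()
        if not uri or not digital_id:
            continue  # incomplete row: leave it out and investigate by hand
        # Strip the resolver prefix so only the handle (e.g. 1911/75412) remains.
        handle = uri[len(HANDLE_PREFIX):] if uri.startswith(HANDLE_PREFIX) else uri
        lines.append(f"{digital_id}\t{handle}")
    return lines


# Hypothetical one-row export; the digital ID is invented for illustration.
sample = (
    "id,dc.identifier.uri,dc.identifier.digital\n"
    "1,http://hdl.handle.net/1911/75412,wrc00001\n"
)
mapfile_lines = build_mapfile_lines(sample)
print("\n".join(mapfile_lines))
```

The number of lines returned can be checked against the item and folder counts in step 4 before the file is saved.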

 

 

METHOD 2: DOS commands (PC): 

 

  1. Run the DIR command for a list of master files per format
    1. Run command: DIR *.tif /s /b /a-d >output.txt 
      1. This will give a list of TIFF files (but not directories, /a-d) in all subdirectories (/s), listed in bare format (one full path per line with no heading or summary, /b), saved to a new file called output.txt. (see detailed steps here)
    2. The number of lines in output.txt is the total count of master files to be ingested (post ingest, check this number in the validation step below)
  2. Derive digital object IDs from filenames
    1. Create a pivot table to "summarize" digital IDs (the filename's prefix is the digital object ID)
  3. Add IR handle info. Look up the handle per digital object ID
    1. TIP: Use Excel's VLOOKUP function to match up object ID to handle from metadata exported from the IR
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save data to a *.txt file. (Two columns: digital object ID and handle)
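
For those who prefer a script to the DIR listing and pivot table, steps 1 and 2 might be sketched as follows. This is an illustration only: it assumes the digital object ID is the part of the filename before the first underscore (the actual prefix convention may differ by collection), and the function name and sample filenames are hypothetical.

```python
import os
import tempfile
from collections import Counter


def count_masters_by_object(root, extension=".tif", separator="_"):
    """Count master files per digital object ID under root, including subfolders."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extension):
                stem = os.path.splitext(name)[0]
                # Assumption: the ID is the text before the first separator.
                counts[stem.split(separator, 1)[0]] += 1
    return counts


# Demo in a throwaway directory with invented filenames.
with tempfile.TemporaryDirectory() as root:
    book = os.path.join(root, "wrc00001")
    os.makedirs(book)
    for path in (
        os.path.join(book, "wrc00001_001.tif"),
        os.path.join(book, "wrc00001_002.tif"),
        os.path.join(root, "wrc00002_001.tif"),
    ):
        open(path, "wb").close()
    counts = count_masters_by_object(root)

print(dict(counts), sum(counts.values()))
```

The per-object counts play the role of the pivot table, and their sum is the total number of master files to check after ingest.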

 

 

 

METHOD 3: Exiftool (PC or MAC): 

 

  1. Run command: c:\>exiftool -csv -r group >output.csv (see detailed steps here) and open the output in Excel.
  2. Create a pivot table to "summarize" the count of files (column: Filename) per folder (column: Directory)
  3. Add IR handle info. Look up the handle per digital object ID. (The filename's prefix number is the digital object ID.)
    1. TIP: Use Excel's VLOOKUP function to match up object ID to handle from metadata exported from the IR
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save data to a *.txt file. (Two columns: digital object ID and handle)
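
The pivot table in step 2 can be approximated with a short script. The sketch below assumes the exiftool CSV output contains a Directory column, as referenced in step 2; the function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def files_per_directory(exif_csv_text):
    """Count rows (files) per value of the Directory column in exiftool CSV text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(exif_csv_text)):
        counts[row["Directory"]] += 1
    return counts


# Invented sample standing in for the contents of output.csv.
sample = (
    "SourceFile,FileName,Directory\n"
    "group/wrc00001/wrc00001_001.tif,wrc00001_001.tif,group/wrc00001\n"
    "group/wrc00001/wrc00001_002.tif,wrc00001_002.tif,group/wrc00001\n"
    "group/wrc00002/wrc00002_001.tif,wrc00002_001.tif,group/wrc00002\n"
)
per_folder = files_per_directory(sample)
for folder, n in sorted(per_folder.items()):
    print(folder, n)
```

In practice the text would be read from output.csv rather than from an inline string.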

 

 

Post ingest : Validation step

 

Overview: Once the ingest is complete, ensure that the masters have been uploaded completely. Compare the data from the curation task "Count Masters" to a directory listing of the digital files taken from the server. If the counts do not match, investigate the differences and re-run the ingest for any missing masters.

 

DSPACE: Count of master files by item

is equal to

SERVER: Count of digital files by object

 

 

Data needed to verify all masters are in IR:

 

Run Curation Task: Count Masters

  1. Log into DSpace
  2. Navigate to Community/Collection/Item
  3. Under “Context”, select Edit Community/Collection/Item
  4. In the Edit screen, select “Curate”
  5. Under Task, select “Count Masters” option
  6. Press “Perform”
  7. Results will appear; copy/paste them into Excel.
    1. Tip: To parse this data, first add a character such as a comma just before the beginning of each handle (find 1911/ and replace with ,1911/), then use "Text to Columns" to separate the data into columns using the comma as the delimiter. Next, copy all the data and, in a new sheet, use "Paste Special -- Transpose" so that the data runs down column A in rows. Finally, use "Text to Columns" again, using the space between the handle and the number of masters as the delimiter.

 

 

Validation steps:

  1. Start by counting the number of handles (items) to make sure they are equal: the total number of handles in the curation task results should match the total number of items in the collection, meaning there are masters for every item.
  2. Go to the directory where the master files are stored and check that the total number of masters (e.g. search *.tif) in that directory matches the total number of files in the curation task results.
  3. If these numbers do not match, then you will need to compare on an item (handle to folder) basis:
    1. Run the exiftool -csv command on the master directory
    2. Create a pivot table to show the number of master files (mimetype) per folder (directory) 
    3. Use exported IR metadata to retrieve the digital ID number (e.g. wrc#) per handle and align it with the curation task data
    4. Then compare the pivot table (file server data) to the updated curation task results (from the previous step) to identify any disconnects by handle/folder
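
The item-by-item comparison in step 3 could be scripted along these lines. This is a sketch under stated assumptions: the three inputs are plain dictionaries already built from the curation task results, the server listing, and the mapfile, and the function name, digital IDs, and server counts are hypothetical.

```python
def find_mismatches(curation_counts, server_counts, handle_by_digital_id):
    """Return {handle: (count_in_dspace, count_on_server)} wherever the two differ."""
    # Re-key the server counts by handle using the mapfile data.
    # Digital IDs missing from the mapfile are ignored here; check them separately.
    server_by_handle = {
        handle_by_digital_id[digital_id]: n
        for digital_id, n in server_counts.items()
        if digital_id in handle_by_digital_id
    }
    mismatches = {}
    for handle in sorted(set(curation_counts) | set(server_by_handle)):
        in_dspace = curation_counts.get(handle, 0)
        on_server = server_by_handle.get(handle, 0)
        if in_dspace != on_server:
            mismatches[handle] = (in_dspace, on_server)
    return mismatches


# Invented digital IDs and server counts for illustration.
curation = {"1911/37604": 2, "1911/37438": 3}
server = {"wrc00001": 2, "wrc00002": 4}
mapping = {"wrc00001": "1911/37604", "wrc00002": "1911/37438"}
mismatches = find_mismatches(curation, server, mapping)
print(mismatches)
```

Each handle reported here is a candidate for the troubleshooting procedure: a server count higher than the DSpace count suggests masters that were not ingested.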

 

 

Troubleshooting errors

Identify which handles have reported mismatched numbers of master files: look in the directory and compare it to the bundle in DSpace to see what is missing. Create a new mapfile for the missing TIFFs and send it to the programmer for ingest.

 

 

Example: Results from Curation Task:

 

Notice

The task, file counts was completed with the status: Success.

 

  • Results:

  • 1911/37604 2

  • 1911/37438 3

  • 1911/37780 1

  • 1911/37779 1

  • 1911/36097 1

  • 1911/37932 20

  • 1911/37447 1

  • 1911/37321 1

  • 1911/37814 1

  • 1911/37645 16

 

 

total count = 47 files

Tip: You can copy/paste the above data into Excel and use “Text to Columns” to parse out the handle (item identifier) from the count.
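
As an alternative to parsing in Excel, the results text can be split into handles and counts with a short script. The sketch below assumes each handle has the form digits/digits and is followed by whitespace and its count, as in the example results above; the function name is hypothetical.

```python
import re

# A handle such as 1911/37604, then whitespace, then the number of masters.
HANDLE_COUNT = re.compile(r"(\d+/\d+)\s+(\d+)")


def parse_curation_results(text):
    """Return {handle: master_count} parsed from pasted curation task results."""
    return {handle: int(count) for handle, count in HANDLE_COUNT.findall(text)}


results = """
1911/37604 2
1911/37438 3
1911/37780 1
1911/37779 1
1911/36097 1
1911/37932 20
1911/37447 1
1911/37321 1
1911/37814 1
1911/37645 16
"""
counts = parse_curation_results(results)
print(len(counts), sum(counts.values()))  # prints: 10 47
```

The two printed numbers correspond to the item count and the total master count used in the validation steps.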

 

 

Delete Masters from Fonlibstor 

The final step is to delete the master files (and the Project Folder, if this ingest was a one-time upload). This is an important step, as it frees up needed space on the project server for processing new collections.

 

Manual ingests (individual file, GUI interface)

Suitable for single file upload or when there is a very small number of master files to be added.

 

  1. Log into DSpace (must have collection admin privileges)

  2. Navigate to item

  3. Under “Context”, select “Edit item”

  4. At the Edit Item Screen, select “Item bitstreams”

  5. Select “upload a new bitstream” (located at bottom of list)

  6. In the “Upload a new bitstream” screen

    1. change Bundle type to Masters

    2. browse to add file (see recommended file types above)

    3. Press Upload button

  7. You should receive the on-screen notice “The new bitstream was successfully uploaded.” and the file will appear under the Bundle: MASTER group in the Bitstream window.

  8. Delete Master files from Fonlibstor