Preserving Master Files in the IR

Page history last edited by Monica 7 years, 1 month ago Saved with comment

 

 Local Guidelines for uploading master files

 


 

General

 

Background

Our cultural heritage materials are scanned at high resolution, resulting in very large master files. Past practice for long-term storage has been to back up master files to an independent, secure file server outside the IR. Access to this server is heavily restricted, transferring files is cumbersome, and it can be difficult to track down a particular file because the relationship between the master file and the IR record is not always obvious. Furthermore, we receive a number of patron requests for copies of master-quality files, which makes prompt retrieval of these masters a priority for supporting user needs.

 

To better manage our master files, we are now transferring them to the IR environment for long-term management. Under this new storage procedure, access to masters is limited to library staff, and a custom ingest script has been developed to add master files to existing records (instead of re-ingesting the entire collection). Although this adds a very large amount of data to the IR environment, we do not anticipate any impact on IR performance, since the masters are hidden from the public and the search function searches metadata and extracted text, not the actual files.

 

Benefits

  • Systematically links master files directly to their metadata using the IR infrastructure (versus storing TIFFs loosely on a backup server)

  • Provides archivists with direct access to master versions on an as-needed basis

  • Includes masters in the DuraCloud preservation strategy (masters are deemed higher-risk assets because they are difficult and costly to re-create if lost or damaged)

 

For more information, please see Presentation slides: Digital Preservation: Tales from the Precipice between theory and practice 

 

Recommended Master File Formats

Preference is for sustainable file formats that retain the greatest quality (note 1). It is also recommended to embed key metadata within the digital file itself. Please see more on embedding photometadata.

 

Media: Preferred formats

Audio: WAV and/or Broadcast Wave, 24-bit depth, 96 kHz sampling rate

Video (note 2), 4:4:4 data sampling method:

- Motion JPEG 2000 (ISO/IEC 15444-3) (*.mj2), or

- AVI (uncompressed, motion JPEG) (*.avi), or

- QuickTime Movie (uncompressed, motion JPEG) (*.mov)

Images / Text (note 3), 24-bit RGB color depth, 300 ppi:

- TIFF (uncompressed)

- JPEG2000 (lossless) (*.jp2)

Note: resolution may vary depending on the physical source (note 1)

 

 

Note 1: Sustainability of Digital Formats: Planning for Library of Congress Collections. Caroline R. Arms and Carl Fleischhauer. http://www.digitalpreservation.gov/formats/

 

Note 2: Video or moving image "master" formats may vary greatly depending on a number of factors. These may include: quality of source files, sustainability of source formats (e.g. codec), evaluation of content for preservation (is access quality sufficient?), etc. Each collection will need to be assessed in order to recommend a "master" version. For more information, please see Audio-Video processing workflow

 

Note 3: Text-based materials are typically scanned as TIFF images (which become the "master" quality file); access versions are typically OCR’d PDFs converted from those TIFF files.

 

 

Master Ingest procedures

 

Batch Ingest (programming)

Batch ingests are performed by system admins, but require some initial data gathering and arrangement by collection curators.

 

 

Pre-steps

  • Confirm there are no duplicate object ids (e.g. check dc.identifier.digital/object folder to filename suffix)
  • Embed basic metadata into TIFF headers (e.g. Source, Rights, id and title). See detailed guidelines here: Embedded Image Metadata and Photometadata import script
  • Verify technical specs. All TIFFs should be uncompressed and use a common or tested color profile (e.g. sRGB, a local ICC profile (betterlight), or Adobe RGB (1998) (international standards)). Other color spaces may not be supported by downstream apps (e.g. the DSpace media filter that auto-generates thumbnails does not support SWOP/CMYK color spaces). TIP: use Exiftool tips to verify. Also see Using exiftool for QC process

 

General Steps

 

  1. Arrange digital files:

    1. Master files may be organized into object subfolders (e.g. many master files per item, such as the pages of a book) or stored within one main directory (e.g. many single TIFF files, such as individual photographs).

    2. Batch ingest requires all files in a run to be ingested into a single collection. Therefore, arrange files into separate top-level groups, one per collection.

 

  2. Prepare mapfile. A mapfile should contain two columns of data, matching the digital id to the uri (handle).

 

               Example Mapfile Contents:

 

               

    • Save mapfiles in *.txt format (using a single space to separate values)
    • The first column should be the digital id, followed by a space (not a tab) and then the handle.
    • No headers
    • Save the mapfile with a distinct name, such as “mapfile” dash “CollectionName/ID” (e.g. mapfile-HAAA.txt or mapfile-1911-36136.txt)
    • The mapfile should ONLY contain data for items with master files. For any items without masters, confirm this is documented in the field dc.digitization.specifications
    • Please prepare a mapfile for EACH collection. For communities composed of multiple collections, each sub-collection must have a separate mapfile.

 

  3. Create a Lighthouse ticket to request a batch ingest from the DSpace programmers. Describe the arrangement of files (e.g. are files arranged in subfolders?), and provide the mapfile or its location on the server and the collection URL where these files should be added.
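
The mapfile rules in step 2 can be checked before the ticket is filed. The following is a minimal sketch in Python, not part of the ingest script itself; the function name and the sample ids and handles are hypothetical and only illustrate the expected "digital id, single space, handle" layout.

```python
def check_mapfile_lines(lines):
    """Return (line_number, problem) tuples for lines that break the mapfile rules."""
    problems = []
    seen_ids = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            problems.append((number, "blank line"))
            continue
        if "\t" in line:
            problems.append((number, "contains a tab; use a single space"))
            continue
        fields = line.split(" ")
        if len(fields) != 2 or not all(fields):
            problems.append((number, "expected two values separated by one space"))
            continue
        digital_id, handle = fields
        # A header row such as "id handle" is caught here as well.
        if "/" not in handle:
            problems.append((number, "second value does not look like a handle"))
        if digital_id in seen_ids:
            problems.append((number, "duplicate digital id " + digital_id))
        seen_ids.add(digital_id)
    return problems


# Hypothetical mapfile contents (illustrative values only).
example = [
    "wrc00001 1911/00001",
    "wrc00002\t1911/00002",  # tab instead of a space
    "wrc00001 1911/00003",   # digital id repeated
]
for number, problem in check_mapfile_lines(example):
    print(number, problem)
```

An empty result means every line has exactly two space-separated values, the second looks like a handle, and no digital id is repeated.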

 

 

Mapfiles can be prepared in a number of ways. The examples below provide different methods for identifying master files and their corresponding digital object ids. This information can later be used to validate that the master ingest was successful.

 

  • Typically the digital identifier is the Prefix number of the actual filename (e.g. wrc00356 | wrc00356_001.tif)

 

If the collection was created following current file management practices, then the data for a mapfile already exists as part of the item metadata record (Method 1). This may not be the case for legacy collections or collections created by vendors; please see Methods 2 or 3 for such cases. Methods 2 and 3 also produce file-level data that can help troubleshoot cases where the total number of files ingested does not match the total number of master files on the server.

 

METHOD 1: IR metadata

 

  1. Export metadata from DSpace
  2. Extract two columns to a separate worksheet: dc.identifier.uri and dc.identifier.digital
  3. Convert dc.identifier.uri data to handle id only (e.g. http://hdl.handle.net/1911/75412 to 1911/75412).
    1. TIP: Use Excel's Find and Replace command to replace "http://hdl.handle.net/" with nothing for the entire column.
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save to a *.txt file (two columns: digital object ID, then handle)
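
For larger collections, the spreadsheet steps of Method 1 can also be scripted. This is a rough sketch, assuming the DSpace export is a CSV whose column headers are exactly dc.identifier.uri and dc.identifier.digital (real exports may add language suffixes to the headers, in which case pass the actual names). The function name and sample row are hypothetical.

```python
import csv
import io

HANDLE_PREFIX = "http://hdl.handle.net/"


def mapfile_lines_from_export(csv_text,
                              uri_column="dc.identifier.uri",
                              id_column="dc.identifier.digital"):
    """Build 'digital_id handle' lines from exported metadata CSV text."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = (row.get(uri_column) or "").strip()
        digital_id = (row.get(id_column) or "").strip()
        if not uri or not digital_id:
            continue  # skip items missing a handle or a digital id
        # Same effect as the Find and Replace tip: drop the resolver prefix.
        if uri.startswith(HANDLE_PREFIX):
            handle = uri[len(HANDLE_PREFIX):]
        else:
            handle = uri
        lines.append(digital_id + " " + handle)
    return lines


# Hypothetical export with one item (illustrative values only).
sample = (
    "id,dc.identifier.uri,dc.identifier.digital\n"
    "1,http://hdl.handle.net/1911/00001,wrc00001\n"
)
print("\n".join(mapfile_lines_from_export(sample)))
```

The number of lines returned should still be compared against the object folder count and the item count in DSpace, as in step 4.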

 

 

METHOD 2: DOS commands (PC): 

 

  1. Run the DIR command for a list of master files per format
    1. Run command: DIR *.tif /s /b /a-d >output.txt 
      1. This lists the TIFF files (but not directories, /a-d) in all subdirectories (/s), in bare format (/b: no header or summary lines; combined with /s, each line is the file's full path), saved to a new file called output.txt. (see detail steps here)
    2. This also gives a total count of master files to be ingested (post ingest, check this number in the validation step below)
  2. Derive digital object ids from filenames
    • Many individual files may make up a single digital object. For example, object wrc02770 is made up of these 5 files: wrc02770_001intro.mov, wrc02770_002.mov, wrc02770_003.mov, wrc02770_005.mov, wrc02770_004.mov
    • From the filename list prepared in the step above, parse out the digital object id from each filename (e.g. wrc02770 | wrc02770_002.mov).
      • TIP: In Excel, use the "Text to Columns" wizard on the Data tab to separate filenames, using underscore as the delimiter.
    1. Create a pivot table to "summarize" digital IDs (the filename's prefix is the digital object ID)
  3. Add IR handle info. Look up the handle for each digital object ID
    1. TIP: use the Excel VLOOKUP function to match each object id to its handle from the metadata exported from the IR
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save data to a *.txt file. (Two columns: digital object ID and handle)
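
The "Text to Columns" and pivot table steps of Method 2 amount to splitting each filename at the first underscore and counting files per prefix. A minimal Python sketch of that idea follows; the function name and the C:\masters folder are hypothetical, while the filenames are the wrc02770 example from step 2.

```python
import ntpath
from collections import Counter


def count_files_per_object(paths):
    """Count files per digital object id, taken as the filename prefix."""
    counts = Counter()
    for path in paths:
        filename = ntpath.basename(path.strip())
        if not filename:
            continue  # skip blank lines in the DIR listing
        stem = ntpath.splitext(filename)[0]
        object_id = stem.split("_", 1)[0]  # text before the first underscore
        counts[object_id] += 1
    return counts


# Lines as they might appear in output.txt (folder name is hypothetical).
listing = [
    r"C:\masters\wrc02770\wrc02770_001intro.mov",
    r"C:\masters\wrc02770\wrc02770_002.mov",
    r"C:\masters\wrc02770\wrc02770_003.mov",
    r"C:\masters\wrc02770\wrc02770_005.mov",
    r"C:\masters\wrc02770\wrc02770_004.mov",
]
for object_id, total in count_files_per_object(listing).items():
    print(object_id, total)
```

The per-object totals are the same figures the pivot table produces, and their sum is the total master file count to check after ingest.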

 

 

 

METHOD 3: Exiftool (PC or MAC): 

 

  1. Run command: c:\>exiftool -csv -r group >output.csv (see detail steps here) and open the result in Excel.
  2. Create a pivot table to "summarize" the count of files (column: Filename) per folder (column: Directory)
  3. Add IR handle info. Look up the handle for each digital object ID (the filename's prefix number is the digital object ID).
    1. TIP: use the Excel VLOOKUP function to match each object id to its handle from the metadata exported from the IR
  4. Confirm that the count of total items in the spreadsheet is equal to the total number of object folders on the server and also to the total number of items in DSpace.
    1. If any of these counts do not match, you may be missing data and need to investigate.
  5. Save data to a *.txt file. (Two columns: digital object ID and handle)
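
The pivot table in step 2 of Method 3 can be approximated in a few lines of Python. This sketch assumes the exiftool CSV contains a column named Directory, which should be checked against the header row of the actual output.csv; the function name and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def files_per_directory(csv_text, directory_column="Directory"):
    """Count rows (files) per folder in exiftool -csv output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[directory_column] for row in reader)


# Hypothetical excerpt of output.csv (illustrative values only).
sample = (
    "SourceFile,FileName,Directory\n"
    "group/wrc00001/wrc00001_001.tif,wrc00001_001.tif,group/wrc00001\n"
    "group/wrc00001/wrc00001_002.tif,wrc00001_002.tif,group/wrc00001\n"
    "group/wrc00002/wrc00002_001.tif,wrc00002_001.tif,group/wrc00002\n"
)
for folder, total in sorted(files_per_directory(sample).items()):
    print(folder, total)
```

The number of distinct folders is the object count to compare in step 4.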

 

 

Post ingest : Validation step

 

Overview: Once the ingest is complete, ensure the masters have been uploaded completely. Compare the data from the curation task (Count Masters) to a directory listing of the digital files taken from the server. If the counts do not match, investigate the differences and re-run the ingest for any missing masters.

 

DSPACE: Count of master files by item

is equal to

SERVER: Count of digital files by object

 

 

Data needed to verify all masters are in IR:

  • Curation task results
  • Exported metadata (needed only if there is a problem)
  • exiftool data  (needed only if there is a problem)

 

Run  Curation Task : Count Masters

  1. Log into DSpace
  2. Navigate to Community/Collection/Item
  3. Under “Context”, select Edit Community/Collection/Item
  4. In the Edit screen, select “Curate”
  5. Under Task, select “Count Masters” option
  6. Press “Perform”
  7. Results will appear; copy/paste them into Excel.
    1. Tip: To parse this data, first add a character such as a comma just before the beginning of each handle (find 1911/ and replace with ,1911/). Use "Text to Columns" to separate the data into columns using the comma as the delimiter. Then copy all the data and, in a new sheet, use "Paste Special -- Transpose" so that it runs down column A, one handle per row. Finally, use "Text to Columns" again, using the space between the handle and the number of masters as the delimiter.

 

 

Validation steps:

  1. Start by counting the number of handles (items) to make sure they are equal (the total number of handles in the curation task results matches the total number of items in the collection, meaning there are masters for every item).
  2. Go to the directory where the master files are stored and check that the total number of masters (e.g. search *.tif) in that directory matches the total number of files in the curation task results.
  3. If these numbers do not match, you will need to compare on an item (handle to folder) basis:
    1. Run the exiftool -csv command on the master directory
    2. Create a pivot table to show the number of master files (mimetype) per folder (directory)
    3. Use the exported IR metadata to retrieve the digital id number (e.g. wrc#) per handle and align it with the curation task data
    4. Then compare the pivot table (file server data) to the updated curation task results (from the previous step) to identify any disconnects by handle/folder
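
The item-level comparison in step 3 can be sketched as a lookup across three tables: curation task counts keyed by handle, server file counts keyed by digital object id, and the mapfile linking the two. The Python below is only an illustration of that comparison; the function name, object ids, and server counts are hypothetical.

```python
def find_mismatches(curation_counts, server_counts, id_to_handle):
    """Return (handle, object_id, count_in_dspace, count_on_server) for disagreements."""
    mismatches = []
    for object_id, handle in sorted(id_to_handle.items()):
        in_dspace = curation_counts.get(handle, 0)
        on_server = server_counts.get(object_id, 0)
        if in_dspace != on_server:
            mismatches.append((handle, object_id, in_dspace, on_server))
    return mismatches


# Illustrative data: object ids and server counts are made up.
curation_counts = {"1911/37604": 2, "1911/37438": 3}  # handle -> masters in DSpace
server_counts = {"objA": 2, "objB": 4}                # object id -> files on server
id_to_handle = {"objA": "1911/37604", "objB": "1911/37438"}  # from the mapfile

for row in find_mismatches(curation_counts, server_counts, id_to_handle):
    print(row)
```

Each row returned identifies a handle and folder pair to inspect by hand.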

 

 

Troubleshooting errors

Identify which handles have reported mismatched numbers of master files: look in the directory and compare it to the bundle in DSpace to see what is missing. Create a new mapfile for the missing TIFFs and send it to the programmer for ingest.

 

 

Example: Results from Curation Task:

 

Notice

The task, file counts was completed with the status: Success.

 

  • Results:

  • 1911/37604 2

  • 1911/37438 3

  • 1911/37780 1

  • 1911/37779 1

  • 1911/36097 1

  • 1911/37932 20

  • 1911/37447 1

  • 1911/37321 1

  • 1911/37814 1

  • 1911/37645 16

 

 

total count = 47 files

Tip: copy/paste the above data into Excel and use “Text to Columns” to parse out the handle (item identifier) from the count.
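
As an alternative to parsing in Excel, the results can be split with a short script. This Python sketch assumes each result is pasted on its own line in the form "handle count", as in the example above; the function name is hypothetical.

```python
def parse_curation_results(text):
    """Map each handle to its master file count from pasted task results."""
    counts = {}
    for line in text.splitlines():
        parts = line.replace("•", " ").split()
        # Keep only lines shaped like "1911/37604 2".
        if len(parts) == 2 and "/" in parts[0] and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


results = """
  • 1911/37604 2
  • 1911/37438 3
  • 1911/37780 1
  • 1911/37779 1
  • 1911/36097 1
  • 1911/37932 20
  • 1911/37447 1
  • 1911/37321 1
  • 1911/37814 1
  • 1911/37645 16
"""
counts = parse_curation_results(results)
print(len(counts), "items,", sum(counts.values()), "files")  # 10 items, 47 files
```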

 

 

Delete Masters from Fonlibstor 

The final step is to delete the master files (and the Project Folder, if this ingest was a one-time upload). This is an important step, as it frees up needed space on the project server for processing new collections.

 

Manual ingests (individual file, GUI interface)

Suitable for single file upload or when there is a very small number of master files to be added.

 

  1. Log into DSpace (must have collection admin privileges)

  2. Navigate to item

  3. Under “Context”, select “Edit item”

  4. At the Edit Item Screen, select “Item bitstreams”

  5. Select “upload a new bitstream” (located at bottom of list)

  6. In the “Upload a new bitstream” screen

    1. change Bundle type to Masters

    2. browse to add file (see recommended file types above)

    3. Press Upload button

  7. You should receive the on-screen notice “The new bitstream was successfully uploaded.” and the file will appear under the Bundle: MASTER group in the Bitstream window.

  8. Delete Master files from Fonlibstor

     
