Recommended file formats



 

Chart

Media: Audio

Preferred Master:
- WAV and/or Broadcast Wave, at 24-bit depth and 96 kHz sampling rate

Access version:
- MP3 (streaming format for quicker access and copyright issues)

Media: Video (under review)

Preferred Master:
- Motion JPEG 2000 (ISO/IEC 15444-3) (*.mj2), or
- AVI (uncompressed, motion JPEG) (*.avi), or
- QuickTime Movie (uncompressed, motion JPEG) (*.mov)
4:4:4 data sampling method

Access version:
- *.mpeg, wrapped in AVI, MOV
- MPEG-4 (H.263, H.264) (*.mp4, wrapped in AVI, MOV)

Media: Images (*)

Preferred Master:
- TIFF (uncompressed)
- JPEG2000 (lossless) (*.jp2)
At 24-bit RGB color depth, 300 ppi, scanned at actual size, OR, for visual arts materials, captured at 3000 to 5000 pixels on the long side

Access version:
- JPEG/JFIF (*.jpg)
- JPEG2000 (lossy) (*.jp2)
- GIF (*.gif)
- BMP (*.bmp)
- PNG (*.png)

Media: Text (**)

Preferred Master:
- Plain text (encoding: US-ASCII, UTF-8, UTF-16 with BOM)
- XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)

Access version:
- Plain text (ISO 8859-1 encoding)
- PDF (*.pdf) (embedded fonts) and OCR’d
- Rich Text Format 1.x (*.rtf)
- HTML (include a DOCTYPE declaration; must be “self-contained, non-dynamic HTML documents”); see the Fondren Perma.cc libguide
- PDF/A
- PDF/X
- OpenOffice (*.sxw/*.odt)

Media: Spreadsheet/Database (***)

Preferred Master:
- Character delimited text (ASCII or Unicode UTF-8 preferred)

Access version:
- Comma Separated Values (*.csv)
- Delimited Text (*.txt)
- Excel (*.xlsx)
- DBF (*.dbf)
- OpenOffice (*.sxc/*.ods)

Media: Presentation

Preferred Master: (none listed)

Access version:
- OpenOffice (*.sxi/*.odp)
- PowerPoint (*.ppt)
- PDF (*.pdf) (embedded fonts) and OCR’d
- PDF/A
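
As an illustration of how the audio master targets listed above (24-bit depth, 96 kHz sampling rate) might be checked before deposit, here is a minimal sketch using Python's standard-library `wave` module. The function name `check_wav_master` and the constant names are hypothetical and not part of any RUDSA tooling. Note also that the `wave` module reads only PCM WAV files, and older Python versions may reject files written with the WAVE_FORMAT_EXTENSIBLE header, so treat this as a starting point rather than a complete validator.

```python
import wave

# Target values taken from the audio master recommendation above.
TARGET_BIT_DEPTH = 24
TARGET_SAMPLE_RATE_HZ = 96_000


def check_wav_master(path):
    """Return a list of problems; an empty list means the file meets both targets.

    Hypothetical helper for illustration only.
    """
    with wave.open(path, "rb") as wav:
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        sample_rate = wav.getframerate()

    problems = []
    if bit_depth != TARGET_BIT_DEPTH:
        problems.append(f"bit depth is {bit_depth}, expected {TARGET_BIT_DEPTH}")
    if sample_rate != TARGET_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {TARGET_SAMPLE_RATE_HZ} Hz"
        )
    return problems
```

Calling `check_wav_master("interview.wav")` on a hypothetical file returns an empty list when both targets are met, or one message per mismatch otherwise.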

 

Because Rice University’s Digital Scholarship Archive (RUDSA) has a broad goal of collecting materials that support the university’s research and scholarship mission, it is unclear what formats these materials may ultimately take. Given that RUDSA already holds heterogeneous types of materials, a mixed approach to file format support is warranted. The above list of recommended file formats is not intended to be exhaustive; rather, it focuses on existing repository content and reasonably expected content. Additional formats may be deemed appropriate on an as-needed basis. Please consult staff from the Digital Scholarship Services department with any questions.

 

Notes:

(*) The specification of 24-bit RGB color depth at 300 ppi, scanned at actual size, is a general standard that provides high-quality images for most archival documents. However, there are many special cases where different specs are required, based on the source document or when producing for print publication, e.g. slides, oversized maps, and small printed items (newspapers). For additional specifications based on the source document, and for guidelines, see Quality control checks for images.

 

Case in point: Library of Congress digital preservation resources recommend 300 dpi/ppi for 4×8, 5×7 and 8×10 photos. This is an example of how the source material (a photograph) and its original size affect the targeted specs. For an interesting discussion of resolution settings, see the LOC blog post: You Say You Want a Resolution: How Much DPI/PPI is Too Much? | The Signal: Digital Preservation
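
To make the relationship between physical size, ppi, and pixel dimensions concrete, the arithmetic can be sketched as follows. The function names are hypothetical, and the 1.0 × 1.5 inch item is an assumed example of a small source, not a measurement taken from these guidelines.

```python
def pixel_dimensions(width_in, height_in, ppi=300):
    """Pixel dimensions produced by scanning an item at actual size."""
    return round(width_in * ppi), round(height_in * ppi)


def ppi_for_long_side(width_in, height_in, target_long_side_px):
    """Scanning resolution needed to reach a target pixel count on the long side."""
    long_side_in = max(width_in, height_in)
    return target_long_side_px / long_side_in


# An 8 x 10 inch photograph scanned at 300 ppi.
print(pixel_dimensions(8, 10))           # (2400, 3000)

# The same photograph reaches 3000 pixels on the long side at 300 ppi.
print(ppi_for_long_side(8, 10, 3000))    # 300.0

# An assumed 1.0 x 1.5 inch item needs a much higher resolution
# to reach the same 3000-pixel long side.
print(ppi_for_long_side(1.0, 1.5, 3000)) # 2000.0
```

This is why a single ppi figure does not suit every source: small originals need a higher scanning resolution to reach the same pixel count, while oversized items may need less.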

 

(**) Text-based documents may be scanned as high-resolution TIFFs and then converted to OCR'd documents for access, so master versions of text-based items may include only the TIFF files.

 

General rule of thumb: For text-based PDFs with TIFF masters, target 150 ppi for the access version. For image-based PDFs, target 300 ppi for the access version.

 

(***) All data types should include a data dictionary of significant properties.

● Spreadsheets should be ‘self-describing’ and able to exist independently. Create meaningful row and column headings, and describe the units used within the spreadsheet. Use controlled vocabularies and established word lists where possible for data entry to ensure consistency and clarity of the data (ADS, 2009).

● The Library of Congress (2020-2021) recommends copying macros or removing them entirely for long-term preservation.
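
One way to approach the data dictionary and the "self-describing" recommendation is to keep a small dictionary file alongside each delimited-text dataset and check that every column is documented. The sketch below assumes a dictionary laid out as CSV with `column`, `description`, and `unit` fields. That layout, the sample column names, and the function name `undocumented_columns` are illustrative assumptions, not a RUDSA requirement.

```python
import csv
import io

# Hypothetical dataset: meaningful column headings, with units in the names.
DATA_CSV = (
    "site_id,sample_date,depth_cm\n"
    "A1,2021-05-01,12.5\n"
    "A2,2021-05-02,30.0\n"
)

# Hypothetical data dictionary; note that depth_cm has no entry yet.
DICTIONARY_CSV = (
    "column,description,unit\n"
    "site_id,Identifier of the sampling site,none\n"
    "sample_date,Date the sample was taken (ISO 8601),none\n"
)


def undocumented_columns(data_csv_text, dictionary_csv_text):
    """Return the data column names that have no entry in the data dictionary."""
    data_header = next(csv.reader(io.StringIO(data_csv_text)))
    documented = {
        row["column"] for row in csv.DictReader(io.StringIO(dictionary_csv_text))
    }
    return [name for name in data_header if name not in documented]


print(undocumented_columns(DATA_CSV, DICTIONARY_CSV))  # ['depth_cm']
```

A check like this can be run before deposit so that missing descriptions or units are caught while the data creator is still available to supply them.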

 

 

Master formats

From a digital preservation perspective, it is preferable to have high-resolution master versions of all digital resources. When new derivative file formats are invented or become prevalent, new access versions can be created from these high-quality source files. For some file types, the master and access versions may be the same (e.g. PDFs, JP2).

 

Preferred formats

To ensure the enduring accessibility of digital resources, file formats deemed sustainable over the long term are preferred. A generally accepted guideline for selecting file formats for long-term access is Caroline R. Arms and Carl Fleischhauer’s “Sustainability of Digital Formats” [1]. It lists general factors that help determine the stability and continued usefulness of a file format, including disclosure, adoption, transparency, self-documentation, external dependencies, impact of patents, and technical protection mechanisms.

 

These broad-based factors, together with local considerations such as context, functionality, and specific user needs and expectations for a particular collection, help determine the preferred formats listed in the table above.

 

Technical Metadata

 

References

  1. Arms, Caroline and Carl Fleischhauer. Digital Formats: Factors for Sustainability, Functionality, and Quality. IS&T Archiving 2005 Conference, Washington, D.C. memory.loc.gov/ammem/techdocs/digform/Formats_IST 05_paper.pdf (also see http://www.digitalpreservation.gov/formats/)

 

  2. Brown, Adrian. “Selecting File Formats for Long-Term Preservation.” London: The National Archives (June 19, 2003) http://www.nationalarchives.gov.uk/documents/selecting_file_formats.pdf

 

  3. De Vorsey, Kevin and Peter McKinney. “National Digital Heritage Archive, Digital Preservation in Capable Hands: Taking Control of Risk Assessment at the National Library of New Zealand.” Information Standards Quarterly (ISQ), Volume 22, Issue 2, Special Issue: Digital Preservation (Spring 2010). http://www.niso.org/publications/isq/2010/v22no2

 

 

Additional Guidelines 

Library of Congress: Frequently Asked Questions (FAQ) for Digital Scan Services. http://www.loc.gov/duplicationservices/customer-service/faq/ 

 

ALCTS Minimum Digitization Capture Recommendations. The Association for Library Collections and Technical Services Preservation and Reformatting Section (2013) http://www.ala.org/alcts/resources/preserv/minimum-digitization-capture-recommendations

 

FADGI Guidelines: Technical Guidelines for Digitizing Cultural Heritage Materials. Federal Agencies Digitization Guidelines Initiative Still Image Working Group (Draft 2015. Prior report 2010) http://www.digitizationguidelines.gov/guidelines/digitize-technical.html 

 

Preserving Data Types Series from Artefactual Systems and the Digital Preservation Coalition:

 

The Significant Properties of Spreadsheets (OPF AIG Final Report). 2021.  doi: 10.5281/zenodo.5468116.