Digitizing for IR deposit (PDF access file)

Page history last edited by Monica 1 year, 9 months ago

This document provides key characteristics and tips for creating high-quality PDFs from a printed book chapter, journal article, or thesis.

 

Although it is preferable to place the original born-digital document online, in some situations this is not possible: for example, when there is no digital version, when the electronic version is obsolete or cannot be easily converted, or when copyright concerns prevent using pre-existing electronic versions. For these reasons and others, it may be necessary to scan a physical hard copy (owned by the library) in order to place materials online in the institutional repository.

 


 

Process overview

The process is based on testing of local scanning equipment and on shared practices from the field. Pages are scanned as tiff images, then converted to PDF and OCR'd. Scanning high-quality page images and then downsampling produces a much higher-quality PDF than scanning directly to PDF.

 

The primary purpose is re-use and readability of the text. Text can also be exported and converted to alternative formats directly from the PDF for possible future migration.

 

No master tiff page images are retained long term. This is NOT an archival scanning method. (For an archival approach, see Quality control checks for archival images.)

 

Software needed: Photoshop, Adobe Acrobat Professional

 

Stages

The following information is intended as a checklist. Step-by-step instructions are created for each piece of equipment. Contact DSS staff for details.

 

Scanning

  1. Scan each page as a single uncompressed tiff file (see Filenaming below)

  2. Lock page dimensions so all images are exactly the same width x height (do not include borders)

  3. Scan text and line art pages at 600 ppi, B/W

  4. Scan pages with any color artwork or photographs at 400 ppi or higher, 24-bit depth (color)

  • Do not scan blank pages
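As a rough illustration of what these scan settings mean for uncompressed tiff sizes, the arithmetic can be sketched as below. The 8.5 x 11 inch page size is an assumption chosen for the example (actual page dimensions will vary), and the function name is hypothetical. The figures cover image data only, not the tiff header.

```python
def scan_size(width_in, height_in, ppi, bits_per_pixel):
    """Return (pixel_width, pixel_height, uncompressed_bytes) for one page."""
    px_w = round(width_in * ppi)
    px_h = round(height_in * ppi)
    # Each row is padded to a whole number of bytes, which matters for 1-bit scans.
    row_bytes = (px_w * bits_per_pixel + 7) // 8
    return px_w, px_h, row_bytes * px_h


# Assumed page size: 8.5 x 11 inches (illustrative only).
bw = scan_size(8.5, 11, 600, 1)       # text and line art, 600 ppi B/W
color = scan_size(8.5, 11, 400, 24)   # color page, 400 ppi, 24-bit

for label, (w, h, size) in (("B/W", bw), ("Color", color)):
    print(f"{label}: {w} x {h} px, about {size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions, a color page at the lower resolution is still roughly ten times larger than a B/W text page, which is one reason to reserve color scanning for pages that need it.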

 

Post editing

Perform these steps in Photoshop, as needed.

 

  • Straighten pages (Analysis>Ruler tool | Image>Image rotation>arbitrary)

  • Remove any noise such as black lines, shadows, etc. (Marquee tool + Delete; Fill contents = white, normal)

  • Adjust any bent or warped text caused by the curve of the book (Edit>Transform>Skew)

  • Restore any faded text (magic wand or Filter>other>minimum; Radius = 1 or 2 pixels)

 

For color pages

  • Crop any photographs and save as separate tiff files (will re-insert as last step)

  • Remove any text bleed-through by saving page to B/W (Image>mode>bitmap; Method=50% threshold)

  • Follow steps for text above if needed

  • Save tiff as RGB (Image>mode>RGB)

  • Remove any moire pattern on image insert (Filter>Gaussian blur; Radius = 1 or 2 pixels)

  • Remove image from page (creates box for insert) and insert cropped image (File>Place)
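To show why a 50% threshold removes bleed-through, here is a minimal sketch of the idea on a few hypothetical 8-bit gray values. It is not Photoshop's implementation: the cutoff of 128 and the treatment of a pixel exactly at the cutoff are assumptions made for illustration.

```python
def threshold_50(gray_rows, cutoff=128):
    """Map 8-bit gray values to white (255) at or above the cutoff, else black (0)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in gray_rows]


# Hypothetical values: 20 = printed ink, 200 = faint bleed-through, 245 = paper.
page = [
    [245, 200, 20],
    [20, 245, 200],
]
cleaned = threshold_50(page)
print(cleaned)
```

Because bleed-through is usually much lighter than the printed ink, it falls on the white side of the cutoff and disappears, while the ink stays black.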

 

 Examples of normalizing page images

 

Create PDF

In Adobe Acrobat X Pro

  1. Create> Combine Files into single PDF

  2. Remove any bookmarks

  3. Confirm all pages are included and in the proper order

  4. OCR: View>Tools>Recognize Text; confirm the language matches the text AND use Searchable Image (but NOT Exact); this will automatically straighten pages

  5. Optimize: View>Tools>Document Processing; 60-80%
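The check that all pages are included and in order can be partly automated before combining files. The sketch below assumes tiff filenames end in an underscore plus a page number (as in JeffersonT_001.tif); the function names are hypothetical. A reported gap is only a prompt for review, since blank pages are deliberately not scanned.

```python
import re


def page_numbers(filenames):
    """Extract the numeric suffix before the extension, e.g. 'JeffersonT_012.tif' gives 12."""
    numbers = []
    for name in filenames:
        match = re.search(r"_(\d+)\.tif$", name)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def sequence_report(filenames):
    """Report whether the files are out of order and which page numbers are absent."""
    nums = page_numbers(filenames)
    if not nums:
        return {"out_of_order": False, "missing": []}
    present = set(nums)
    missing = [n for n in range(min(nums), max(nums) + 1) if n not in present]
    return {"out_of_order": nums != sorted(nums), "missing": missing}


# Hypothetical file list in the order it will be combined.
files = ["JeffersonT_001.tif", "JeffersonT_002.tif", "JeffersonT_004.tif"]
report = sequence_report(files)
print(report)
```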

 

In Adobe Acrobat Pro DC

Search for "Optimize PDF" or "Recognize Text" to find the same menus. Shortcut buttons have been set up in the quick toolbar.

 

  • TIP: Performing optimization and OCR as separate steps produces a smaller file size

  • Alternative Scenarios:

    • Instead of Optimization, use "Reduce File Size". This option creates a much smaller file and is best used when converting from TIFF format. It is not recommended for visually dense images, but works well for most text documents such as theses and Shepherd School programs (not for journal articles containing images).

    • For short documents (fewer than 50 pages), PDFs may be batch created using ImageMagick commands

      • Create an image-only PDF using the ImageMagick command: convert *.tif filename.pdf 
        • UPDATE: new command structure: magick *.tif filename.pdf
      • and follow up with OCR and Optimization or Reduce File Size steps
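For repeated batches, the magick command above could be assembled by a small script. This is a sketch under assumptions rather than an established local workflow: the helper names are hypothetical, files are sorted by name (so zero-padded page numbers matter), and the 50-page limit is enforced as a hard stop purely as a design choice. It assumes the ImageMagick 7 `magick` executable is on the PATH.

```python
from pathlib import Path
import subprocess

MAX_PAGES = 50  # "short document" guideline from the text above


def build_magick_command(tiff_names, output_pdf):
    """Return the argument list for: magick page1.tif page2.tif ... output.pdf"""
    pages = sorted(name for name in tiff_names if name.lower().endswith(".tif"))
    if not pages:
        raise ValueError("no .tif files found")
    if len(pages) >= MAX_PAGES:
        raise ValueError("too many pages for batch conversion; use Acrobat instead")
    return ["magick", *pages, output_pdf]


def make_pdf(tiff_dir, output_pdf):
    """Run the conversion inside tiff_dir (defined here but not called in this sketch)."""
    names = [p.name for p in Path(tiff_dir).glob("*.tif")]
    subprocess.run(build_magick_command(names, output_pdf), cwd=tiff_dir, check=True)


# Hypothetical file names; non-tiff files are ignored.
cmd = build_magick_command(
    ["JeffersonT_002.tif", "JeffersonT_001.tif", "notes.txt"], "JeffersonT.pdf"
)
print(" ".join(cmd))
```

The resulting PDF is image-only, so the OCR and Optimization (or Reduce File Size) steps still need to be run afterwards in Acrobat.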

 

 

Filenaming

 

  • Label files sequentially (this helps when combining tiff files during conversion to PDF); the suffix number may match the actual printed page number

  • Do not use spaces or special characters ( % \ / @ ! # )

  • Limit length to less than 32 characters

 

Examples:

  • Thesis: author's last name plus first initial as prefix, then an underscore and the page number

    • JeffersonT.pdf or JeffersonT_001.tif

  • Journal article: for the PDF filename, use a condensed title with dashes between words: Long-and-short-of-it.pdf

    • TIP: when scanning articles, usually use the author's name for tiff files (as it is shorter), then use the title when naming the PDF
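The naming rules and examples above could be expressed as small helpers, sketched below. The function names are hypothetical. The sketch assumes that the 32-character limit applies to the whole filename including the extension, that page numbers are zero-padded to three digits as in the example, and that only spaces and the listed special characters need to be rejected.

```python
import re

FORBIDDEN = set(" %\\/@!#")  # space plus the special characters listed above
MAX_LENGTH = 32              # names must be shorter than this (assumed to include the extension)


def is_valid_filename(name):
    """Check the length limit and the absence of spaces and listed special characters."""
    return len(name) < MAX_LENGTH and not any(ch in FORBIDDEN for ch in name)


def thesis_tiff_name(last_name, first_initial, page):
    """Build a page image name such as JeffersonT_001.tif."""
    return f"{last_name}{first_initial}_{page:03d}.tif"


def article_pdf_name(condensed_title):
    """Join the words of an already condensed title with dashes."""
    words = re.findall(r"[A-Za-z0-9]+", condensed_title)
    return "-".join(words) + ".pdf"


tiff_name = thesis_tiff_name("Jefferson", "T", 1)
pdf_name = article_pdf_name("Long and short of it")
print(tiff_name, is_valid_filename(tiff_name))
print(pdf_name, is_valid_filename(pdf_name))
```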

 

 

 

Resources

Royster, Paul, "The Art of Scanning" (2011). Digital Commons @ University of Nebraska-Lincoln. http://digitalcommons.unl.edu/ir_information/67

 

 

Comments (1)

Monica said

at 1:21 pm on Oct 27, 2013

tinyurl this page: http://tinyurl.com/mxgob5m
