[Standards]
[Library of Congress] [Smithsonian] [Condist] [Stanford]
[Getty AHIP] [Virginia] [GLC]
[File formats] [Software] [Miscellaneous: Hardware]
[People with Neat Ideas (MIT "web-ho")] [Comments]
11/2001: Some of the best places to start are right here:
These are notes on standards for digital images for cultural institutions like libraries, archives and museums. Topics include file formats, resolutions and how to organize images. I always plan to organize this better, but never succeed. -- Paul
At some point, fit this url into the list: Scanning Resolution Calculator
File Formats. When I originally wrote these notes in 1998, there was no single agreed-upon standard for the resolution of permanent "archive" digital images. As of 2002, there is a consensus in the 300-600 dpi range, with most institutions around 300-400. (I attribute this clustering in the 300-400 dpi range in part to the proselytizing of U-Va's brilliant David Seaman, who teaches regular courses at Rare Book School.) Most discussions get hung up on issues that involve current technology or use, not future options or use. Thankfully, people are starting to follow the lead of major institutions. For institutions expected to last many lifetimes, long-term plans need to win over expedience; thus open-source models (SGML, XML, TIFF) trump proprietary ones (WordPerfect, Word, Access or GIF).
Kodak PhotoCD. In 1998-2000, in discussion groups like Archives-L, many respondents seemed to be using Kodak PhotoCD as their "archival" (permanent master) copy. The PCD format is a brilliant solution in many ways: a single file contains multiple resolutions; it is easy for a contractor to produce; and it has high color saturation. But archivists, librarians and curators should be concerned about a file format tied to a company that has introduced such innovations as the 110 Instamatic. It looks like the PhotoCD will be with us for a while, but the file format has some limitations for higher-level publication. It is proprietary and Kodak may change the format. I think it's a fine transitional format and excellent as a master for individuals, but not for medium to large institutions or for professional photographers.
New 8/98. Cornell University Library's Preservation and Conservation Department has produced a study on Kodak PhotoCD: Using Kodak Photo CD Technology for Preservation and Access: A Guide for Librarians, Archivists, and Curators by Anne R. Kenney and Oya Y. Rieger, Department of Preservation and Conservation, Cornell University Library. <http://www.library.cornell.edu/preservation/kodak/cover.htm>
Every library, archive or museum seems to agree that uncompressed TIFFs offer the best chance for future usability. Delivery formats such as JPEG (lossy compression, which discards image detail) and GIF (limited to a 256-color palette) involve an irreversible loss of color or detail and are thus inappropriate for permanent storage. However, GIF and JPEG are the preferred means of delivery, especially on the Web. (Future file formats for delivery may include Portable Network Graphics, or PNG, and HP FlashPix.)
At the Gilder Lehrman Collection, we decided to standardize on uncompressed TIFF files (using PC byte ordering). We had tentatively chosen a 400 dpi color resolution for these TIFFs (the resolution suggested by David Seaman at the University of Virginia), but dropped to 300 dpi color because of staffing and storage problems. (Imaging was driven by a long-term but very time-sensitive project.)
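To put the storage problem in numbers, here is a rough sketch of how uncompressed file size grows with resolution. The 8.5 x 11 inch page is an assumed example, not a GLC figure, and `uncompressed_bytes` is a name made up for illustration. The calculation covers pixel data only and ignores the small TIFF header.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Approximate pixel-data size of an uncompressed scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example page: 8.5 x 11 inches, 24-bit color.
for dpi in (300, 400, 600):
    size = uncompressed_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB")
```

For this assumed page, the 300 dpi scan comes to about 25 MB and the 400 dpi scan to about 45 MB, so the higher resolution costs roughly 1.8 times the storage.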
But here I need to add information about published standards.
LC is the granddaddy of digitizers. Below are citations to an RFP (Request for Proposal) to vendors for scanning for American Memory. The second item is more important.
The RFP specifies 300 dpi, 24 bits, JPGs; bitonal materials scanned at 200/300/400 dpi but 1 bit per pixel. [I suspect they weren't sure what they wanted at this date.-PR]
"Digital Formats for Content Reproductions" by Carl Fleischhauer [American Memory Format Standards (1996)] <http://lcweb2.loc.gov/ammem/formats.html#II>
This document lists specs for many file uses (thumbnail, reference and Archive). A lot of good sense in this article. Here is a quotation from the specifications for the archival file:
Archive
An uncompressed (or, in the future, lossless-compressed) image free of the artifacts resulting from lossy compression, provided to users for reproduction or held for future reprocessing as compression standards change. Not provided at this time; may be provided to users as a downloadable file in the future.
Grayscale: 8 bits per pixel; color: 24 bits per pixel
TIFF (Tagged Image File Format)
Uncompressed
Moderate class ranges from about 500x400 to about 1000x700 pixels; higher resolution class (LC examples coming in future) will range from 2000x1400 to 4000x3000; only the highest resolution will be archived.
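The pixel ranges quoted above are not dpi figures; the effective resolution depends on the size of the original. Here is a small sketch of the conversion, using an assumed 10-inch long edge. The function names are illustrative only and do not come from the LC document.

```python
def effective_ppi(pixels, inches):
    """Pixels per inch captured along one edge of the original."""
    return pixels / inches


def pixels_needed(inches, ppi):
    """Pixels required along one edge to reach a target ppi."""
    return round(inches * ppi)


# Assumed example: an original whose long edge is 10 inches.
for long_edge_px in (1000, 2000, 4000):
    print(long_edge_px, "px ->", effective_ppi(long_edge_px, 10), "ppi")
```

On that assumption, a 4000-pixel long edge works out to 400 ppi, while 1000 pixels gives only 100 ppi.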
The Smithsonian's National Museum of Natural History also wrote what is essentially another RFP, but it appears to be based upon the experience of LC in digitization and seems oriented towards illustrations (and possibly artifacts?). I was impressed in spite of my usual skepticism about the Feds.
"Requirements and Options for the Digitization of the Illustration Collections of the National Museum of Natural History" <http://www.nmnh.si.edu/cris/techrpts/imagopts/contents.html>
I found the report's summary especially useful: <http://www.nmnh.si.edu/cris/techrpts/imagopts/summary.html>. A lot of good sense here. (Not mentioned but implicit is the lossiness in going from original to transparency to paper.) Excerpt here (with emphasis added by me):
For high resolution images, a spatial sampling frequency (sometimes known as the true optical resolution) of at least 600 pixels per inch (ppi) should be employed for all gray tone illustrations and for smaller color illustrations. For larger color illustrations, a sampling frequency of at least 300 ppi should be used. At least 8 bits per pixel for gray tone illustrations and at least 24 bits per pixel (8 bits per color) for color illustrations should be used. For the lower quality images of the entire document, a sampling frequency of about 200 pixels per inch, with 8 bits per pixel, is sufficient.
Conversion to digital images should be done using the original documents, rather than through an intermediary photographic process. At about 600 ppi and with 24 bits per pixel for color, the quality of most illustration images will be better than, or comparable to, that of conventional 4" x 5" color photographic transparencies. Even at 300 ppi, the quality of the digital images exceeds that of 35 mm color transparencies, except for the smallest illustrations. Digitization from either photographic transparencies or prints would definitely be inferior to that done from the original documents because of the accumulation of dust or other defects on the photographs and the difficult control of color during a multi-step process.
Chapter 8 is quite important for issues we have discussed.
<http://www.nmnh.si.edu/cris/techrpts/imagopts/section8.html>
<http://palimpsest.stanford.edu/bytopic/imaging/>
Not much here currently. *sigh* Note however that the Palimpsest site also has the Archives of mailing lists such as PhotoHistory and Exlibris (Rare Books and Special Collections).
Re: large scale image database
Howard Besser's homepage <http://www.sims.berkeley.edu/~howard#multimedia> has links to image database information. See also the Digital Library SunSITE on image databases that he maintains <http://sunsite.Berkeley.EDU/Imaging/Databases/> and the Museum Educational Site Licensing Project <http://www.ahip.getty.edu/mesl/>.
The Getty Museum <http://www.getty.edu> has some now-dated research on images. Long on good theory, but short on specific recommendations. *sigh* To the best of my knowledge there has not been any new research or publications on resolution from Getty.
New publications have emphasized cataloging and description of images, and this emphasis fits nicely with Getty's sponsorship of the Art and Architecture Thesaurus (AAT), widely used in cataloging as an alternative to LCSH.
<http://etext.virginia.edu>. The U-Va standard was set experimentally at 400 dpi, but the specifications page notes that resolution as high as 600 dpi is recommended by the Research Libraries Group (RLG) <http://etext.lib.virginia.edu/helpsheets/specscan.html>. Virginia's 400 dpi standard appears to have evolved from the need to get the clearest possible scans for OCR. New projects, particularly in Virginia's Special Collections, may change that standard. U-Va has an excellent series of online hand-outs for using scanners and Photoshop, here.
David Seaman's (anonymous/unsigned) comments on this same page distinguish between "preservation" (minimal) scanning, intended only to capture basic information like printed words and pictures or handwriting, and "archival" (fullest) scanning. It makes interesting reading in light of attacks on preservation microfilming by Nicholson Baker (see his Double Fold). Seaman writes in part:
For the preservation world, there is a heavy reliance on high-speed, 1-bit (simple black and white) page images shot at 600 dpi and stored at Group 4 fax-compressed files. This gives an image reminiscent of a microfilm image. For a straightforward printed page with no graphics, 1-bit imaging maintains the ability to read the content but it gives no sense of the page as an artifact -- no shading, no color, etc. What we here call archival imaging assumes that one's needs are for high quality images that replicate not simply the information on a page -- as a black and white image does for typeset material at least -- but the experience and visual nuances of the original. A high-quality color image (24-bit) does this -- the value is not simply for specialist use, but for general purpose users too. Some of our most excited and emotional users are members of the general public and high school students who use the color images of rare manuscripts and books. (http://etext.lib.virginia.edu/helpsheets/specscan.html)
Description. Embedding descriptive or cataloging information directly into an image would seem to be an ideal solution to the problem of images getting separated from their cataloging information (in text files or databases). Virginia's E-Text Center is the only institution (that I know about) which has published its standards for embedding cataloging information in images. The system can, no doubt, be adapted.
During my tenure, the Gilder Lehrman Collection tried unsuccessfully to embed cataloging information using Photoshop. Instead, we decided to use the IPTC/NAA fields that are reserved in the TIFF file format. The fields were created for newspaper photographs. They're not ideal. In some respects, it would be better to embed agreed-upon SGML or XML MARC cataloging information or TEI standardized information, or even Dublin Core information (see pages on SGML (Standard Generalized Markup Language) and Meta tags). On the plus side, if we ever provided these annotated images to a service bureau, they'd have all our crediting information already embedded in industry-standard captioning fields.
Another downside to embedded information: should descriptions or subject terms need changing, each image file would have to be modified.
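For a sense of what a Dublin Core description might look like as XML, here is a minimal sketch using Python's standard library. The `record` wrapper element, the function name and the sample values are all assumptions for illustration; only the Dublin Core element namespace is standard. This is not the scheme used by Virginia or by GLC.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a small XML record from a dict of Dublin Core element names."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample values, for illustration only.
sample = {
    "title": "Sample letter",
    "creator": "Unknown correspondent",
    "identifier": "SAMPLE-00001",
}
print(dublin_core_record(sample))
```

A record like this is plain text, so it can be regenerated from a database whenever a description changes, though any copy already embedded in an image file would still need to be rewritten.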
The JPEG Tutorial: <http://www.imaging.org/tutorial/jpegtutorial/>, tutorial by Edward J. Delp, Purdue University. This page presents a brief description of how JPEG compresses images. Pretty good without being too technical.
PNG. A recent file format for the web. While it shows some promise, the widespread adoption of JPEG, GIF and TIFF will make acceptance more difficult. Changes to the JPEG format, especially improved compression, make PNG's acceptance problematic. (3/2002)
TIFF. An uncompressed format (if you don't try some LZW compression) that can be opened by almost any graphics program now, and probably by every one in the future as well.
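Whether a given TIFF really is uncompressed, and which byte order it uses, can be read from the file itself. The sketch below parses the TIFF header and the first image directory with the standard library only. It assumes a classic (non-BigTIFF) file, looks only at the first image, and `tiff_summary` is a name made up for illustration.

```python
import struct

COMPRESSION_TAG = 259  # TIFF "Compression" tag
COMPRESSION_NAMES = {1: "uncompressed", 4: "CCITT Group 4", 5: "LZW", 7: "JPEG"}


def tiff_summary(data):
    """Return (byte_order, compression) for the first image in TIFF bytes."""
    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian, the "PC" byte order
    elif order == b"MM":
        endian = ">"  # big-endian, the "Macintosh" byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    compression = 1  # the TIFF default when the tag is absent
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (compression,) = struct.unpack(endian + "H", data[start + 8:start + 10])
    byte_order = "PC (Intel)" if order == b"II" else "Macintosh (Motorola)"
    return byte_order, COMPRESSION_NAMES.get(compression, f"other ({compression})")
```

To check a file, read its bytes (for example with `open(path, "rb").read()`, where `path` is whatever file you want to inspect) and pass them to `tiff_summary`. Reading the whole file is wasteful for large scans, but it keeps the sketch short.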
PCD. A Kodak proprietary format for the Kodak PhotoCD. While the CDs are long-lasting, it remains to be seen how long the format will remain supported. The Kodak web site <http://www.kodak.com> has information, but Phil Greenspun, below, is easier to follow. See also the Cornell study mentioned above.
FlashPix. Another format from Kodak. Phil Greenspun seemed very excited by this format in 1998, but as of 2002 there's not much done with it. It holds multiple resolutions like PCD.
Philip Greenspun (below) has some good links on Adobe Photoshop.
Nice tutorial on making JPGs from PCD images <http://www.peimag.com/swan_cd.htm>
Berkeley Multimedia Research Center - DIGITAL CAMERAS. <http://www.bmrc.berkeley.edu/articles/9612-01.html> This is an increasingly dated study. General principles are mentioned elsewhere; camera reviews for consumer and professional applications can be found in computer magazines or PhotoNet. (However, for museum-quality and similar scanning-back cameras, you are better off talking to colleagues about their experiences and looking at advertisements in Museums, published by the American Association of Museums.)
Except for Greenspun on digital images, these notes have been moved here.
Philip Greenspun, MIT computer scientist and published photographer, has interesting thoughts on graphics and provocative thoughts on sites. (Also see my comments and links about his very interesting comments about databases.) He maintains the site <http://www.photo.net>. For digital archives he recommends PhotoCD. He has very useful ideas on how to organize (for example, keeping part of the CD serial number in image filenames, even when the images are modified). Lately he has become rather obsessed with HP's FlashPix.
Compiled by Paul Romaine.