Grab description/keywords from a Tiff image?

Question

I have a number of TIFF files which contain descriptions and "keywords" (as OS X terms them in the file inspector). I'm having difficulty collecting this metadata from the images, however.

I've tried tifffile.py, PIL's exif commands, and IPTCInfo. tifffile.py will get the description, but I still can't seem to parse the "keywords" from the file using any of these libraries.

Are keywords stored using a different "specification" for TIFFs than for JPEGs? What would be the best approach to parse these keywords?

EDIT

Further to the comment from abarnert, I opened one of the TIFF files in a text editor and found that there is XML data which contains the "keywords". Snippet below:

...
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:description>
<rdf:Alt>
 <rdf:li xml:lang="x-default">OLYMPUS DIGITAL CAMERA</rdf:li>
</rdf:Alt>
</dc:description>
<dc:format>image/tiff</dc:format>
<dc:subject>
<rdf:Bag>
 <rdf:li>Foo</rdf:li>
 <rdf:li>Bar</rdf:li>
 <rdf:li>A long keyword</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
...

It looks as though this could be stored as a binary representation; tifffile.py lists a number of tags that are essentially tuples of integers. I'm not sure how to convert this, however. Suggestions?

@cgohlke not sure this is possible on OS X; there isn't a package in PyPI and the build process looks as though it won't allow it to be sandboxed (a requirement for me). — Phillip B Oldham, Jul 04 '12 at 08:17

score 2 · Answer 1 · answered Jul 03 '12 at 17:54

Are keywords stored using a different "specification" for TIFFs than for JPEGs?

Well, it depends.

The TIFF container has only a very limited set of metadata tags, and doesn't have any way of specifying arbitrary keywords.

JPEG isn't a container type at all; it's an image compression codec, which can be used in a variety of different containers, including TIFF. But usually when you say "JPEG file" you mean JFIF, one of the container formats specified by the JPEG group, and, like TIFF, JFIF has a very limited set of metadata tags.

Exif is another container format, structurally identical to TIFF, but it defines new tags expressly for metadata. That means you can trivially wrap a TIFF as an Exif and it's still a legal TIFF. With a bit of sneakiness you can also wrap a JFIF as an Exif in a way that's not quite a legal JFIF, but almost all software accepts it anyway.

Exif is the only common way to add metadata to JFIF (not counting DCF, which is basically the same thing as Exif), but it's one of multiple different ways to add metadata to TIFF. IPTC is another one, as are XMP, OME, and probably lots of others.

So, some TIFF files store "keywords" using the same specification as JFIF-wrapped-in-Exif, but others don't.

What would be the best approach to parse these keywords?

Well, you need to know what format they're stored in.

Needless to say, PIL's exif commands only support Exif, IPTCInfo only supports IPTC, and tifffile mostly supports… well, a variety of different things.

I believe tifffile.py can store unknown tag types as raw binary data, which you can iterate through to see what you're missing. That will at least cover all the extensions that use the TIFF container structure. If you don't find the keywords there, then… at least that rules out many common formats.
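If you'd rather do that iteration without relying on any particular library, you can walk the first image file directory (IFD) by hand. The following is only a rough sketch using the standard library, assuming a classic (non-BigTIFF) file that fits in memory and looking at the first IFD only. `list_tiff_tags` and `TYPE_SIZES` are names made up for this example; they are not part of tifffile.

```python
import struct

# Size in bytes of one element of each classic TIFF field type (1..12).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def list_tiff_tags(data):
    """Return {tag_code: raw_value_bytes} for the first IFD of a classic TIFF.

    Hypothetical helper for illustration; values are left undecoded.
    """
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    # The IFD starts with a 2-byte entry count, followed by 12-byte entries.
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        code, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        size = TYPE_SIZES.get(field_type, 1) * n
        if size <= 4:
            # Small values are stored inline in the entry's last 4 bytes.
            raw = data[start + 8:start + 8 + size]
        else:
            # Larger values live elsewhere; the entry holds their file offset.
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + size]
        tags[code] = raw
    return tags
```

With a hypothetical file name, `list_tiff_tags(open("photo.tif", "rb").read())` gives you the tag codes to look through. Codes worth checking are 700 (embedded XMP packet), 33723 (IPTC), and 34665 (pointer to the Exif sub-IFD).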

Anyway, once you know which format you're looking for, you can look for a library that can handle it. (Or, if it's one of the XML-based ones, just read the tag as binary data with tifffile, then parse that as UTF-8 XML, which is probably easier than finding a different library.)
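As a rough sketch of that last route: once you have the tag value, whether as bytes or as a tuple of integers, turn it into bytes and hand it to an XML parser. This example uses the standard library's `xml.etree.ElementTree` rather than `lxml`, and assumes the payload is a well-formed RDF/XML document shaped like the snippet in the question. `extract_keywords` and `to_bytes` are hypothetical helper names, and stripping NUL bytes is a defensive assumption about padding, not something the question confirms.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: "dc" is taken from the question's snippet; "rdf" is the
# standard RDF syntax namespace.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def to_bytes(value):
    """Accept bytes, str, or a sequence of integers (0-255); return bytes."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    return bytes(value)  # e.g. a tuple of integers as shown by a tag dump


def extract_keywords(value):
    """Return (description, keywords) from an RDF/XML metadata payload."""
    # Assumption: any NUL padding around the document can be dropped.
    xml_bytes = to_bytes(value).strip(b"\x00")
    root = ET.fromstring(xml_bytes)
    # Keywords are the rdf:li items inside dc:subject/rdf:Bag.
    keywords = [li.text for li in
                root.iterfind(".//dc:subject/rdf:Bag/rdf:li", NS)]
    # The description is the first rdf:li inside dc:description/rdf:Alt.
    desc = root.find(".//dc:description/rdf:Alt/rdf:li", NS)
    return (desc.text if desc is not None else None), keywords
```

The same XPath expressions should carry over to `lxml` if you prefer it, since its `find` and `iterfind` also accept a `namespaces` mapping.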

I can see what could be binary data in some of the tags parsed by tifffile (long tuples containing integers) - what would be the approach to parse this into something I can pass to `lxml`? — Phillip B Oldham, Jul 04 '12 at 08:11
Of course it would have to be one of the formats I didn't mention, DCMI… Or is it DCMI+OMF? Either way, this isn't the same as XMP—but, like XMP, it's an XML document type, which is generally embedded in TIFF as a single tag, just by storing a UTF-8 string as the tag value. It looks like tifffile doesn't know how to handle this, so you'll have to extend it. You could treat it as binary data and then just decode UTF-8 before decoding XML, or (better) add a read_utf8 function and reference it in CUSTOM_TAGS. At this point, you may be better off talking to the author? — abarnert, Jul 05 '12 at 17:42
PS, you may want to look at http://dublincore.org/ and http://en.wikipedia.org/wiki/Dublin_Core for more information on the different ways DCMI data can be represented, although I don't know where to find documentation on how DCMI-RDF or DCMI-OMF or whatever you have gets embedded in TIFF. — abarnert, Jul 05 '12 at 17:46
