Python's exif module and Umlauts in JPEG Metadata

Question

I am writing a small script to help me edit the EXIF metadata of JPEG files, especially the 'artist' field, using the exif module in Python 3. However, as I am German, some of my files have an artist field that contains an Umlaut, such as 'ü'. If I open one of these files in 'rb' mode, create an exif Image object with myimgobj=Image(myfile) and try to access myimgobj.artist, I get a long list of multiple (!) UnicodeDecodeErrors, which are basically all the same:

'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

For some of the error messages the position is 0 rather than 9, but I guess they can all be traced back to the same cause: the Umlaut. Everything works fine if there is no Umlaut in the field. Is there any way I can work with the exif package and extract the artist even if it contains an Umlaut?

Edit: To provide a minimal example, please consider any JPEG image where you set the artist field to 'ä' (I'd upload one, but the EXIF tags get removed during the upload). It then fails, for example, when I try to print the artist like this:

from exif import Image
with open('Umlaut.jpg','rb') as imgfile:
    my_image=Image(imgfile)
    print(my_image.artist)

EXIF only has very few fields where non-ASCII is allowed, but this doesn't prevent other software from just writing ISO-8859-1 to most fields. You need to avoid converting anything to Unicode/UTF-8 when reading EXIF, and instead try to treat it as ISO-8859-1 (for German). — AmigoJack, Jan 15 '21 at 09:24
@AmigoJack Okay, how can I do this using the exif module in Python? Is there a way to tell it how to treat the strings? I figure that trying to decode the string (with my_image.artist.decode(...)) wouldn't help because the error already appears while reading from the file, correct? — DerAuenlaender, Jan 21 '21 at 20:43
No, there's not "a" way. Only heuristics, trial & error - see https://stackoverflow.com/a/90916/4299358. In your case it **could** be ISO-8859-1, but you can't tell reliably. — AmigoJack, Jan 21 '21 at 21:24
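
The trial-and-error approach mentioned in the comments can be sketched as follows. This is only an illustration under assumptions: decode_exif_text is a hypothetical helper, not part of the exif module, and it expects the raw bytes of the field, which you would have to obtain by other means. It tries UTF-8 first, because ISO-8859-1 text containing Umlauts is rarely valid UTF-8, and then falls back to ISO-8859-1.

```python
def decode_exif_text(raw: bytes, candidates=("utf-8", "iso-8859-1")) -> str:
    """Hypothetical helper: guess the encoding of a raw EXIF text field."""
    # EXIF ASCII fields are NUL-terminated; drop the terminator and any padding.
    raw = raw.rstrip(b"\x00")
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next candidate
    # Only reached if every candidate failed (ISO-8859-1 itself never fails).
    return raw.decode("ascii", errors="replace")


print(decode_exif_text(b"M\xc3\xbcller\x00"))  # field written as UTF-8
print(decode_exif_text(b"M\xfcller\x00"))      # field written as ISO-8859-1
```

The result is still a guess: ISO-8859-1 accepts every byte sequence, so a wrong encoding yields wrong characters rather than an error.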

JosefZ · Answer 1 · 2021-01-22T14:28:18.053

Use the following:

import exifread

with open('Umlaut.jpg','rb') as imgfile:
    tags = exifread.process_file(imgfile)

print(tags)                     # all tags

for i,tag in enumerate(tags):
    print(i,tag, tags[tag])     # tag by tag

Result, tested with the string äüÃ (== b'\xc3\xa4\xc3\xbc\xc3\x83'.decode('utf8')) inserted manually into the Authors field; output of .\SO\65720067.py:

{'Image Artist': (0x013B) ASCII=äüÃ @ 2122, 'Image ExifOffset': (0x8769) Long=2130 @ 30, 'Image XPAuthor': (0x9C9D) Byte=äüÃ @ 4210, 'Image Padding': (0xEA1C) Undefined=[] @ 62, 'EXIF Padding': (0xEA1C) Undefined=[] @ 2148}
0 Image Artist äüÃ
1 Image ExifOffset 2130
2 Image XPAuthor äüÃ
3 Image Padding []
4 EXIF Padding []

In the light of these facts, you can change your code to

from exif import Image

with open('Umlaut.jpg','rb') as imgfile:
    my_image=Image(imgfile)

# print(my_image.artist)          # error described below
print(my_image.xp_author)         # äüÃ   as expected
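
A plausible reason why xp_author works where artist fails (an assumption on my part, not something the exif module's documentation is quoted on here): the XP tags such as XPAuthor are conventionally stored as UTF-16LE bytes rather than as an ASCII string, so no ASCII decoding is involved. A minimal sketch of that encoding using only the standard library:

```python
text = "äüÃ"

# XP* tags conventionally hold UTF-16LE code units followed by a NUL terminator.
xp_bytes = text.encode("utf-16-le") + b"\x00\x00"
print(xp_bytes.hex(" "))

# Decoding reverses the steps: strip the two-byte terminator, then decode.
roundtrip = xp_bytes[:-2].decode("utf-16-le")
print(roundtrip)
```

Every character here needs two bytes, and none of them has to fit into the 7-bit ASCII range.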

BTW, running your code unchanged produces the following (each … stands for a bunch of messages omitted from the full error traceback):

…
+--------+-----------+-------+----------------------+------------------+
| Offset | Access    | Value | Bytes                | Type             |
+--------+-----------+-------+----------------------+------------------+
|        |           |       |                      | AsciiZeroTermStr |
|        | [0:0]     | ''    |                      |                  |
| 0      | --error-- |       | c3 a4 c3 bc c3 83 00 |                  |
+--------+-----------+-------+----------------------+------------------+

UnicodeDecodeError occurred during unpack operation:

'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
…
+--------+-----------+-------+-------------------+----------+
| Offset | Access    | Value | Bytes             | Type     |
+--------+-----------+-------+-------------------+----------+
|        |           |       |                   | AsciiStr |
|        | [0:0]     | ''    |                   |          |
| 0      | --error-- |       | c3 a4 c3 bc c3 83 |          |
+--------+-----------+-------+-------------------+----------+

UnicodeDecodeError occurred during unpack operation:

'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
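
The error itself can be reproduced without any EXIF library. The sketch below takes the bytes shown in the table above (c3 a4 c3 bc c3 83) and decodes them first as ASCII, then as UTF-8; the variable names are just for illustration.

```python
# The six bytes from the Bytes column of the error table.
raw = bytes.fromhex("c3 a4 c3 bc c3 83")

try:
    raw.decode("ascii")  # every byte is >= 0x80, so ASCII decoding fails at once
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# The same bytes are valid UTF-8.
text = raw.decode("utf-8")
print(text)
```

This also accounts for the differing positions in the question: the reported position is simply the index of the first non-ASCII byte in the field, so it is 0 when the string starts with an Umlaut and 9 when nine ASCII characters precede it.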
