I am running a Python program that reads the embedded metadata of an image using exiftool. However, special characters in the metadata values come out garbled. The output looks like this:

Description: UnitedHealth’s Optum blah blah blah......

Instead, it should be printed as follows:

Description: UnitedHealth's Optum blah blah blah...…

I tried the following code:

subprocess.run(["exiftool", "-j", image_path], capture_output=True, text=True, check=True, encoding='utf-8')

I tried decoding the output as UTF-8.

I also tried json, chardet, charset, and HTML encoding,

but none of them seems to work.

I tried changing the encoding of my console to "utf-8".

I am using PyCharm on Windows.

VeeyesPlus
  • could it maybe be utf-16? – vs07 Jul 05 '23 at 12:43
  • It is a jpg file, I don't think exiftool could provide metadata in UTF-16 encoding. – VeeyesPlus Jul 05 '23 at 12:52
  • This is [ExifTool FAQ #18](https://exiftool.org/faq.html#Q18). There's something about the Perl libraries that exiftool uses that doesn't deal well with the Windows command line. Note that there are no problems on Linux/Mac. On my system, I was never able to get it working until I found [this StackOverflow answer](https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096). But that option may cause spacing problems in the GUIs of older programs. – StarGeek Jul 05 '23 at 22:18

1 Answer

This is a mojibake problem. Somewhere deep in the system, one setting reports the encoding as X (usually cp1252, a code page close to "latin1"), and either Python's stream decoder or exiftool assumes that. Some other configuration value, however, reports the terminal encoding as something else (such as CP437 or CP852, depending on the Windows language).
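To see which encodings are being reported on a given machine, a small diagnostic along these lines may help. It is only a sketch: the function name `report_encodings` and the dictionary keys are made up for illustration, and the Windows code page lookups go through `ctypes` and the Win32 functions `GetACP`, `GetOEMCP`, and `GetConsoleOutputCP`, which are skipped on other systems. The values it prints are not guaranteed to match what a subprocess started from an IDE actually uses.

```python
import locale
import os
import sys


def report_encodings():
    """Collect the encodings that Python and (on Windows) the OS report."""
    info = {
        "locale preferred": locale.getpreferredencoding(False),
        "sys default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    if os.name == "nt":
        import ctypes  # only needed for the Windows-specific lookups

        kernel32 = ctypes.windll.kernel32
        info["ANSI code page"] = f"cp{kernel32.GetACP()}"
        info["OEM code page"] = f"cp{kernel32.GetOEMCP()}"
        # GetConsoleOutputCP returns 0 when no console is attached.
        info["console output code page"] = f"cp{kernel32.GetConsoleOutputCP()}"
    return info


for name, value in report_encodings().items():
    print(f"{name}: {value}")
```

If the ANSI and OEM code pages differ (for example cp1252 versus cp437), that mismatch is a likely candidate for the kind of garbling described here.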

I just ran an example here with a .bat script that outputs "Alô mundo"; read through subprocess.run on the cmd terminal, it came back as "Al“ mundo!".

What I had to do was re-encode the text to bytes using "cp1252" and then decode it using the code page reported in my CMD configuration. I had to check that in the cmd preferences dialog: from Python it did not show up in any of the three sys.get*encoding methods, nor in the encoding of sys.stdout; all four reported "utf-8".

Note that I have a tag badge for "unicode" from answering encoding questions like this, and I have a grasp of the underlying mechanisms... but Windows partially supporting a 40+ year old legacy for its terminal, while trying to use latin1 for the UI and utf-8 for the dev environment, is too much to wrap one's head around in a deterministic way.

I digress; try the code below. If that does not work, look up the code page of your console and use it instead of CP437 (PyCharm is likely replicating some of this environment for its subprocesses):

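import subprocess
# text=True decodes stdout with the system default encoding (assumed to be cp1252 here)
output = subprocess.run(["exiftool", "-j", image_path], capture_output=True, text=True, check=True).stdout
corrected_output = output.encode("cp1252").decode("cp437")  # back to the raw bytes, then decode with the console code page
print(corrected_output)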
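A variation on the same idea, offered as a sketch and not as a tested fix for exiftool, is to skip `text=True`, capture the raw bytes, and decode them once with the encoding you believe is correct. This avoids the intermediate cp1252 step, which can fail when the output contains bytes that cp1252 does not define. The helper `run_and_decode` is a hypothetical name, and the child process below is only a stand-in that emits one cp437-encoded line.

```python
import subprocess
import sys


def run_and_decode(cmd, encoding):
    """Run cmd, capture stdout as raw bytes, and decode it exactly once."""
    raw = subprocess.run(cmd, capture_output=True, check=True).stdout
    return raw.decode(encoding)


# Stand-in for exiftool: a child process that writes one cp437-encoded line.
child = [sys.executable, "-c", r"import sys; sys.stdout.buffer.write(b'Al\x93 mundo')"]
print(run_and_decode(child, "cp437"))  # Alô mundo
```

For the real command this would become `run_and_decode(["exiftool", "-j", image_path], "cp437")`, with "cp437" replaced by whichever code page (or "utf-8") matches your setup.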
jsbueno