
I have some base64 encoded text fields in some XML data.

To get all the characters to display correctly, I think I need to identify the character encoding of the underlying text, which does not look like UTF-8. There may be some other encoding layer involved too, but I am not sure.

I am not sure in what order I should be encoding and decoding here. Following https://www.geeksforgeeks.org/encoding-and-decoding-base64-strings-in-python/ I first tried to:

  1. Encode the whole string with every possible Python 2.7 encoding, then
  2. decode with base64

(This gave the same result each time, with no readable representation of the problem characters.)

Then I tried:

  1. Encode the string with UTF-8
  2. decode with base64
  3. decode the resulting byte string with every possible Python 2.7 encoding

However, none of the resulting strings seems to give a readable representation of the problem characters, which should display as 'é' and 'ü'.

Here is an example string for which I am sure what the final text should be. Original base64 string:

b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='

Expected text, with the correct 'é' and 'ü' characters near the beginning (deduced from knowledge of the languages involved):

'Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'

Note that '&#13;' is the HTML numeric character reference for the carriage return character (U+000D), which is part of the Windows line ending. The '?' might also resolve to some other character under the correct encoding, or it might be exactly what the original data contains.

snakecharmerb
Will Croxford

1 Answer


It seems to be encoded with mac_roman:

>>> import base64
>>> b64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
>>> bs = base64.b64decode(b64)
>>> bs
b'Gr\x9fnder Fr\x8ed\x8eric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
>>> print(bs.decode('mac_roman'))
Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne

The question marks in "Nata?a Petre?in-Bachelez" are present in the original data, presumably the result of a previous encoding/decoding problem.
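If the literal `&#13;` entities should also become real line breaks, something like the following may help. It is only a minimal sketch, assuming Python 3 and assuming the payload really is Mac Roman text that contains HTML character references; the names `B64`, `text` and `clean` are illustrative and do not come from the original data.

```python
import base64
import html

# The base64 string from the question.
B64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='

# Step 1: base64 text -> raw bytes.
raw = base64.b64decode(B64)

# Step 2: raw bytes -> str, assuming the bytes are Mac Roman.
text = raw.decode('mac_roman')

# Step 3: resolve HTML character references such as '&#13;' (carriage return).
clean = html.unescape(text)

# Step 4: normalise CR and CRLF line endings to '\n'.
clean = clean.replace('\r\n', '\n').replace('\r', '\n')

print(clean)
```

The order matters: base64 decoding always yields bytes, and only then does a text codec apply. The '?' characters are left untouched, since they are ordinary 0x3F bytes in the data.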

snakecharmerb
  • Ah, thank you kindly. I had printed only the unicode strings in Python 2, saw ü as u'\xfc', and didn't realise this was the same character. As I understand from a quick check of Wikipedia, Mac Roman was the default Mac OS encoding before Mac OS X, so maybe this data was dumped on an old Mac system. In theory the same result could be obtained using 'mac-greek', 'mac-latin2', etc., but Occam's razor suggests Mac Roman is the best choice for the rest of this data. Most grateful to learn something new about encoding. – Will Croxford May 10 '21 at 17:29
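The point in the comment above, that several codecs can give the same result for these particular bytes, can be checked with a small scan. This is a rough sketch, assuming Python 3; the helper `matching_codecs` and the hand-picked `candidates` list are hypothetical and not part of any library. The sample bytes are the start of the decoded payload shown in the answer.

```python
# First bytes of the decoded payload, as shown in the answer.
sample = b'Gr\x9fnder Fr\x8ed\x8eric Jousset'
expected = 'Gründer Frédéric Jousset'

# Hand-picked codec names to try (an assumption; extend as needed).
candidates = [
    'utf_8', 'latin_1', 'cp1252', 'cp437',
    'mac_roman', 'mac_greek', 'mac_latin2',
    'mac_turkish', 'mac_iceland', 'mac_cyrillic',
]


def matching_codecs(data, target, codecs):
    """Return the codec names under which `data` decodes exactly to `target`."""
    matches = []
    for name in codecs:
        try:
            if data.decode(name) == target:
                matches.append(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this codec, or the codec is unavailable.
            continue
    return matches


print(matching_codecs(sample, expected, candidates))
```

A short sample like this cannot distinguish between codecs that agree on 'é' and 'ü', so more than one Mac codec may be reported. Checking further records that contain other accented characters would narrow the choice.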