Python printing of lxml objects with foreign characters

Question

I used a Python SDK provided by a credit card processing company to access credit card transactions. Everything works fine except when I hit a customer address that contains foreign (e.g., accented) characters. In that case, depending on what I do, the script either crashes or outputs garbled text.

For example, the following lines...

            print u'\u0420\u043e\u0441\u0441\u0438\u044f'
            print unicode(billTo.address)
            print billTo.address.__class__

yield:

Россия

Nollekensweg ï¿½32

<type 'lxml.objectify.StringElement'>

Notice that the first line shows that unicode CAN be correctly printed. The second line illustrates the garbled address, where the three characters before '32' should presumably be a single accented or special character. The third line shows what kind of object billTo.address is.

Also, if I try using "print billTo.address" instead of "print unicode(billTo.address)", the program throws an error.

Any ideas what I need to do to correctly retrieve and print the contents of billTo.address? Note that I have no control over what the software I've been given puts into that object.

EDIT: Adding the traceback:

`Traceback (most recent call last):
  File "get_transaction_details.py", line 352, in <module>
    download_transaction_details('8386560251')
  File "get_transaction_details.py", line 237, in download_transaction_details
    print billTo.address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)`
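
For reference, the error can be reproduced without the SDK. This is only a sketch: it assumes the failure is an implicit ASCII encode of the element's text, and `address` is a hypothetical stand-in for `unicode(billTo.address)` built from the value that `repr()` reports in the comments below.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for unicode(billTo.address); the real value comes from the SDK.
address = u'Nollekensweg \xef\xbf\xbd32'

# Encoding to ASCII fails on the three non-ASCII characters after the space.
try:
    address.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # exc.end is exclusive, so (13, 16) matches "position 13-15" in the traceback.
    failure = (exc.start, exc.end)

# Encoding explicitly to UTF-8 succeeds; the resulting bytes can be written to
# a terminal or file that expects UTF-8.
utf8_bytes = address.encode('utf-8')
```

Encoding explicitly avoids the crash, but it does not repair the text: the three odd characters are still in the string.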

You should check the encoding of the data you get and convert to something you can work with — Dekel, Dec 31 '16 at 01:30
BTW: always include the FULL error message (traceback) in the question; it is more useful than the text "the program throws an error." — furas, Dec 31 '16 at 01:32
Are you creating the `lxml.objectify` objects or is this API doing it for you? Maybe `billTo.address.text` works better. Can you `print repr(billTo.address.text)` so we can see object type and the strange data? (you may need to remove sensitive information). Best of all worlds is if you can work this into an example program that we can run. — tdelaney, Dec 31 '16 at 01:44
API does it for me. Here's what I get: u'Nollekensweg \xef\xbf\xbd32' — Grant Petty, Dec 31 '16 at 01:46
I just found another page that identifies the escaped sequence. So it's a replacement character, not the actual desired character: http://stackoverflow.com/questions/11159118/incorrect-string-value-xef-xbf-xbd-for-column In short, it appears to me that it's a problem with the API, not with anything I control. — Grant Petty, Dec 31 '16 at 01:49
That's mojibake. The original value had a replacement character inserted as Unicode, it got encoded as UTF-8, then decoded to (probably) `Windows-1252` or `latin1`. So you'll have to go upstream to figure out how the value got the way it did. — Mark Tolonen, Dec 31 '16 at 01:52
BTW, `Nollekensweg ï¿½32` is the correct way to display the Unicode string you have, it's just already messed up. And, since correcting the encoding only gives you the replacement character, there isn't much you can do. — Mark Tolonen, Dec 31 '16 at 01:54
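
The round trip described in the last two comments can be checked with a short sketch. It assumes the text is exactly the value reported by `repr()` above and that the wrong decode was Latin-1 (Windows-1252 maps these three bytes to the same characters). The variable names are illustrative and not part of the SDK.

```python
# -*- coding: utf-8 -*-
# Value reported by repr(billTo.address.text) in the comments above.
garbled = u'Nollekensweg \xef\xbf\xbd32'

# Forward direction: U+FFFD (the replacement character) encoded as UTF-8 and
# then decoded with a single-byte codec yields the three garbled characters.
as_latin1 = u'\ufffd'.encode('utf-8').decode('latin-1')
as_cp1252 = u'\ufffd'.encode('utf-8').decode('cp1252')

# Reverse direction: map each character back to one byte, then decode as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')

# The "repaired" string contains only U+FFFD where the original character was,
# so the original character cannot be recovered from this value.
has_replacement_char = u'\ufffd' in repaired
```

Since undoing the wrong decode only yields the replacement character, the original character was lost before the value reached this script.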
