-2

I'm wondering how I can convert ISO-8859-2 (Latin-2) characters (that is, the integer or hex values that represent ISO-8859-2 encoded characters) to UTF-8 characters.

What I need to do in my Python project:

  1. Receive hex values from the serial port, which represent characters encoded in ISO-8859-2.
  2. Decode them, that is, get "standard" Python unicode strings from them.
  3. Prepare and write an XML file.

Using Python 3.4.3

>>> txt_str = "ąęłóźć"
>>> txt_str.decode('ISO-8859-2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

The main problem is still preparing valid input for the "decode" method (it works in Python 2.7.10, and that's the one I'm using in this project). How do I prepare a valid string from decimal values, which are Latin-2 code numbers?

Note that it would be extremely complicated to receive UTF-8 characters from the serial port, because of the devices I'm using and the limitations of the communication protocol.

Sample data, on request:

68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069

This is some sample data: ISO-8859-2 bytes packed into uint32 values, 4 characters per integer.

The bit of code that manages the unpacking:

l = l[7:].replace(",", "").replace(".", "").replace("\n","").replace("\r","") # crop string from uart, only data left
vl = [l[0:2], l[2:4], l[4:6], l[6:8]] # list of bytes
vl = vl[::-1] # reverse them - now in actual order

To get integer value out of hex string I can simply use:

int_vals = [int(hs, 16) for hs in vl]
user2046193
  • It should be as simple as: this_is_the_text_string.decode('ISO-8859-2'), which gives you the unicode string (at least in Python 3). – elzell Feb 02 '16 at 07:50
  • Easy. Convert from [hex to bytes](https://docs.python.org/2/library/binascii.html#binascii.a2b_hex), [decode as latin-2](https://docs.python.org/2/library/stdtypes.html#str.decode), [encode as UTF-8](https://docs.python.org/2/library/stdtypes.html#str.encode). Do you have any sample data? – Martijn Pieters Feb 02 '16 at 07:51
  • However, if you are going to write XML, why not keep the value as Unicode (so decoded from ISO-8859-2), and leave it to the XML library to encode to UTF-8? – Martijn Pieters Feb 02 '16 at 07:52
  • you should look at this. [http://stackoverflow.com/questions/26125141/str-object-has-no-attribute-decode-in-python3](http://stackoverflow.com/questions/26125141/str-object-has-no-attribute-decode-in-python3) – bender Feb 02 '16 at 08:25
  • The string type in Python 3 is Unicode. If you want to enter raw single bytes, use the Python 3 byte string data type; but then you'll need to encode the bytes as hex, not as characters (because those are Unicode characters). – tripleee Feb 02 '16 at 08:26
  • The sample data has only four characters which are not plain old 7-bit ASCII; and the byte order is less than transparent here, so it's not entirely clear how to return this to human-readable text. You could make it more useful and less bulky with something like `b'\xb1\xea\xb3\xf3\xbc\xe6'` instead (and perhaps the information that this is the ISO-8859-2 representation of the string `"ąęłóźć"`); but then that's already the answer to your question, I guess. – tripleee Feb 02 '16 at 08:33
  • Assuming you're reading data in Python with PySerial, then you should be receiving a byte string from `read()`, which does support the `decode()` method. I think your test code isn't representative of what you actually want to do. – Alastair McCormack Feb 02 '16 at 08:44
  • What you see there is values from uart device, and it is input to my app, can't do anything about it. I still don't know how to actually get ISO-8859-2 char (which I can decode and encode to utf-8 again) from decimal value, which is my main concern. @edit Alastair You are correct, I'm reading it with PySerial. Though the "output" from serial is hex integers which represent 4 ISO-8859-2 chars, so .decode() on read() won't work. – user2046193 Feb 02 '16 at 08:50
  • So your remote UART device is actually encoding to ISO-8859-2, then encoding that to an ASCII hex representation? – Alastair McCormack Feb 02 '16 at 08:55
  • Or are you saying your remote device is encoding to ISO-8859-2, then shoving that into a uint32, then encoding it as ASCII hex before putting it on the wire? – Alastair McCormack Feb 02 '16 at 09:13
  • My remote UART device can only print the hex values that are in its register. It's a wireless comm device. I use an Android app to send a string to it, and to keep things easy and reliable on the hardware side, I have to use an 8-bit encoding. More than that, my project requires that I use the Latin-2 encoding (e.g. for Polish characters). I have commented on your code with details. – user2046193 Feb 02 '16 at 09:17
  • I mean - I have implemented UTF-16 -> ISO-8859-2 conversion, putting four of them in one uint32 and sending it over BT in Android App – user2046193 Feb 02 '16 at 09:22
  • Downvoting and voting to close as actual problem is not properly defined and far too specific – Alastair McCormack Feb 02 '16 at 10:14

3 Answers

2

Your example doesn't work because you've tried to use a str to hold bytes. In Python 3 you must use byte strings.

In reality, if you're using PySerial then you'll be reading byte strings anyway, which you can convert as required:

import serial  # PySerial

with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    s = ser.read(10)
    # Py3: s is a bytes object
    # Py2.x: s is a str (byte string)
    my_unicode_string = s.decode('iso-8859-2')

If your ISO-8859-2 data is then additionally encoded as an ASCII hex representation of the bytes, you have to apply an extra layer of decoding:

import codecs
import serial  # PySerial

with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    hex_repr = ser.read(10)
    # Py3: hex_repr is a bytes object
    # Py2.x: hex_repr is a str (byte string)

    # Decode the ASCII hex representation to raw bytes,
    # e.g. b"A3" -> b'\xa3'
    hex_decoded = codecs.decode(hex_repr, "hex")
    # Decode the raw Latin-2 bytes to a unicode string
    my_unicode_string = hex_decoded.decode('iso-8859-2')

Now you can pass my_unicode_string to your favourite XML library.
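
A minimal sketch of that last step, assuming the standard-library `xml.etree.ElementTree` is the XML library in use (the question does not say which one). The element names `device` and `name` and the sample bytes are made up for illustration. The point is that the text stays a Unicode string and the library does the UTF-8 encoding when it serializes:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: stands in for my_unicode_string decoded from Latin-2 bytes
my_unicode_string = b"d\xb3uga nazwa".decode("iso-8859-2")

# Tag names are invented for this example
root = ET.Element("device")
name = ET.SubElement(root, "name")
name.text = my_unicode_string  # Unicode text, not bytes

# The library encodes to UTF-8 while serializing
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()
```

To write straight to disk instead, a file name can be passed to `write()` in place of the `BytesIO` buffer.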

Alastair McCormack
  • Thanks for the answer. The actual input, which I receive from the UART, is shown in the first post. There is nothing I can do about it; that's just what I can read from my UART peripheral. I know that those characters are encoded in the following way: 1. Each byte ([0:2], [2:4]...) is a hex number which represents an ISO-8859-2 character. 2. On each line, the first byte is the last (LE/BE). 3. "00" indicates that the input string's length was not divisible by 4. – user2046193 Feb 02 '16 at 09:13
  • ISO-8859-2 doesn't have byte endianness, as each character is only 1 byte, so the string must also be encoded into a uint32? Perhaps you can share some code from the remote side, as it's not clear at all – Alastair McCormack Feb 02 '16 at 09:20
  • Check out first post. Sadly I can't show you all the code because of the IP agreement. – user2046193 Feb 02 '16 at 09:29
2

Interesting sample data. Ideally your sample data should be a direct print of the raw data received from PySerial. If you actually are receiving the raw bytes as 8-digit hexadecimal values, then:

#!python3
from binascii import unhexlify
data = b''.join(unhexlify(x)[::-1] for x in b'''\
68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069'''.splitlines())

print(data.decode('iso-8859-2'))

Output:

W chuj bardzo długa nazwa jakiejś zapyziałej pipidówy, brudnej ulicyumer najgorszej rudery we wsi

Google Translate of Polish to English:

The dick very long name some zapyziałej Small Town , dirty ulicyumer worst hovel in the village
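
For data arriving one word at a time, the same idea can be wrapped in a small helper. This is only a sketch under the assumptions stated in the question: each word is 8 hex digits holding 4 Latin-2 bytes in reversed order, and trailing `00` bytes are padding. The name `decode_words` is hypothetical, and the last input word is invented to show the padding case:

```python
def decode_words(hex_words):
    """Turn 8-digit hex words (4 Latin-2 bytes each, last byte first) into text."""
    # Convert each word to 4 raw bytes and reverse them into reading order
    raw = b"".join(bytes.fromhex(word)[::-1] for word in hex_words)
    # Assumption: trailing 00 bytes only pad the last word, so drop them
    return raw.rstrip(b"\x00").decode("iso-8859-2")

# Three words from the sample data, plus one made-up padded word
words = ["B364206F", "20616775", "777A616E", "00000061"]
text = decode_words(words)
```

Without the `rstrip`, the decoded string would keep the padding as trailing NUL characters, which are not allowed in XML text.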
Mark Tolonen
-1

This topic is closed. Working code that handles what needs to be done:

x=177
x.to_bytes(1, byteorder='big').decode("ISO-8859-2")
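
Note that `int.to_bytes` only exists in Python 3. For a whole list of code values, such as `int_vals` from the question, a sketch that should run on both Python 2.7 and Python 3 builds a `bytearray` first and decodes it in one step. The sample values here are the six bytes quoted in the comments on the question:

```python
# Latin-2 code values, i.e. b'\xb1\xea\xb3\xf3\xbc\xe6' as integers
int_vals = [177, 234, 179, 243, 188, 230]

# bytearray accepts a list of ints in the range 0..255
text = bytearray(int_vals).decode("iso-8859-2")

# Re-encode only when raw UTF-8 bytes are actually needed
utf8 = text.encode("utf-8")
```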
user2046193