Compare unicode string with byte string

Question

Version: Python 2.7

I'm reading values from a Unicode CSV file and looping through to find a particular product code - a string. The variable p is from the CSV file.

import binascii

sku = '1450'             # sku can contain spaces.
print p, '|', sku
print p == '1450'
print binascii.hexlify(p), '|', binascii.hexlify(sku)
print binascii.hexlify(p) == binascii.hexlify(sku)
print 'repr(p): ', repr(p)

which results in

1450 | 1450
False
003100340035003000 | 31343530
False
repr(p): '\x001\x004\x005\x000\x00'

Q1. What is a future-proof way (for version 3, etc.) to successfully compare?

Q2. The Unicode is little-endian. Why have I got 00 at both ends of the Unicode hex?

Note: attempts at converting to Unicode - u'1450' - don't seem to have any effect on the output.

Thanks.

why do you mention Unicode when all you have shown us is ASCII digits? And when you say a Unicode file, you should also say WHICH Unicode encoding. — Walter Tross, Dec 19 '20 at 18:39
@WalterTross Python character strings are Unicode, even if all the characters are ASCII. — Mark Ransom, Dec 19 '20 at 18:41
@snakecharmerb: repr(p): '\x001\x004\x005\x000\x00'. I'm not familiar with `repr`. — Transistor, Dec 19 '20 at 18:45
[repr](https://docs.python.org/3/library/functions.html#repr) shows you the actual content of the string, as opposed to the user-friendly version you get from `str` (`str` is called when you `print`). In this case, `repr` shows us that there are null bytes (`\x00`) between each digit, and this is a strong indication of a UTF-16 encoding, as Walter Tross has observed (in a now deleted comment). — snakecharmerb, Dec 19 '20 at 18:48
Considering your concerns about a future-proof approach, why are you programming in Python 2 at all? — Ulrich Eckhardt, Dec 19 '20 at 19:36
@Ulrich: I'm an electrical engineer using Inductive Automation's [Ignition!](http://inductiveautomation.com) SCADA system. The latest version of Ignition is using Jython 2.7. See their info [here](https://docs.inductiveautomation.com/display/DOC80/Python+Scripting). I don't know enough to know whether or not Jython numbering tracks the Python numbering but I don't have a choice other than to use the latest version they supply. — Transistor, Dec 19 '20 at 21:43

ti7 · Accepted Answer · 2020-12-19T20:06:41.293

This is probably much easier in Python 3, which separates text (str) from binary data (bytes) and lets open() take an encoding.

Try opening your file with the encoding specified and pass the file object to the csv library (see the examples in the csv module documentation).

import csv

# newline='' lets the csv module handle line endings itself
with open('some.csv', newline='', encoding='UTF-16LE') as fh:
    reader = csv.reader(fh)
    for row in reader:  # reader is iterable
        print(row)  # work with row
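
To see why the original comparison fails, and where the 00 at both ends probably comes from, here is a minimal sketch. The sample line is made up, and it assumes the data was split on an ASCII comma before being decoded, which would match the repr shown in the question. In UTF-16LE a comma is the two bytes 2C 00, so cutting at 2C leaves its 00 at the start of the next field, while the trailing 00 is simply the high byte of the last digit.

```python
# Hypothetical sample data standing in for one line of the CSV file.
raw = u"1449,1450,1451".encode("utf-16-le")

# Splitting the undecoded bytes on an ASCII comma cuts each two-byte
# comma (2C 00) in half, so its 00 ends up in front of the next field.
broken_fields = raw.split(b",")
p_bytes = broken_fields[1]   # same bytes as repr(p) in the question

# Decoding first and splitting afterwards keeps every code unit intact,
# so the field can be compared with an ordinary text string.
text = raw.decode("utf-16-le")
fields = text.split(u",")
sku = u"1450"
match = fields[1] == sku
```

Written this way (u"" and b"" literals, explicit decode) the snippet runs on Python 3 and should also run on 2.7, which is about as future-proof as the comparison gets: decode bytes to text once, at the boundary, and compare text with text.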

It emerged in the comments that the file is actually read from an FTP server.
Switching the transfer from text mode to binary mode (ftp.retrbinary) and reading the result through an io.TextIOWrapper() may work.

Out now with even more context managers!

import io
import csv
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login()  # anonymous login; pass user and password if the server needs them
    with io.BytesIO() as binary_buffer:
        # read all of products.csv into a binary buffer
        ftp.retrbinary("RETR products.csv", binary_buffer.write)
        binary_buffer.seek(0)  # rewind file pointer
        # create a text wrapper to associate an encoding with the file-like for reading
        # (newline="" lets the csv module handle line endings itself)
        with io.TextIOWrapper(binary_buffer, encoding="UTF-16LE", newline="") as csv_string:
            for row in csv.reader(csv_string):
                print(row)  # work with row
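
The TextIOWrapper step can be tried without an FTP server. In this sketch the payload is a made-up stand-in for what retrbinary would write into the buffer, and the column names are hypothetical. It needs Python 3, since the Python 2 csv module does not accept Unicode text.

```python
import csv
import io

# Hypothetical payload standing in for the bytes an FTP download
# would write into the buffer.
payload = u"sku,name\r\n1450,Widget\r\n".encode("utf-16-le")

binary_buffer = io.BytesIO(payload)

# Wrap the bytes in a text layer that decodes UTF-16LE on the fly;
# newline="" leaves line-ending handling to the csv module.
with io.TextIOWrapper(binary_buffer, encoding="utf-16-le", newline="") as text_stream:
    rows = list(csv.reader(text_stream))

# Each field is now ordinary text, so a plain comparison works.
found = any(row[0] == u"1450" for row in rows)
```

Because the decoding happens before the csv module splits the fields, no stray null bytes end up in the values.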

edited Dec 19 '20 at 20:06

answered Dec 19 '20 at 18:42


@WalterTross I think so too +`LE`; updated! – ti7 Dec 19 '20 at 18:45
How do you know it's little-endian and not big-endian? With every other byte being zero there's no way to know, especially when the leading/trailing byte of the previous/next string is also included. – Mark Ransom Dec 19 '20 at 18:50
@MarkRansom stated in the Question! – ti7 Dec 19 '20 at 18:55
@MarkRansom: I'm reading a CSV file by FTP from an industrial HMI. The HMI user manual states that the data is little-endian. – Transistor Dec 19 '20 at 18:59
@Transistor well how did I miss that? Thanks! And my previous comment tells you why there's both a leading and trailing zero byte, please let me know if you need more explanation. – Mark Ransom Dec 19 '20 at 19:02
@ti7: Thank you for your answer. I'm actually reading the file over FTP into StringIO. `r = StringIO() | ftp.retrlines('RETR products.csv', r.write)`. Can you suggest an easy way to do the conversion without saving to file and then reading it in? – Transistor Dec 19 '20 at 19:07
Hmm .. you may be able to use `ftp.retrbinary` into a `io.BytesIO` and then decode it (`.decode("UTF-16LE")`) after reading. https://docs.python.org/3/library/ftplib.html#ftplib.FTP.retrbinary – ti7 Dec 19 '20 at 19:11
@Transistor I updated my Answer to include an (untested) version of this! – ti7 Dec 19 '20 at 19:31
@Transistor fixed a critical bug: should be `BytesIO()`, not `BytesIO` – ti7 Dec 19 '20 at 20:07
Thank you. That gave me enough of a lead to figure it out. (There were some more details that I didn't want to trouble you with.) I got an `AttributeError` on the `with FTP` line, as `FTP` in 2.7 doesn't have an `__exit__` method. See [here](https://stackoverflow.com/a/7447369/3934992) for the answer that helped me resolve this. – Transistor Dec 19 '20 at 23:27
@Transistor Excellent! Though I do recommend converting to Python 3 if possible. You can make a context manager out of any function that it makes sense to "close", using a `@contextmanager def fn(): x = foo(); try: yield x; finally: x.close()` design, which can make the logic cleaner and safer (the `.close()` call sits in a `finally:` block). `contextlib.contextmanager` is available in some form from Python 2.5+ https://docs.python.org/2.7/library/contextlib.html#contextlib.contextmanager – ti7 Dec 19 '20 at 23:52
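
The context-manager pattern from the last comment, written out as a minimal sketch. `FakeConnection` and `closing_connection` are hypothetical names used only so the example runs offline; they are not part of ftplib. With real code the object passed in would be an `FTP` instance.

```python
from contextlib import contextmanager


class FakeConnection(object):
    """Hypothetical stand-in for a connection object with a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def closing_connection(conn):
    # Hand the connection to the with-block, then close it no matter
    # how the block exits (normally or through an exception).
    try:
        yield conn
    finally:
        conn.close()


conn = FakeConnection()
with closing_connection(conn) as c:
    open_inside = not c.closed

closed_after = conn.closed
```

The standard library's `contextlib.closing` wraps any object that has a `close()` method in the same way, so on 2.7 it may be enough to write `with closing(FTP(...)) as ftp:` instead of defining a helper.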
