My sample.txt:

é Roméo et Juliette vécu heureux chaque après

My program:

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-

with open("test4", "r") as f:
    s = f.read()
    print(s)
    print(isinstance(s, unicode))
    print(s[0].isalnum())

My output:

é Roméo et Juliette vécu heureux chaque après

False
False

The questions "Python isalpha() and scandics" and "How do I check if a string is unicode or ascii?" led me to believe that both statements should print True.

My hypotheses:

  1. Emacs is using "iso-latin-1" as the file encoding, which is mucking things up

  2. isalnum() depends on something other than encoding

  3. Line 2 of the program (the coding declaration) isn't working

My biggest worry is #2. I do not really care about the result of isalnum(), I just want the result to be consistent for different machines/people. Worst case, I can just roll my own isalnum(); but I am curious why I am experiencing this behaviour in the first place.

Also, I want to be sure my program understands UTF-8 encoded documents across different machines as well.

Any ideas of what is going on?

Moe Sanjaq
  • BTW, `# -*- coding: utf-8 -*-` merely tells the interpreter how to decode the following lines of your script. It has no bearing on the way your script decodes or encodes data it reads from files. If you must use Python 2 to process Unicode you should read https://nedbatchelder.com/text/unipain.html – PM 2Ring Oct 12 '18 at 18:09
  • Note that you can use the `open` function from the `io` module to provide the `encoding` directly: `io.open("filename", "r" , encoding="utf-8")`. Or use the `codecs` module. – Bakuriu Oct 12 '18 at 18:15

1 Answer

Strings (type str) in Python 2.7 are bytes. When you read text from a file, you get bytes, with possibly the line endings changed. Therefore, s is not an instance of type unicode.

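On a str, tests like isalnum() work one byte at a time and, in the default "C" locale, treat the string as ASCII text. ASCII is defined only for codes 0 to 127. Python has no idea, and can have no idea, what characters are represented by values outside this range, because the encoding is not known. If the file is UTF-8, é is stored as the two bytes 0xC3 0xA9, so s[0] is the single byte 0xC3, which is not an ASCII character and therefore is not considered alphanumeric. (For 8-bit strings the method is locale-dependent, so the result can differ if the program changes the locale.)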
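To make the byte-versus-character distinction concrete, here is a minimal sketch that should behave the same on Python 2.7 and Python 3. The sample text is built in memory rather than read from the asker's file, so the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u"\u00e9 Rom\u00e9o"      # the characters 'é Roméo' as a Unicode string
raw = text.encode("utf-8")       # the bytes a UTF-8 encoded file would contain

# Slicing (rather than indexing) keeps a one-byte string on both Python 2 and 3.
first_byte = raw[0:1]
first_byte_value = ord(first_byte)            # 195 (0xC3), the lead byte of é in UTF-8
first_byte_is_alnum = first_byte.isalnum()    # False: 0xC3 is not an ASCII letter or digit
first_char_is_alnum = text[0].isalnum()       # True: é is a letter in Unicode

print(len(text), len(raw))                    # 7 characters, 9 bytes
print(first_byte_value, first_byte_is_alnum, first_char_is_alnum)
```

The 7 versus 9 length difference comes from each é taking two bytes in UTF-8.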

What you want to do is decode the byte string you've read to a Unicode string:

u = s.decode("utf8")

(assuming the string is written to the file in UTF-8 encoding; if that raises a UnicodeDecodeError or gives garbled characters, you can try latin1 or cp437... the latter is what my terminal gives me on Windows 10)

When you do that, u[0].isalnum() is True and isinstance(u, unicode) is also True.
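As a rough sketch of the "try UTF-8, then fall back" idea, the helper below (decode_text is a hypothetical name, not part of any library) tries each candidate encoding in order. It assumes latin-1 is listed last, because latin-1 can decode any byte sequence and so never fails. The two byte strings stand in for the same text saved by an editor as UTF-8 and as iso-latin-1.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def decode_text(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first encoding that works."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


utf8_bytes = u"\u00e9 Rom\u00e9o".encode("utf-8")      # file saved as UTF-8
latin1_bytes = u"\u00e9 Rom\u00e9o".encode("latin-1")  # same text saved as iso-latin-1

u, used = decode_text(utf8_bytes)
u2, used2 = decode_text(latin1_bytes)

print(used, u[0].isalnum())   # utf-8 True
print(used2, u2 == u)         # latin-1 True
```

Guessing like this is only a fallback; if you control how the file is written, saving it as UTF-8 and decoding it as UTF-8 everywhere is the more predictable choice.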

Python 3 works a little differently. You tell Python what encoding to use when you open the file (if you don't, it falls back to a platform-dependent default). Then it translates the strings to Unicode from that encoding as you read them. All strings in Python 3 are Unicode; there's a separate type, bytes, for byte strings. You probably ought to use Python 3 for a lot of different reasons, but its more coherent handling of text is certainly one of those reasons.
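For comparison, here is a small Python 3 sketch of the same read. It writes its own temporary file (the path and contents are made up for the example) so that it runs on its own, and it passes encoding="utf-8" explicitly so the result does not depend on the machine's default.

```python
import os
import tempfile

# Hypothetical test file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test4")
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00e9 Rom\u00e9o et Juliette")

# Text mode with an explicit encoding: read() returns str (Unicode).
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns bytes, with no decoding applied.
with open(path, "rb") as f:
    raw = f.read()

print(type(text).__name__, text[0].isalnum())   # str True
print(type(raw).__name__, raw[:1].isalnum())    # bytes False
```

On Python 2.7, io.open accepts the same encoding argument, so the text-mode part of this sketch carries over if the literals are written as u"..." strings.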

kindall
  • If I am understanding you correctly, s[0] is passing the first byte into isalnum. In that case if I were to have the utf-16 value 〰 (0x3030) at the beginning of my file, shouldn't s[0].isalnum() == True? Since the first byte is 0x30 which translates to 0 in ascii? I tried this and it was not true. For some reason the ord(s[0]) equalled 227.. and I don't see why though – Moe Sanjaq Oct 12 '18 at 18:27
  • @MoeSanjaq A UTF-16 encoded file starts with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark), which is encoded as `FF FE` or `FE FF`, depending on the endianness. But you should get 255 or 254 for `ord(s[0])`, not 227. – lenz Oct 12 '18 at 19:33
  • @Moe Use `with io.open('test4','w',encoding='utf-16le') as f: f.write(u'\u3030')` to write your test file. It's not a UTF-16-encoded file with the correct character if you are getting 227. – Mark Tolonen Oct 12 '18 at 21:05