How to fix this unexpected behavior while reading from a file in Python

Question

I am trying to read this simple file line by line in python:

q(A) p(B)
q(z) ∼p(x)

Then I strip the newline from each line and add it to a list.

lst = []
f = open("input.txt", 'r')

t1 = f.readline().rstrip('\n')
t2 = f.readline().rstrip('\n')

lst.append(t1)
lst.append(t2)

print lst

Well the problem is that when I print the content of the list I get the following output:

['q(A) p(B)', 'q(z) \xe2\x88\xbcp(x)']

My file contains the tilde character ~ and I think this causes that behavior. The weird thing is that if I print t1 and t2 individually they appear normally, but printing lst shows the escaped form above.
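To make the difference between printing a string and printing a list concrete, here is a minimal sketch. It assumes the second line of the file is stored as UTF-8 and builds the same bytes in memory rather than reading input.txt. It is written to run on Python 3 as well as 2.7, so the escaped form carries a b prefix on Python 3 that the Python 2 output in the question does not have.

```python
from __future__ import print_function

# The second line of the file as raw UTF-8 bytes
# (this is what a plain open() returns on Python 2).
raw = u"q(z) \u223cp(x)".encode("utf-8")

# repr() escapes every non-ASCII byte; a printed list shows its items this way.
escaped = repr(raw)
print(escaped)

# Decoding turns the three bytes \xe2\x88\xbc into one character, U+223C.
text = raw.decode("utf-8")
print(len(raw), len(text))  # 12 bytes, 10 characters
```

So the data itself is not damaged. The list display simply shows the escaped bytes, while printing a single string sends those bytes to the terminal, which draws the glyph.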

EDIT: Answer

Well, I managed to get exactly what I expected. Anyone who encounters the same problem may refer to this solution:

import codecs

f = codecs.open("input2.txt", 'r', encoding='utf8')

lst = []

t1 = f.readline().rstrip('\n')  
t2 = f.readline().rstrip('\n')  

res1 = ""
res2 = ""

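# 8764 is the decimal code point of U+223C (the tilde operator).
# Replace it with the ASCII tilde and copy every other character unchanged.
for i in xrange(0,len(t1)):
    if ord(t1[i]) == 8764:
        res1 += "~"
    else:
        res1 += chr(ord(t1[i]))  # chr() only accepts values below 256

# Same conversion for the second line.
for i in xrange(0,len(t2)):
    if ord(t2[i]) == 8764:
        res2 += "~"
    else:
        res2 += chr(ord(t2[i]))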


lst.append(res1)
lst.append(res2)

print lst

And the output now is as below:

['q(A) p(B)', 'q(z) ~p(x)']
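For anyone who wants a shorter version of the same idea, the sketch below reads the file as text and uses str.replace instead of a character loop. It is only an illustration under a few assumptions: it uses io.open (which behaves the same on Python 2.7 and 3), it writes a throwaway copy of the two lines to a temporary directory instead of touching the real input.txt, and the name TILDE_OPERATOR is made up for this example.

```python
from __future__ import print_function
import io
import os
import tempfile

TILDE_OPERATOR = u"\u223c"  # U+223C, the character actually in the file

# Create a throwaway file with the same two lines (stand-in for input.txt).
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"q(A) p(B)\nq(z) \u223cp(x)\n")

# Read as text, strip the newline, and swap U+223C for the ASCII tilde.
with io.open(path, "r", encoding="utf-8") as f:
    lst = [line.rstrip(u"\n").replace(TILDE_OPERATOR, u"~") for line in f]

print(lst)
```

On Python 2 the items are unicode objects, so the printed list shows a u prefix on each string. On Python 3 they are plain str.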

This is behaving as expected. Printing normally prints the string, while printing the list prints the *representation* (as if via `repr`), i.e. how you would create this string in Python. — hlt, Dec 07 '15 at 21:19
This isn't a duplicate IMHO. You were expecting the Latin tilde character but you got the unicode tilde operator. I was puzzled too until I looked up the unicode character. — tdelaney, Dec 07 '15 at 21:49

memoselyk · Answer 1 · 2015-12-07T21:35:32.953


The file contains UTF-8 encoded data. The tilde-like character is actually encoded as the byte string '\xe2\x88\xbc'. When you print it, it looks "normal" because something (typically the terminal) decodes those bytes and displays the equivalent Unicode glyph.

Use either codecs.open or the decode method to obtain the data you expect, e.g.

f = codecs.open("input.txt", 'r', 'utf8')

You should then see u'\u223c' instead of '\xe2\x88\xbc'.

Also note that you have codepoint U+223C in your file, but you probably intended to use U+007E.
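A quick way to check which character you really have is to decode the bytes and ask the standard unicodedata module for the character names. This is just a sketch that hard-codes the three bytes from the question instead of reading the file:

```python
from __future__ import print_function
import unicodedata

# The three bytes seen in the list output, decoded as UTF-8.
decoded = b"\xe2\x88\xbc".decode("utf-8")

# Compare the character from the file with the ASCII tilde.
chars = (decoded, u"~")
names = [unicodedata.name(ch) for ch in chars]
codepoints = ["U+%04X" % ord(ch) for ch in chars]

print(names)       # ['TILDE OPERATOR', 'TILDE']
print(codepoints)  # ['U+223C', 'U+007E']
```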

edited Dec 07 '15 at 21:35

answered Dec 07 '15 at 21:28



Just to emphasize, there are multiple tildes in the Unicode spec. You've got the [tilde operator U+223c](http://www.fileformat.info/info/unicode/char/223c/index.htm), which is different from the [tilde character U+007e](http://www.fileformat.info/info/unicode/char/007e/index.htm). – tdelaney Dec 07 '15 at 21:45
