Encoding in Python

Question

I have a problem comparing a string read from a file with a string I entered in the program. They should be equal, but even when I use decode('utf-8') the comparison says they are not. Here's the code:

final = open("info", 'r')
exported = open("final",'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"

And this is how I save the file that I am trying to read:

comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))

for string in comm_p.strings:
    #print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")

info.close()

And at the top of the file I have # -*- coding: utf-8 -*-

Any ideas?

add `print "%r %r" % (line, "Wykształcenie i praca")` right before the comparison line and tell us what it says — georg, Sep 24 '12 at 07:49

score 3 · Accepted Answer · answered Sep 24 '12 at 07:57

This should do what you need:

# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca": #error
        print "ok"

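You need to open the file with the right encoding so that the lines are decoded to Unicode, and since your string literal is not ASCII, you should mark it as Unicode with the u prefix. The strip() call removes the trailing newline that readlines() leaves on each line.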
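To make the difference concrete, here is a minimal sketch rather than the asker's actual setup. It assumes a hypothetical temporary file standing in for "info", and it is written so that it should run on both Python 2.7 and Python 3. It compares the same line read as raw bytes and read as decoded text:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

target = u"Wykształcenie i praca"

# Hypothetical sample file standing in for "info": one UTF-8 encoded line.
path = os.path.join(tempfile.mkdtemp(), "info")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(target + u"\n")

# Binary mode returns raw bytes, much like Python 2's plain open().
with io.open(path, "rb") as f:
    raw_line = f.readline()

# Text mode with an explicit encoding returns decoded Unicode text.
with io.open(path, encoding="utf-8") as f:
    text_line = f.readline()

bytes_match = raw_line.strip() == target                    # bytes vs. Unicode
text_match = text_line.strip() == target                    # Unicode vs. Unicode
decoded_match = raw_line.decode("utf-8").strip() == target  # decode first

print(bytes_match, text_match, decoded_match)
```

Only the comparisons made on decoded text succeed. The raw bytes never compare equal to the Unicode literal, even after stripping the newline.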

score 0 · Answer 2 · edited May 23 '17 at 12:27

The difference is likely a trailing '\n' character.

readlines() doesn't strip the '\n'; see "Best method for reading newline delimited files in Python and discarding the newlines?"
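A small sketch of that behaviour, using an in-memory io.StringIO buffer as an assumed stand-in for a real file:

```python
import io

# In-memory stand-in for a small text file with two lines.
buf = io.StringIO(u"first\nsecond\n")
lines = buf.readlines()

# repr() makes the trailing newline visible.
print(repr(lines[0]))

matches_raw = lines[0] == u"first"                     # newline still attached
matches_stripped = lines[0].rstrip(u"\n") == u"first"  # newline removed

# splitlines() drops the line endings in one step.
clean = u"first\nsecond\n".splitlines()
```

Printing the repr() of a line, as suggested in the comment on the question, is a quick way to spot this kind of invisible difference.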

In general, it is not a good idea to hard-code a Unicode string in your source code; it is better to read it from a resource file.

answered Sep 24 '12 at 07:50 by Ofir (last edited by Community)

you're right, it's difficult to notice that small mistake when you think that encoding causes error :P – adaniluk Sep 24 '12 at 07:55

score 0 · Answer 3 · answered Sep 24 '12 at 07:54

First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.

About your current problem:

You're reading a UTF-8 encoded file (probably), but you're not decoding it. In Python 2 the built-in open() returns raw byte strings; it doesn't do any conversion for you.

So what you need to do (at least):

use codecs.open("info", "r", encoding="utf-8") to read the file
use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
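The two steps above can be sketched as follows. This is a minimal illustration, not the asker's real data: the temporary file and its first line are made up for the example, and the code is written to run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

target = u"Wykształcenie i praca"

# Hypothetical sample file standing in for "info".
path = os.path.join(tempfile.mkdtemp(), "info")
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(u"Imię i nazwisko\n")  # made-up extra line
    out.write(target + u"\n")

found = False
# codecs.open decodes each line to Unicode while reading.
with codecs.open(path, "r", encoding="utf-8") as final:
    for line in final:
        # rstrip() removes the trailing newline before comparing.
        if line.rstrip() == target:
            found = True

print(found)
```

On newer Python versions, io.open (or the built-in open in Python 3) with an encoding argument is usually the preferred way to get the same result.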

score 0 · Answer 4 · answered Sep 24 '12 at 07:59

Use Unicode for string comparison:

>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>

When it comes to strings, Unicode is the smartest move :)

answered Sep 24 '12 at 07:59 by Anuj
