1

I need to compare a unicode string coming from a utf-8 file with a constant defined in the Python script.

I'm using Python 2.7.6 on Linux.

If I run the script below within Spyder (a Python IDE), it works, but if I invoke it from a terminal, the second test fails. Do I need to import or define something in the terminal before invoking the script?

Script ("pythonscript.py"):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv

some_french_deps = []
idata_raw = csv.DictReader(open("utf8_encoded_data.csv", 'rb'), delimiter=";")
for rec in idata_raw:
    depname = unicode(rec['DEP'],'utf-8')
    some_french_deps.append(depname)

test1 = "Tarn"
test2 = "Rhône-Alpes"
if test1==some_french_deps[0]:
  print "Tarn test passed"
else:
  print "Tarn test failed"
if test2==some_french_deps[2]:
  print "Rhône-Alpes test passed"
else:
  print "Rhône-Alpes test failed"

utf8_encoded_data.csv:

DEP
Tarn
Lozère
Rhône-Alpes
Aude

Run output from Spyder editor:

Tarn test passed
Rhône-Alpes test passed

Run output from terminal:

$ ./pythonscript.py 
Tarn test passed
./pythonscript.py:20: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if test2==some_french_deps[2]:
Rhône-Alpes test failed
Martijn Pieters
Antonello
  • Gah, Spyder does all *sorts* of things to break a normal Python environment. In this case I strongly suspect the default implicit conversion encoding is changed. – Martijn Pieters Jun 04 '14 at 10:16
  • what does `locale` show from bash? – Padraic Cunningham Jun 04 '14 at 10:48
  • @PadraicCunningham: the locale has no influence on how Python coerces between Unicode and byte strings. – Martijn Pieters Jun 04 '14 at 11:24
  • @MartijnPieters, yep, I misunderstood the problem originally. I don't think there is any need to use u"Tarn" if the coding is already declared, just comparing should work or am I missing something? – Padraic Cunningham Jun 04 '14 at 11:35
  • @PadraicCunningham: The codec only tells Python how to interpret newlines and how to decode bytes for Unicode literals. Byte string literals are not automatically decoded to Unicode values when you declare a codec, no. – Martijn Pieters Jun 04 '14 at 11:37
  • @MartijnPieters, with the encoding declared, doesn't using `depname = rec['DEP']` just leave test2 as `'Rh\xc3\xb4ne-Alpes'`, so the comparison with `some_french_deps[2]` works? – Padraic Cunningham Jun 04 '14 at 11:45
  • @PadraicCunningham: yes, not decoding to Unicode would work there. However, I always advise working with Unicode values; decode early, encode late (i.e. the Unicode sandwich). – Martijn Pieters Jun 04 '14 at 11:47
  • @MartijnPieters, last question! Is there a particular reason for doing it that way? – Padraic Cunningham Jun 04 '14 at 11:49
  • @PadraicCunningham: Imagine having to process more than one file, with different encodings. Or needing to slice the text (get only the first 20 characters, for example). Or having to normalise the text, because an input file used a plain character plus combining accent while other data used the combined character instead. Etc. Unicode is hard enough without having to take into account additional encoding idiosyncrasies. You don't leave numeric data in text form either; you decode to native types too. – Martijn Pieters Jun 04 '14 at 11:53
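
The slicing and normalisation points in the last comment can be sketched as follows. This is a minimal illustration, not code from the thread: the variable names are hypothetical, the sample value is "Lozère" from the question's CSV, and the snippet is written so that it should run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The same visible text, stored two different ways (illustrative values).
composed = u"Loz\xe8re"        # e-grave as a single code point, U+00E8
decomposed = u"Loze\u0300re"   # plain "e" followed by combining grave, U+0300

# Without normalisation the two values compare unequal.
same_before = composed == decomposed

# After normalising to NFC, the decomposed form becomes the composed one.
same_after = unicodedata.normalize("NFC", decomposed) == composed

# Slicing text keeps whole characters; slicing UTF-8 bytes may split one.
encoded = composed.encode("utf-8")
text_prefix = composed[:4]     # four characters: L, o, z, e-grave
byte_prefix = encoded[:4]      # four bytes: L, o, z, and half of e-grave

try:
    byte_prefix.decode("utf-8")
    byte_slice_valid = True
except UnicodeDecodeError:
    # The slice ended in the middle of a two-byte sequence.
    byte_slice_valid = False
```

Working on decoded text avoids both problems, which is the motivation for decoding early and encoding late.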

3 Answers

1

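You are comparing a byte string (type str) with a unicode value. When the two types are compared, Python 2 implicitly decodes the byte string using the default encoding, which is normally ASCII. "Tarn" is pure ASCII, so that test passes everywhere, but the UTF-8 bytes of "Rhône-Alpes" cannot be decoded as ASCII, so in the terminal Python emits the UnicodeWarning and treats the two values as unequal. Spyder has changed the default encoding from ASCII to UTF-8; your byte strings are encoded as UTF-8, so under Spyder the comparison succeeds.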

The solution is to not use byte strings; use unicode literals for your two test values instead:

test1 = u"Tarn"
test2 = u"Rhône-Alpes"

Changing the system default encoding is, in my opinion, a terrible idea. Your code should use Unicode correctly instead of relying on implicit conversions; changing the rules of implicit conversion only adds confusion and does not make the task any easier.
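
The difference between the two environments can be reproduced with explicit decodes. This is a rough sketch, assuming the terminal uses the stock ASCII default and Spyder uses UTF-8; the explicit `decode` calls stand in for the implicit conversion Python 2 attempts during the comparison, the variable names are illustrative, and the snippet should run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

# The UTF-8 bytes that the byte string literal for "Rhone-Alpes" (with
# o-circumflex) holds when the source file is saved as UTF-8.
raw = b"Rh\xc3\xb4ne-Alpes"

# The unicode value obtained by decoding the CSV field.
expected = u"Rh\xf4ne-Alpes"

# Terminal case (assumed ASCII default): byte 0xc3 is not valid ASCII,
# so the conversion fails and Python 2 treats the values as unequal.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Spyder case (assumed UTF-8 default): the conversion succeeds and matches.
utf8_matches = raw.decode("utf-8") == expected

print("ASCII decode works: %s" % ascii_ok)
print("UTF-8 decode matches: %s" % utf8_matches)
```

With unicode literals on both sides, no conversion is attempted at all, so the result no longer depends on the default encoding.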

Martijn Pieters
1

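Just using depname = rec['DEP'] (without decoding) should also work: the source file is saved and declared as UTF-8, and the CSV is UTF-8 encoded, so both sides are identical byte strings.

If you print some_french_deps[2], it will print Rhône-Alpes, so your comparison will work.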
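
A small sketch of why the byte-for-byte comparison works here, and when it would stop working. The CSV content is simulated in memory rather than read from the question's file, the Latin-1 case is a hypothetical added for contrast, and the snippet is written to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

# Simulated content of utf8_encoded_data.csv as raw UTF-8 bytes.
csv_bytes = u"DEP\nTarn\nLoz\xe8re\nRh\xf4ne-Alpes\nAude\n".encode("utf-8")
rows = csv_bytes.splitlines()[1:]   # skip the "DEP" header line

# What the byte string literal holds when the script is saved as UTF-8.
test2_utf8 = u"Rh\xf4ne-Alpes".encode("utf-8")

# Both sides are the same bytes, so the comparison succeeds.
bytes_match = rows[2] == test2_utf8

# Hypothetical: the same literal in a Latin-1 encoded script holds
# different bytes, so comparing against UTF-8 data fails.
test2_latin1 = u"Rh\xf4ne-Alpes".encode("latin-1")
latin1_match = rows[2] == test2_latin1
```

The comparison is only reliable while the script and the data share one encoding.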

Padraic Cunningham
0

Because you are comparing a byte string (str) object with a unicode object, Python emits this warning.

To fix this, you can write

test1 = "Tarn"
test2 = "Rhône-Alpes"

as

test1 = u"Tarn"
test2 = u"Rhône-Alpes"

where the 'u' indicates it is a unicode object.

donglixp