This question is linked to "Searching for Unicode characters in Python".

I read a Unicode text file using Python's codecs module:

codecs.open('story.txt', 'rb', 'utf-8-sig')

I was trying to search for strings in it, but I'm getting the following warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Is there a special way to compare Unicode strings?

ChamingaD
  • This link might be useful: http://nedbatchelder.com/text/unipain.html – Robᵩ Aug 12 '13 at 18:36
  • Please post a short, self-contained complete example program. Reduce your code to a five- to ten-line program that produces that error message and post the short program into your question. See http://SSCCE.ORG for more information. – Robᵩ Aug 12 '13 at 18:41

1 Answer

You may use the == operator to compare unicode objects for equality.

>>> s1 = u'Hello'
>>> s2 = unicode("Hello")
>>> type(s1), type(s2)
(<type 'unicode'>, <type 'unicode'>)
>>> s1==s2
True
>>> 
>>> s3='Hello'.decode('utf-8')
>>> type(s3)
<type 'unicode'>
>>> s1==s3
True
>>> 

But, your error message indicates that you aren't comparing unicode objects. You are probably comparing a unicode object to a str object, like so:

>>> u'Hello' == 'Hello'
True
>>> u'Hello' == '\x81\x01'
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

See how I have attempted to compare a unicode object against a str whose bytes cannot be converted to unicode. For this implicit conversion Python 2 uses the default codec, which is ascii, so any byte outside the ASCII range triggers the warning, even when the bytes are valid UTF-8.

Your program, I suppose, is comparing unicode objects with str objects, and the contents of one of those str objects cannot be decoded as ASCII. This is likely the result of you (the programmer) not knowing which variable holds unicode, which variable holds UTF-8 encoded bytes, and which variable holds the raw bytes read in from a file.
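
One way to avoid the implicit conversion is to decode byte strings explicitly before comparing. The sketch below is only an illustration: `to_text` is a hypothetical helper, and UTF-8 is an assumed encoding that has to match whatever your data really uses.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it first if it is bytes.

    Hypothetical helper; the utf-8 default is an assumption about the data.
    """
    if isinstance(value, bytes):   # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value

raw = b'\xc3\x81'    # two bytes: the UTF-8 encoding of U+00C1
text = u'\xc1'       # the same character as a unicode string

# Both operands are unicode at this point, so no implicit ascii decode
# takes place and no UnicodeWarning is issued.
same = to_text(raw) == text
```

Comparing `raw == text` directly is what produces the warning on Python 2; after the explicit decode, `same` is a comparison between two unicode objects.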

I recommend http://nedbatchelder.com/text/unipain.html, especially the advice to create a "Unicode Sandwich."
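
As a rough sketch of that sandwich idea: decode bytes as soon as they enter the program, work only with unicode in the middle, and encode again on the way out. The function name `find_lines`, the temporary demo file, and its contents are assumptions made for illustration; `utf-8-sig` is taken from the question.

```python
import io
import os
import tempfile

def find_lines(path, needle, encoding='utf-8-sig'):
    """Return the lines of a text file that contain needle.

    needle should be a unicode string (u'...' on Python 2).
    """
    # Bottom slice of bread: io.open decodes the bytes as they are read.
    with io.open(path, 'r', encoding=encoding) as handle:
        # Filling: every comparison here is unicode against unicode.
        return [line.rstrip(u'\n') for line in handle if needle in line]

# Throwaway demo file standing in for the real input.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8-sig') as out:
    out.write(u'caf\xe9 au lait\nplain tea\n')

matches = find_lines(demo_path, u'caf\xe9')
os.remove(demo_path)

# Top slice of bread: encode only when the text leaves the program.
encoded = u'\n'.join(matches).encode('utf-8')
```

Because the search term and the file contents are both unicode by the time they meet, the comparison never falls back to the implicit ascii conversion.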

Robᵩ
  • `u'Hello' == '\xc3\x81'` is valid UTF-8, and still gives the warning. On Python 2 the default codec is `ascii`. – Mark Tolonen Aug 13 '13 at 05:30
  • I get the same warning in my script in this line: `hostnames=[] ... if not (name in hostnames): ...` where `name` contains some strings in another loop. Can you please add an example how to fix such? – rubo77 Aug 07 '14 at 02:37
  • @rubo77 - Please watch & read the presentation I linked to. If that doesn't help, please open a new Stack Overflow question. – Robᵩ Aug 07 '14 at 07:10