Unicode Dash not detected by if statement

Question

Using Python 2.7.11.

Dashes from a UTF-8 document I'm reading in are being ignored by the if statements intended to detect them. The dash prints to the console as a '-' character, and its repr() displays as u'-'. Passing the character through ord() gives 45, which is the ordinal of the ordinary ASCII dash.

segment = line[:section_widths[row_index]].strip()
line = line[section_widths[row_index]+1:]
if segment:
    print 'seg'
    if segment is u'-' or segment is '-':
        print 'DASH DETECTED'
        continue
    print "ord %d" % ord(segment[0])

I presume that's supposed to be character 45 (what Unicode calls "HYPHEN-MINUS") and not, for example, EN DASH (U+2013) or EM DASH (U+2014). — Keith Thompson, Jan 21 '16 at 02:09
Yes, the original text was a "HYPHEN-MINUS" character, though in the document it was being used as a placeholder, much like an ellipsis would be. — eadsjr, Feb 03 '16 at 20:40
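
To confirm which dash a document actually contains, the standard unicodedata module can name each character. This is a small sketch; `describe` is a hypothetical helper name, not something from the original code.

```python
from __future__ import print_function
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return 'U+%04X %s' % (ord(ch), unicodedata.name(ch))

for ch in (u'-', u'\u2013', u'\u2014'):
    print(describe(ch))
# U+002D HYPHEN-MINUS
# U+2013 EN DASH
# U+2014 EM DASH
```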

score 3 · Accepted Answer · answered Jan 21 '16 at 02:05

Do not use `is` to check equality; use `==` instead. Two strings with the same contents are not necessarily the same object.

>>> 'stringstringstringstringstring' == 'string' * 5
True
>>> 'stringstringstringstringstring' is 'string' * 5
False

`is` should be used only for identity checks, that is, to test whether two names refer to the very same object.
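
Applied to the check in the question, a minimal sketch might look like the following. `is_dash` is a hypothetical helper name, and the sample strings are made up for illustration; the only real change is comparing with `==`.

```python
from __future__ import print_function

def is_dash(segment):
    """Return True when the stripped segment is exactly one HYPHEN-MINUS (U+002D)."""
    # == compares values, so it does not matter whether the two strings
    # happen to be the same object in memory.
    return segment.strip() == u'-'

# Build the dash at runtime, as happens when slicing a line read from a file.
runtime_dash = u' - '.strip()
print(is_dash(runtime_dash))   # True
print(is_dash(u'--'))          # False
print(is_dash(u'\u2013'))      # False: EN DASH is a different character
```

On Python 2, a byte string '-' also compares equal to u'-', so a single `==` comparison covers both cases tested in the original if statement.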

falsetru

score 0 · Answer 2 · edited May 23 '17 at 10:28

It turns out that in Python 2.7.x, `is` did not behave the same way for my unicode strings as it did for ASCII ones. The distinction is largely explained in the question "String comparison in Python: is vs. ==".

Each unicode string is its own object, and that object is not necessarily the same one that a literal with the same value refers to.

>>> uni = unicode('unicode')
>>> uni == 'unicode'
True
>>> uni is 'unicode'
False
>>> 
>>> asc = str('ascii')
>>> asc == 'ascii'
True
>>> asc is 'ascii'
True

EDIT:

As Mark Tolonen pointed out, this behavior is not consistent, even for integers:

>>> x=1
>>> x is 1
True
>>> x=10000
>>> x is 10000
False

(Run on Python 2.7.11 |Anaconda 2.4.0 (x86_64)| (default, Dec 6 2015, 18:57:58) [GCC 4.2.1 (Apple Inc. build 5577)] on darwin)

Don't rely on this. A Python implementation is free to cache immutable objects, but doesn't have to. Try `x=1` then `x is 1` vs. `x=10000` then `x is 10000`. On CPython, the first is likely to be True and the second is likely to be False. — Mark Tolonen, Jan 21 '16 at 02:16
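
As a rough illustration of that point, the sketch below builds equal values at runtime and compares them both ways. The variable names are made up for the example, and no output is given for the identity check because it depends on the implementation.

```python
from __future__ import print_function

# Two equal values built at runtime; whether they share one object is an
# implementation detail (CPython caches some small integers, for instance).
a = int('10000')
b = int('10000')

print(a == b)   # True: equal values compare equal
print(a is b)   # may be True or False depending on the implementation

dash = u'-- '.strip()[:1]   # a one-character string built at runtime
print(dash == u'-')         # True regardless of any caching
```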
