1

Using Python 2.7.11.

Dashes in a UTF-8 document I'm reading are not being detected by the if statements intended to catch them. The dash prints to the console as a '-' character, and its repr() displays as u'-'. Passing the character through ord() gives 45, which is the ordinal of the ordinary dash (hyphen-minus) character.

segment = line[:section_widths[row_index]].strip()
line = line[section_widths[row_index]+1:]
if segment:
    print 'seg'
    if segment is u'-' or segment is '-':
        print 'DASH DETECTED'
        continue
    print "ord %d" % ord(segment[0])
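For anyone repeating the check described above, here is a minimal sketch of how a character can be identified by both its ordinal and its Unicode name, which rules out look-alikes such as EN DASH or EM DASH. The helper name `describe_char` is hypothetical and not part of the original code; `unicodedata.name` is from the standard library.

```python
import unicodedata

def describe_char(ch):
    """Hypothetical helper: return (ordinal, Unicode name) for one character."""
    return ord(ch), unicodedata.name(ch)

# Compare an ordinary hyphen-minus with two look-alike dashes.
for ch in (u'-', u'\u2013', u'\u2014'):
    print(describe_char(ch))
```

If the first element is 45 and the name is HYPHEN-MINUS, the character itself is not the problem and the comparison is the place to look.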
eadsjr
  • I presume that's supposed to be character 45 (what Unicode calls "HYPHEN-MINUS") and not, for example, EN DASH (U+2013) or EM DASH (U+2014). – Keith Thompson Jan 21 '16 at 02:09
  • Yes, the original text was a "HYPHEN-MINUS" character, though in the document it was being used as a placeholder, much like an ellipsis would be. – eadsjr Feb 03 '16 at 20:40

2 Answers

3

Do not use `is` to check equality. Use `==` instead.

>>> 'stringstringstringstringstring' == 'string' * 5
True
>>> 'stringstringstringstringstring' is 'string' * 5
False

`is` should be used only for identity checks, that is, to test whether two names refer to the very same object.
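Applied to the question's loop, the fix might look like the sketch below. It assumes the line is cut into fixed-width sections separated by one character, as the slicing in the question suggests; the function name `non_placeholder_segments` and the sample data are hypothetical.

```python
DASH = u'-'  # HYPHEN-MINUS, ordinal 45

def non_placeholder_segments(line, section_widths):
    """Hypothetical helper: split a line into fixed-width segments,
    skipping empty segments and lone-dash placeholders."""
    segments = []
    for width in section_widths:
        segment = line[:width].strip()
        line = line[width + 1:]  # skip the one-character separator
        if not segment:
            continue
        if segment == DASH:  # compare by value, not by identity
            continue
        segments.append(segment)
    return segments

print(non_placeholder_segments(u'abc   -     xyz', [5, 5, 5]))
```

In Python 2, `u'-' == '-'` is also True for ASCII text, so a single `==` comparison covers both the unicode and the byte-string case that the original condition tried to handle separately.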

falsetru
0

It turns out that in Python 2.7.x, `is` does not behave the same way for unicode strings as it does for ASCII byte strings. The distinction is largely explained here: [ String comparison in Python: is vs. == ]

Each unicode string is its own object, and a unicode object built at runtime is not the same object as an equal string literal, so `is` returns False even though `==` returns True.

>>> uni = unicode('unicode')
>>> uni == 'unicode'
True
>>> uni is 'unicode'
False
>>> 
>>> asc = str('ascii')
>>> asc == 'ascii'
True
>>> asc is 'ascii'
True

EDIT:

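As Mark Tolonen pointed out, this behavior is not consistent: whether `is` returns True for equal immutable values depends on what the implementation happens to cache.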

>>> x=1
>>> x is 1
True
>>> x=10000
>>> x is 10000
False

( Run on Python 2.7.11 |Anaconda 2.4.0 (x86_64)| (default, Dec 6 2015, 18:57:58) [GCC 4.2.1 (Apple Inc. build 5577)] on darwin )
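To make the point concrete without relying on any particular caching behavior, here is a small sketch that compares values built at runtime. The helper names `same_value` and `same_object` are hypothetical; the result of the identity check is deliberately not stated, because it can differ between interpreters and versions.

```python
def same_value(a, b):
    """Equality: do the two objects hold the same value?"""
    return a == b

def same_object(a, b):
    """Identity: are the two names bound to the very same object?"""
    return a is b

typed = u'-'
parsed = u' - '.strip()  # equal to typed, but built at runtime

print(same_value(typed, parsed))   # equality does not depend on how the string was built
print(same_object(typed, parsed))  # implementation-dependent, do not rely on it

big = 10000
also_big = int('10000')
print(same_value(big, also_big))   # equality holds for equal integers
print(same_object(big, also_big))  # implementation-dependent, do not rely on it
```

The one identity result that is always safe to rely on is an object compared with itself; for everything else, `==` is the check that reflects the value.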

eadsjr
  • Don't rely on this. A Python implementation is free to cache immutable objects, but doesn't have to. Try `x=1` then `x is 1` vs. `x=10000` then `x is 10000`. On CPython, the first is likely to be True and the second is likely to be False. – Mark Tolonen Jan 21 '16 at 02:16