Unicode string comparison in Python 2.7: how to normalise HTML entities like £ != \xa3

Question

I have a database and a fresh unicode input.

if new_field != old_field:
    update_db(new_field)

In PyCharm both appear identical, even when I hover and expand the "view" box. When I copy and paste them into Notepad they are still identical, e.g.:

u"<li>Laundromatic doofer drier £200 on collection</li>"
u"<li>Laundromatic doofer drier £200 on collection</li>"

What is causing the mismatch is the underlying representation of the pound sign (why can't there be a single Pythonic way?). They are both unicode; type(new_field) is unicode.

I got so frustrated by this that I broke each field (a load of sales blurb) down like so:

>>> set(old_field.split()) ^ set(new_field.split())

u'£200'  # from new_field
u'&#163;200'  # from old_field
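
The token-level diff above can be wrapped in a small helper. This is only a minimal sketch: `diff_tokens` is a hypothetical name, the two sample strings are reconstructed from the example above, and the `unicode_escape` encoding is used purely to make non-ASCII characters visible instead of rendering them as glyphs.

```python
# -*- coding: utf-8 -*-
def diff_tokens(old_field, new_field):
    """Return the whitespace-separated tokens found in only one field."""
    old_tokens = set(old_field.split())
    new_tokens = set(new_field.split())
    return sorted(old_tokens - new_tokens), sorted(new_tokens - old_tokens)


# Sample data reconstructed from the question (assumed, not the real rows).
old_field = u"<li>Laundromatic doofer drier &#163;200 on collection</li>"
new_field = u"<li>Laundromatic doofer drier \xa3200 on collection</li>"

only_old, only_new = diff_tokens(old_field, new_field)

# Escaping exposes the actual code points rather than the rendered glyph.
print([token.encode("unicode_escape") for token in only_old])
print([token.encode("unicode_escape") for token in only_new])
```

Unlike the symmetric difference, this keeps track of which field each odd token came from.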

Is there a better way to compare unicode in python (I'm using 2.7)? i.e. something more universal than

if new_field != old_field.replace(u"&#163;", u"\xa3"):
    update_db(new_field)

The new field came from the web and was then passed to bleach.clean, after which I had to call .encode("utf-8") on it because it was apparently producing characters that sqlite3 rejected as illegal when representing nbsp. The old field was fetched from sqlite and then passed through bleach.clean (as an afterthought), which did not require .encode("utf-8"), since sqlite only stores unicode.

John · Accepted Answer · 2018-03-10T20:10:44.293

from HTMLParser import HTMLParser
new_field = u'\xa3200'
old_field = u'&#163;200'
h = HTMLParser()
equiv = h.unescape(old_field) == h.unescape(new_field)
equiv2 = h.unescape(old_field) == new_field  # To be clear, only "&#...;" style strings get replaced; new_field is left unchanged.
print(u"   {} == {} ? {} and {}".format(new_field, old_field, equiv, equiv2))


   £200 == &#163;200 ? True and True

This is how I did it, credit to Fabich, https://stackoverflow.com/a/38481378/866333.
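
For the comparison in the question, the same idea could be wrapped in a helper that unescapes both sides before comparing. This is a rough sketch rather than a definitive implementation: `same_text` is a made-up name, and the `try`/`except` import is only there so the snippet also runs on Python 3, where `html.unescape` replaces `HTMLParser().unescape`.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape


def same_text(old_field, new_field):
    """Compare two unicode strings after decoding HTML character references."""
    return unescape(old_field) == unescape(new_field)


old_field = u"<li>Laundromatic doofer drier &#163;200 on collection</li>"
new_field = u"<li>Laundromatic doofer drier \xa3200 on collection</li>"

if not same_text(old_field, new_field):
    print("fields differ, update needed")
else:
    print("fields match, nothing to update")
```

One caveat with unescaping both sides: escaped markup such as `&lt;b&gt;` will then compare equal to a literal `<b>`, which may or may not be what you want for sanitised HTML.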

One can't have too many "Decoding ampersand hash strings" type questions; Google just gives me the pound sign back!
