I'm working with a data set that has a mix of Polish words spelled with Unicode characters and their ASCII equivalents. For example, the Polish word Łęg is written as Leg in some places. Is there a way to convert the Unicode spellings into ASCII text so that I can compare the two? I'm looking for something like this:
'Leg' == unicode_to_ascii('Łęg') # this comparison should return True
It seems there's a way to do this in PHP.
Edit: I've had some limited success using str.normalize() from pandas (which just calls Python's unicodedata.normalize()):
df[col] = (
    df[col]
    .str.normalize('NFKD')
    .str.encode('ASCII', 'ignore')
    .str.decode('utf-8')
)
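To make clear what that chain does to a single string, here is a minimal plain-Python sketch of the same three steps. The helper names explain_nfkd and strip_marks are made up for illustration; only unicodedata.normalize() and unicodedata.name() come from the standard library.

```python
import unicodedata

def explain_nfkd(text):
    """Return the names of the code points produced by NFKD normalization."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize('NFKD', text)]

def strip_marks(text):
    """Decompose the text, then drop every code point that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# NFKD splits 'ę' into a plain 'e' followed by a combining ogonek.
print(explain_nfkd('ę'))   # ['LATIN SMALL LETTER E', 'COMBINING OGONEK']

# Encoding with errors='ignore' then discards the combining marks.
print(strip_marks('gęś'))  # ges
```

So the approach only works for letters that Unicode defines as a base letter plus a combining mark.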
The problem is that not all characters can be converted without error. For example, trying to encode the lowercase ł character into ASCII gives the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u0142' in position 2: ordinal not in range(128)
This works for most other characters, though (e.g., ę is converted to e just fine). How can I ensure that all characters are converted correctly?
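As far as I can tell, the difference is that ł has no Unicode decomposition at all, so NFKD leaves it untouched, whereas ę decomposes into e plus a combining mark. Below is a rough sketch of one possible workaround, assuming Ł/ł are the only such letters in the data: translate them by hand before normalizing. The NO_DECOMPOSITION table is my own assumption and would need more entries for other languages (e.g. ø or đ).

```python
import unicodedata

# 'ę' has a canonical decomposition, 'ł' does not.
print(unicodedata.decomposition('ę'))        # 0065 0328
print(repr(unicodedata.decomposition('ł')))  # ''

# Hand-written mapping (an assumption) for letters NFKD cannot split.
NO_DECOMPOSITION = str.maketrans({'Ł': 'L', 'ł': 'l'})

def unicode_to_ascii(text):
    """Map undecomposable letters by hand, then strip combining marks."""
    text = text.translate(NO_DECOMPOSITION)
    text = unicodedata.normalize('NFKD', text)
    return text.encode('ascii', 'ignore').decode('ascii')

print('Leg' == unicode_to_ascii('Łęg'))  # True
```

With pandas, something like df[col].map(unicode_to_ascii) should apply the same function to a string column, provided the column has no missing values.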