I'm working with a data set that has a mix of Polish words spelled with Unicode characters and their ASCII equivalents. For example, the Polish word Łęg is written as Leg in some places. Is there a way to convert the Unicode spellings into ASCII text so that I can compare the two? I'm looking for something like this:

 'Leg' == unicode_to_ascii('Łęg') # this comparison should return True

It seems like there's a way to do this in PHP.


Edit: I've had some limited success using str.normalize() from pandas (which just calls Python's unicodedata.normalize()):

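df[col] = (
    df[col]
    .str.normalize('NFKD')          # split accented letters into base letter + combining mark
    .str.encode('ASCII', 'ignore')  # drop every non-ASCII code point, including the marks
    .str.decode('ASCII')            # turn the ASCII bytes back into str
)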

The problem is that not all characters can be converted without error. For example, trying to encode the lowercase ł character into ASCII gives the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0142' in position 2: ordinal not in range(128)

This works for most other characters, though (e.g. ę is converted to e just fine). How can I ensure that all characters are converted correctly?
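
To make the difference between ę and ł concrete, here is a minimal sketch in plain Python, outside pandas. The helper name fold_ignore is made up for illustration; it only repeats the normalize/encode/decode steps from the snippet above.

```python
import unicodedata

def fold_ignore(text):
    # Hypothetical helper: same three steps as the pandas chain above.
    # NFKD splits decomposable letters into base letter + combining mark;
    # errors="ignore" then silently drops every non-ASCII code point.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# ę (U+0119) decomposes into e (U+0065) + combining ogonek (U+0328).
print(unicodedata.decomposition("ę"))        # 0065 0328
# ł (U+0142) has no decomposition, so NFKD leaves it untouched.
print(repr(unicodedata.decomposition("ł")))  # ''

print(fold_ignore("Łęg"))  # eg  (the Ł is dropped, not converted)

# Without an error handler, encoding the same character raises the
# UnicodeEncodeError quoted above.
try:
    "ł".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)
```

So with 'ignore' the ł is lost silently, and with the default strict handler it raises; in neither case does it become l.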

David
  • I would map individual letters from the Unicode Polish to their ASCII equivalents manually. If the ASCII data used the Polish code page, your life would be somewhat easier... maybe. That is, assuming you do not have an L in Polish. But since you probably do, how can you determine which letter is meant? – Tarik May 03 '20 at 01:42
  • It seems that this is called [transliteration](https://stackoverflow.com/q/58674948/589259). Couldn't find a table though, but now you at least know the term. [Here](https://stackoverflow.com/q/48686148/589259) is another language agnostic resource. – Maarten Bodewes May 03 '20 at 01:54
  • Thanks for the links. I had no idea Ł wasn't mapped to something like L + combining stroke. This explains why the normalization I did didn't work. Sadly, the data set I'm working with has characters from many European languages (Polish, Czech, and German are some of them). Does this mean I would have to write a transliteration table for every single language in the data set? – David May 03 '20 at 02:54
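
One way to combine the two ideas from the comments, as a rough sketch rather than a complete solution: keep the NFKD step for every letter that does decompose (which covers most Polish, Czech, and German diacritics), and add one small table only for the letters that do not. The table name EXTRA_MAP and its contents are assumptions for illustration; it is shared across languages rather than written per language, and would need extending to match whatever the data set really contains.

```python
import unicodedata

# Hypothetical supplementary table: letters with no Unicode decomposition.
# Extend it with whatever non-decomposable letters the data actually has.
EXTRA_MAP = str.maketrans({
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "Đ": "D", "đ": "d",
    "ß": "ss",
})

def unicode_to_ascii(text):
    # 1. Replace the letters that NFKD cannot split.
    text = text.translate(EXTRA_MAP)
    # 2. Split the remaining accented letters into base letter + mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks (and any other leftover non-ASCII).
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(unicode_to_ascii("Łęg") == "Leg")  # True
print(unicode_to_ascii("Dvořák"))        # Dvorak
print(unicode_to_ascii("Straße"))        # Strasse
```

On a DataFrame column this could be applied with df[col].map(unicode_to_ascii). Anything missing from the table is still dropped silently by the 'ignore' handler, so it is worth checking the output for shortened words. Third-party packages such as Unidecode ship much larger tables of this kind, if adding a dependency is acceptable.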

0 Answers