Unicode error when trying to remove non-ascii chars

Question

I am parsing CSV files and would like to remove non-ASCII characters when they appear. Actually, I only need digits, but when I try to remove non-digit characters, I get a UnicodeEncodeError.

I have the following function:

import re

def remove_non_ascii(text):
    return ''.join(re.findall(r"\d+", str(text)))

I also tried this version, which only removes non-ASCII characters:

def remove_non_ascii(text):
    return ''.join(i for i in str(text) if ord(i)<128)

When I print the result of the following call, I get the correct result (for both functions):

print(remove_non_ascii('E-Mail Adresse des Empfängers'))

However, when I apply the function to the dataframe column df[col] = df[col].apply(remove_non_ascii), I get the UnicodeEncodeError.

What am I doing wrong?

Possible duplicate of [Regular expression that finds and replaces non-ascii characters with Python](https://stackoverflow.com/questions/2758921/regular-expression-that-finds-and-replaces-non-ascii-characters-with-python) — ayorgo, Dec 08 '18 at 20:00
Not all digits in Unicode are ASCII, btw. That said, your question is off-topic because it lacks a [mcve]. — Ulrich Eckhardt, Dec 08 '18 at 20:02
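
To illustrate the point about Unicode digits, here is a minimal sketch, assuming Python 3, where `\d` in a str pattern matches any Unicode decimal digit unless `re.ASCII` is passed. The sample string is made up for the demonstration.

```python
import re

# Hypothetical sample: "a", then U+0663 (ARABIC-INDIC DIGIT THREE), then ASCII "4".
text = "a\u06634"

# In Python 3, \d on a str pattern matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", text)

# re.ASCII restricts \d to the ten ASCII digits 0-9.
ascii_digits = re.findall(r"\d", text, re.ASCII)

print(len(unicode_digits), len(ascii_digits))
```

So a filter built on `\d` alone can let non-ASCII digits through, while an explicit `[0-9]` class or the `re.ASCII` flag does not.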

Stubbs · Accepted Answer · 2018-12-08T20:06:25.633

1

One possible solution: import string and change the function to

import string
def remove_non_ascii(text):
    setV = set(string.printable)
    return ''.join(filter(lambda x: x in setV, text))

This removes every character that is not in string.printable (ASCII letters, digits, punctuation, and whitespace).

I just noticed you said you only need digits. Here's a more useful solution that doesn't need to import string:

def remove_non_ascii(text):
    setV = set("1234567890")
    return ''.join(filter(lambda x: x in setV, text))
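
A likely reason this works where the original did not, assuming the code ran under Python 2: the values that pandas passes to apply are unicode objects, and calling str() on a unicode object containing non-ASCII characters tries to encode it as ASCII, which raises UnicodeEncodeError. The function above never calls str(), so it avoids that step. The printed test passed because the literal was already a byte string. As a hedged alternative, here is a minimal Python 3 sketch that also tolerates non-string cells such as numbers or NaN. The name keep_ascii_digits is hypothetical and not part of the original answer.

```python
import re

# Matches every character that is not one of the ASCII digits 0-9.
_NON_DIGITS = re.compile(r"[^0-9]")

def keep_ascii_digits(value):
    """Return only the ASCII digits found in value, as a string."""
    # Assumption: Python 3, where str() never raises UnicodeEncodeError.
    # Non-string cells (ints, floats, NaN) are converted to text first.
    if not isinstance(value, str):
        value = str(value)
    return _NON_DIGITS.sub("", value)

print(keep_ascii_digits("E-Mail Adresse des Empfängers 123"))
```

It can be applied the same way as before, with df[col] = df[col].apply(keep_ascii_digits). Note that a float such as 12.5 loses its decimal point and becomes "125", so this only suits columns where the digits alone matter.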

edited Dec 08 '18 at 20:06

answered Dec 08 '18 at 19:48


Thanks, the second solution works perfectly. I just needed to change set = set("1234567890") to setV = set("1234567890"); otherwise it gives the error "local variable referenced before assignment". – VincFort Dec 08 '18 at 20:05
Oops, my mistake – Stubbs Dec 08 '18 at 20:06
