
I am parsing CSV files and would like to remove non-ASCII characters when they appear. Actually, I only need digits, but when I try to remove non-digit characters, I get a UnicodeEncodeError.

I have the following function:

import re

def remove_non_ascii(text):
    return ''.join(re.findall(r"\d+", str(text)))

I also tried this (just to remove non-ASCII characters):

def remove_non_ascii(text):
    return ''.join(i for i in str(text) if ord(i)<128)

When I print the result of the following, I get the correct result (for both functions):

print(remove_non_ascii('E-Mail Adresse des Empfängers'))

However, when I apply the function to a DataFrame column with df[col] = df[col].apply(remove_non_ascii), I get the UnicodeEncodeError.

What am I doing wrong?

Will Vousden
VincFort
  • Possible duplicate of [Regular expression that finds and replaces non-ascii characters with Python](https://stackoverflow.com/questions/2758921/regular-expression-that-finds-and-replaces-non-ascii-characters-with-python) – ayorgo Dec 08 '18 at 20:00
  • Not all digits in Unicode are ASCII, btw. That said, your question is off-topic because it lacks a [mcve]. – Ulrich Eckhardt Dec 08 '18 at 20:02

1 Answer


One possible solution is to import string and change the function to:

import string
def remove_non_ascii(text):
    setV = set(string.printable)
    return ''.join(filter(lambda x: x in setV, text))

This removes every character that is not in string.printable, i.e. anything outside printable ASCII.
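As a quick check of that filter, here is a minimal sketch applied to the sample string from the question. It assumes Python 3, and the name keep_printable_ascii is made up for this illustration. string.printable contains only ASCII digits, letters, punctuation and whitespace, so a character such as ä is dropped.

```python
import string

# string.printable holds ASCII digits, letters, punctuation and whitespace only.
PRINTABLE = set(string.printable)

def keep_printable_ascii(text):
    """Return text with every character outside string.printable removed.

    Hypothetical helper name, used only for this illustration.
    """
    return ''.join(ch for ch in text if ch in PRINTABLE)

sample = 'E-Mail Adresse des Empf\u00e4ngers'  # \u00e4 is the letter ä
print(keep_printable_ascii(sample))  # E-Mail Adresse des Empfngers
```

Note that the ä is deleted rather than transliterated, so the word comes out one letter shorter.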

I just noticed you wrote that you only need digits. Here's a more useful solution that doesn't need to import string:

def remove_non_ascii(text):
    setV = set("1234567890")
    return ''.join(filter(lambda x: x in setV, text))
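To tie this to the comment that not all Unicode digits are ASCII, the sketch below compares a set-based filter with the regex from the question. It assumes Python 3, where `\d` in a str pattern also matches non-ASCII decimal digits unless re.ASCII is passed. The names ASCII_DIGITS and keep_ascii_digits are hypothetical, chosen only for this example.

```python
import re

ASCII_DIGITS = frozenset("0123456789")

def keep_ascii_digits(value):
    """Keep only the characters 0-9; non-string values go through str() first."""
    return ''.join(ch for ch in str(value) if ch in ASCII_DIGITS)

mixed = 'Tel. 030-1234 \u0663'  # ends with ARABIC-INDIC DIGIT THREE

print(keep_ascii_digits(mixed))                              # 0301234
# Plain \d also matches U+0663, so the result is not pure ASCII digits.
print(''.join(re.findall(r'\d+', mixed)) == '0301234')       # False
# With re.ASCII, \d is restricted to 0-9.
print(''.join(re.findall(r'\d+', mixed, flags=re.ASCII)))    # 0301234
```

Applied to a column, this would be used the same way as in the question, df[col] = df[col].apply(keep_ascii_digits). The question does not say which Python version is in use; if it is Python 2, a likely cause of the original error is that str() on a unicode value containing non-ASCII characters tries to encode it as ASCII and raises UnicodeEncodeError.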
Stubbs
  • Thanks, the second answer works perfectly. I just needed to change set = set("1234567890") to setV = set("1234567890"); otherwise it gives the error "local variable referenced before assignment". – VincFort Dec 08 '18 at 20:05
  • Oops, my mistake – Stubbs Dec 08 '18 at 20:06