Remove non-ASCII characters from string columns in pandas

Question

I have panda dataframe with multiple columns which mixed with values and unwanted characters.

columnA        columnB    columnC        ColumnD
\x00A\X00B     NULL       \x00C\x00D        123
\x00E\X00F     NULL       NULL              456

what I'd like to do is to make this dataframe as below.

columnA  columnB  columnC   ColumnD
AB        NULL       CD        123
EF        NULL       NULL      456

With my codes below, I can remove '\x00' from columnA but columnC is tricky as it is mixed with NULL in certain row.

col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
if df[i].isnull().any() == False:
    if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))

Is there any efficient way to remove unwanted characters from columnC?

Does something like `.map(lambda x: x.translate(fixer) if x != "NULL" else x)` can be of any help ? — Lohmar ASHAR, Feb 19 '18 at 21:56

cs95 · Accepted Answer · 2019-06-25T01:10:11.350

8

In general, to remove non-ascii characters, use str.encode with errors='ignore':

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')

To perform this on multiple string columns, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

Although that still won't handle the null characters in your columns. For that, you replace them using regex:

df2 = df.replace(r'\W+', '', regex=True)

edited Jun 25 '19 at 01:10

answered Feb 19 '18 at 22:02

cs95

379,657
97
704
746

thanks coldspeed, this is very simple and great solution. may I ask what the first line in the codes means? – Joohun Lee Feb 19 '18 at 22:22
@JoohunLee It is an efficient way of determining the column names of string columns. – cs95 Feb 19 '18 at 22:23
thanks, btw, is there a way to delete unwanted character except for certain special character? for example, if I have \x00A\x00-\x00B, applying your codes will return "AB" not "A-B". – Joohun Lee Feb 20 '18 at 00:35
@JoohunLee If you want to retain ASCII characters, see this link: https://stackoverflow.com/a/20078869/4909087 Otherwise, you can use: `x.str.replace('[^\w-]+', '')` add more characters as needed in the character class. – cs95 Feb 20 '18 at 00:36

score 0 · Answer 2 · answered Feb 19 '18 at 23:02

What is the trick with NULL?
If you want to replace string 'NULL' with real NaN use replace:

df.replace('NULL',np.NaN, inplace=True)
print(df.isnull())

Out:

   columnA  columnB  columnC  columnD
0    False     True    False    False
1    False     True     True    False

Or you need replace 'NULL' with empty string, use RegEx in str.replace

df = df.apply(lambda col: col.str.replace(
               r"[\x00|NULL]", "") if col.dtype == object else col)

print (df.isnull())
print (df.values)

Out:


   columnA  columnB  columnC  columnD
0    False    False    False    False
1    False    False    False    False

[['AB' '' 'CD' 123]
 ['EF' '' '' 456]]

Remove non-ASCII characters from string columns in pandas

2 Answers2

Linked