
I have a dataframe in which one of the columns, EPI_ID, contains a special character (it renders as a square). I want to remove all rows that contain this character. It isn't a standard character, and the similar issues I've found deal with plain strings rather than a dataframe, so I'm having trouble identifying the affected rows. Any suggestions?

df

EPI_ID    stuff
2342F     randoM_words
FER43     predictive_words
u'\u25A1' blank

My attempt:

df[~df['EPI_ID'].apply(lambda x: x.encode('ascii') == True)]

This returns False for every row.

Expected output:

EPI_ID    stuff
2342F     randoM_words
FER43     predictive_words

Edit: the square doesn't render in the mock df above, but the character looks like this: □ (U+25A1).

CandleWax

1 Answer


Assuming your DataFrame looks something like this:

>>> df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + u'\u25A1' + '123', 'PQRX74'], 'STUFF': ['abc', 'def', 'ghi', 'jkl']})

>>> df

   EPI_ID STUFF
0   2343F   abc
1   FER43   def
2  DF□123   ghi
3  PQRX74   jkl

You can use str.contains, which handles regex:

df.loc[df['EPI_ID'].str.contains(r'[^\x00-\x7F]+') == False]

   EPI_ID STUFF
0   2343F   abc
1   FER43   def
3  PQRX74   jkl
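
The same result can be written a little more idiomatically by negating the boolean mask with ~ instead of comparing it to False (using the same df as above):

df.loc[~df['EPI_ID'].str.contains(r'[^\x00-\x7F]+')]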

Regex courtesy of this answer: (grep) Regex to match non-ASCII characters?
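
If you are on Python 3.7+, a regex-free alternative is to map str.isascii over the column. A minimal, self-contained sketch, assuming the EPI_ID values are all strings:

import pandas as pd

df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + '\u25A1' + '123', 'PQRX74'],
                   'STUFF': ['abc', 'def', 'ghi', 'jkl']})

# str.isascii is True only when every character in the string is ASCII,
# so the mask keeps exactly the rows without the square character.
ascii_mask = df['EPI_ID'].map(str.isascii)
print(df.loc[ascii_mask])

As for the attempt in the question: in Python 3, x.encode('ascii') returns a bytes object for ASCII input (and raises UnicodeEncodeError on non-ASCII input), and a bytes object never compares equal to True, which is why the comparison evaluates to False for every row it reaches.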

r.ook