
I have a dataframe in which one of the columns, EPI_ID, contains a special character (it renders as a square). I want to remove all rows that contain this character. It isn't a standard character, and the similar issues I've found deal with plain strings rather than a dataframe, so I'm having trouble identifying the affected rows. Any suggestions?

df

EPI_ID    stuff
2342F     randoM_words
FER43     predictive_words
u'\u25A1' blank

My attempt:

df[~df['EPI_ID'].apply(lambda x: x.encode('ascii') == True)]

This returns False for every row.

Expected output:

EPI_ID    stuff
2342F     randoM_words
FER43     predictive_words

Edit: the square doesn't render in the mock df above, but the character looks like this: □ (U+25A1).

CandleWax

1 Answer


Assuming your DataFrame looks something like this:

>>> df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + u'\u25A1' + '123', 'PQRX74'], 'STUFF': ['abc', 'def', 'ghi', 'jkl']})

>>> df

   EPI_ID STUFF
0   2343F   abc
1   FER43   def
2  DF□123   ghi
3  PQRX74   jkl

You can use str.contains, which handles regex:

df.loc[df['EPI_ID'].str.contains(r'[^\x00-\x7F]+') == False]

   EPI_ID STUFF
0   2343F   abc
1   FER43   def
3  PQRX74   jkl
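
The same result can be written a little more idiomatically by negating the boolean mask with ~ instead of comparing it to False (using the same df as above):

df.loc[~df['EPI_ID'].str.contains(r'[^\x00-\x7F]+')]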

Regex courtesy of this answer: (grep) Regex to match non-ASCII characters?
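
If you are on Python 3.7+, a regex-free alternative is to map str.isascii over the column. A minimal, self-contained sketch, assuming the EPI_ID values are all strings:

import pandas as pd

df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + '\u25A1' + '123', 'PQRX74'],
                   'STUFF': ['abc', 'def', 'ghi', 'jkl']})

# str.isascii is True only when every character in the string is ASCII,
# so the mask keeps exactly the rows without the square character.
ascii_mask = df['EPI_ID'].map(str.isascii)
print(df.loc[ascii_mask])

As for the attempt in the question: in Python 3, x.encode('ascii') returns a bytes object for ASCII input (and raises UnicodeEncodeError on non-ASCII input), and a bytes object never compares equal to True, which is why the comparison evaluates to False for every row it reaches.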

r.ook