0

I want to remove meaningless or invalid data on cell. (A combination of meaningless alphabets or only numbers in cells)

My data is below.

ID         A1           B1          C1
1          apple        adfs        banana
2          I love you   mom         111
3          zaaaaf       dad         348080

Expected output as below.

ID         A1           B1          C1
1          apple                    banana
2          I love you   mom         
3                       dad          

How can I this?

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
purplecollar
  • 176
  • 1
  • 14
  • Nice set of tags. The only thing used (from the output) I can see is maybe python/pandas. Did you actually try to solve this using any of them? [mre] please. – Patrick Artner May 27 '20 at 05:50

1 Answers1

2

You can compare values to some dictonary, here from ntlk and if not match remove values, but still is possible some values are removed like mom if not exist in dictionary ntlk:

import nltk
words = set(nltk.corpus.words.words())

#https://stackoverflow.com/a/41290205
f = lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words)

#apply only for object columns (obviously strings)
cols = df.select_dtypes(object).columns
df[cols] = df[cols].applymap(f)
print (df)
   ID          A1   B1      C1
0   1       apple       banana
1   2  I love you             
2   3              dad        
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252