How to find and remove invalid / meaningless text in python?

Question

I want to remove meaningless or invalid data from cells. (A cell is invalid if it holds a meaningless combination of letters or only numbers.)

My data is below.

ID         A1           B1          C1
1          apple        adfs        banana
2          I love you   mom         111
3          zaaaaf       dad         348080

The expected output is below.

ID         A1           B1          C1
1          apple                    banana
2          I love you   mom         
3                       dad

How can I do this?

Nice set of tags. The only thing used (from the output) I can see is maybe python/pandas. Did you actually try to solve this using any of them? [mre] please. — Patrick Artner, May 27 '20 at 05:50

jezrael · Accepted Answer · 2020-05-27T05:55:11.757


You can compare the values against a dictionary, here the word list from nltk, and remove any token that does not match. Note that valid words can still be removed if they are missing from the nltk word list, as happens with mom below:

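import nltk

# nltk.download("words")  # needed once if the words corpus is not installed yet
words = set(nltk.corpus.words.words())

# keep only the tokens found in the nltk word list, see https://stackoverflow.com/a/41290205
f = lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words)

# apply only to object columns (assumed to hold strings)
cols = df.select_dtypes(object).columns
df[cols] = df[cols].applymap(f)  # in pandas >= 2.1, DataFrame.map replaces applymap
print(df)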
   ID          A1   B1      C1
0   1       apple       banana
1   2  I love you             
2   3              dad
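
If some cells are not strings (for example NaN in blank cells), the tokenizer raises a TypeError. The following is a minimal sketch of a filter that tolerates such cells. The function name keep_known_words and the small SAMPLE_WORDS set are hypothetical stand-ins for illustration, not part of nltk or pandas, and the regular expression only approximates nltk.wordpunct_tokenize:

```python
import re

# Hypothetical stand-in vocabulary for illustration only. In practice this
# could be set(nltk.corpus.words.words()) as in the answer above.
SAMPLE_WORDS = {"apple", "banana", "i", "love", "you", "dad"}


def keep_known_words(value, vocabulary=SAMPLE_WORDS):
    """Keep only the tokens of `value` that appear in `vocabulary`.

    Anything that is not a string (NaN, None, numbers) is returned as an
    empty string instead of raising a TypeError.
    """
    if not isinstance(value, str):
        return ""
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", value)
    return " ".join(t for t in tokens if t.lower() in vocabulary)


# Cells taken from the question, plus a NaN to mimic a blank cell.
cells = ["apple", "adfs", "I love you", "111", "zaaaaf", float("nan")]
cleaned = [keep_known_words(c) for c in cells]
print(cleaned)
```

With a DataFrame, the same function could be applied to the object columns through a lambda, for example df[cols].applymap(lambda v: keep_known_words(v, words)), assuming cols and words are defined as in the answer.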

edited May 27 '20 at 05:55

answered May 27 '20 at 05:47

jezrael


@jezrael I ran your code but got TypeError: expected string or bytes-like object – purplecollar May 27 '20 at 06:29
@purplecollar - It seems to be a data-related problem. Is the data confidential? – jezrael May 27 '20 at 06:31
@jezrael The columns have type object and there are many blank columns. Also, many columns contain sentences. – purplecollar May 27 '20 at 06:40
@purplecollar - Can you test whether the values are strings? [link](https://stackoverflow.com/questions/42672552/pandas-cast-column-to-string-does-not-work/42672574#42672574) – jezrael May 27 '20 at 06:42
@purplecollar - Another idea: does `df[cols] = df[cols].fillna('').applymap(f)` work? – jezrael May 27 '20 at 07:11
@jezrael okay, I will try your solutions and then leave feedback – purplecollar May 28 '20 at 05:23
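
One way to check whether the values really are strings, as asked in the comments, is to count the Python types found in a column. This is only a sketch: count_cell_types and sample_column are hypothetical names, and the sample values are invented to mimic an object column with blank cells:

```python
from collections import Counter


def count_cell_types(values):
    """Count how many cells of each Python type a column contains."""
    return Counter(type(v).__name__ for v in values)


# Hypothetical column mixing strings, a number, and two kinds of blanks.
sample_column = ["banana", 111, float("nan"), None, "348080"]
type_counts = count_cell_types(sample_column)
print(type_counts)
```

Any type other than str in the result points to cells that would make a string tokenizer fail. For a DataFrame column, the values could be passed as df["C1"].tolist(), assuming a column named C1 as in the question.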
