Is there a way to identify and create a list of all misspelled words in a dataframe?

Question

I have a dataframe with a column that has many misspelled words in it.

I would like to (a) list the misspelled words from each cell in the next column and (b) produce a list of all unique misspelled words found (no duplicates).

For example I have,

Column 1
I worked fertl for a long time.
I worked at fhe desk job.
I am seeing a prw of it. 
cia and nba are both cool places to work

Desired output:

Column 1	Column 2
I worked fertl for a long time.	fertl
I worked at fhe desk job.	fhe
I am seeing a prw of it.	prw
cia and nba are both cool places to work	cia, nba

I also want to get a list of all of these, like:

{fertl, fhe, prw, cia, nba}

I don't think this has much to do with Pandas. You might want to use a spell checker, as in https://stackoverflow.com/questions/13928155/spell-checker-for-python, but that is something else. – bert wassink, Mar 25 '22 at 21:37

mozway · Answer 1 · 2022-03-25T21:54:41.710

Use a list of known words and flag anything not in it. For example, with the english-words package:

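# Lowercase word set from the english-words package (pip install english-words)
from english_words import english_words_lower_set as words

# For each row, keep the lowercase tokens that are not in the word list
df['Column 2'] = [','.join({w for w in x.lower().split()
                            if w not in words})
                  for x in df['Column 1']]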

Or, using sets:

df['Column 2'] = [','.join(set(x.lower().split())-words)
                  for x in df['Column 1']]
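
Note that splitting on whitespace leaves punctuation attached (e.g. `time.`), so such tokens would be flagged too. Below is a minimal sketch, not part of the original answer, that strips punctuation and also collects the unique set asked for in part (b). `KNOWN_WORDS` and `find_misspelled` are hypothetical names, and the tiny vocabulary is only a stand-in for a real word list such as the one imported above:

```python
import string

# Hypothetical stand-in vocabulary (assumption); in practice use a real
# word list, e.g. the lowercase set from the english-words package.
KNOWN_WORDS = {
    "i", "worked", "for", "a", "long", "time", "at", "desk", "job",
    "am", "seeing", "of", "it", "and", "are", "both", "cool",
    "places", "to", "work",
}


def find_misspelled(text, vocabulary):
    """Return words of `text` missing from `vocabulary`, in order of first appearance."""
    unknown = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)  # "time." -> "time"
        if word and word not in vocabulary and word not in unknown:
            unknown.append(word)
    return unknown


sentences = [
    "I worked fertl for a long time.",
    "I worked at fhe desk job.",
    "I am seeing a prw of it. ",
    "cia and nba are both cool places to work",
]

# (a) one entry per row, (b) the set of unique unknown words
per_row = [find_misspelled(s, KNOWN_WORDS) for s in sentences]
column_2 = [", ".join(found) for found in per_row]
all_misspelled = set().union(*per_row)
```

With a dataframe, the same function could be applied as `df['Column 2'] = df['Column 1'].map(lambda s: ', '.join(find_misspelled(s, words)))`. Keep in mind that a plain word list will still flag acronyms and proper nouns it does not contain.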

edited Mar 25 '22 at 21:54

answered Mar 25 '22 at 21:42

If `words` is a set, then you can actually just use `words.intersection(sentence.split())`. No need for a comprehension (unless the order of misspelled words is somehow important). – ddejohn Mar 25 '22 at 21:44
@ddejohn yes that's right. I however don't know this library, I just provided it as example so I went for the safe way ;) – mozway Mar 25 '22 at 21:46
@ddejohn in this case I think you meant `difference` (`set(sentence.split()).difference(words)`) – mozway Mar 25 '22 at 21:47
Ah, yes, correct. In that case though, you'd need `set(sentence.split()) - words`. And yes, `english_words_set` is a `set` :D – ddejohn Mar 25 '22 at 21:50
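
To illustrate the point settled in the comments above: intersection returns the tokens that are in the word list, while set difference returns the ones that are not. A small sketch with a hypothetical five-word vocabulary (an assumption, standing in for the real word set):

```python
# Hypothetical tiny word list, used only for illustration
vocabulary = {"i", "worked", "at", "desk", "job"}
tokens = set("i worked at fhe desk job".split())

known = vocabulary.intersection(tokens)  # tokens found in the word list
unknown = tokens - vocabulary            # tokens missing from it (candidate misspellings)
```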
