
I have a dataframe with a column that has many misspelled words in it.

I would like to (a) list the misspelled words from each cell in a new column next to it and (b) produce a set of all unique misspelled words found (no duplicates).

For example, I have:

Column 1
I worked fertl for a long time.
I worked at fhe desk job.
I am seeing a prw of it. 
cia and nba are both cool places to work

Desired output:

Column 1                                   Column 2
I worked fertl for a long time.            fertl
I worked at fhe desk job.                  fhe
I am seeing a prw of it.                   prw
cia and nba are both cool places to work   cia, nba

I also want a set of all of these, like:

{fertl, fhe, prw, cia, nba}

  • I don't think this has much to do with Pandas. You might want to use a spell checker as in https://stackoverflow.com/questions/13928155/spell-checker-for-python, but that is something else. – bert wassink Mar 25 '22 at 21:37

1 Answer

Use a list of known words, for example the english-words package:

from english_words import english_words_lower_set as words

df['Column 2'] = [','.join({w for w in x.lower().split()
                            if w not in words})
                  for x in df['Column 1']]

Or, using a set difference:

df['Column 2'] = [','.join(set(x.lower().split())-words)
                  for x in df['Column 1']]
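
Note that `x.lower().split()` leaves punctuation attached, so a token like `time.` would be reported even when `time` is in the word list. Below is a minimal sketch that tokenizes with a regular expression and also builds the set of unique misspellings asked for in (b). The small `KNOWN_WORDS` set and the helper name `misspelled` are hypothetical stand-ins so the example runs on its own; in practice you would pass the real `words` set instead.

```python
import re

# Hypothetical stand-in vocabulary; replace with a full word list such as `words`.
KNOWN_WORDS = {
    "i", "worked", "for", "a", "long", "time", "at", "desk", "job", "am",
    "seeing", "of", "it", "and", "are", "both", "cool", "places", "to", "work",
}

def misspelled(text, vocabulary=KNOWN_WORDS):
    """Return words in `text` missing from `vocabulary`, in order of first appearance."""
    # Keep only runs of letters/apostrophes so trailing punctuation is dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    found = []
    for token in tokens:
        if token not in vocabulary and token not in found:
            found.append(token)
    return found

rows = [
    "I worked fertl for a long time.",
    "I worked at fhe desk job.",
    "I am seeing a prw of it.",
    "cia and nba are both cool places to work",
]

per_row = [misspelled(row) for row in rows]          # one list per cell
column_2 = [", ".join(found) for found in per_row]   # values for the new column
unique_misspelled = set().union(*per_row)            # all unique misspellings
```

With a DataFrame, the same helper could be applied as `df['Column 2'] = df['Column 1'].map(lambda s: ', '.join(misspelled(s, words)))`. Acronyms such as `cia` and `nba` are flagged only because they are absent from the word list, so the result depends entirely on the list you choose.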
mozway
  • If `words` is a set, then you can actually just use `words.intersection(sentence.split())`. No need for a comprehension (unless the order of misspelled words is somehow important). – ddejohn Mar 25 '22 at 21:44
  • @ddejohn yes that's right. I however don't know this library, I just provided it as example so I went for the safe way ;) – mozway Mar 25 '22 at 21:46
  • @ddejohn in this case I think you meant `difference` (`set(sentence.split()).difference(words)`) – mozway Mar 25 '22 at 21:47
  • Ah, yes, correct. In that case though, you'd need `set(sentence.split()) - words`. And yes, `english_words_set` is a `set` :D – ddejohn Mar 25 '22 at 21:50
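
To make the distinction discussed in these comments concrete, here is a small sketch with a hypothetical five-word `vocabulary` (not the real english-words set): intersection keeps the words that are in the list, while difference keeps the ones that are not, which is what the question asks for.

```python
# Hypothetical tiny word list, for illustration only.
vocabulary = {"i", "worked", "at", "desk", "job"}
tokens = set("i worked at fhe desk job".split())

known = vocabulary.intersection(tokens)   # words that ARE in the list
unknown = tokens - vocabulary             # words that are NOT in the list
same_unknown = tokens.difference(vocabulary)  # method form of the same operation
```

The order of operands matters for difference: `vocabulary - tokens` would instead return every listed word that the sentence does not use.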