5

I have 2 pandas DataFrames. One containing a list of properly spelled words:

[In]: df1
[Out]:
   words
0  apple
1  phone
2  clock
3  table
4  clean

and one with misspelled words:

[In]: df2
[Out]:
   misspelled
0        aple
1         phn
2        alok
3     garbage
4        appl
5         pho

The goal is to replace the column of misspelled words in the second DataFrame using the list of correctly spelled words from the first DataFrame. The second DataFrame can have multiple repetitions, can be a different size than the first, can have words that aren't in the first DataFrame (or aren't similar enough to match).

I've been trying to use difflib.get_close_matches with some success, but it does not work out perfectly.

This is what I have so far:

x = list(map(lambda x: get_close_matches(x, df1.col1), df2.col1))
good_words = list(map(''.join, x))
l = np.array(good_words, dtype='object')
df2.col1 = pd.Series(l)
df2 = df2[df2.col1 != '']

After applying the transformation, I should get the second DataFrame to look like:

[In]: df2
[Out]:
          0
0     apple
1     phone
2     clock
3       NaN
4     apple
5     phone

If no match is found the row gets replaced with NaN. My problem is that I get a result that looks like this:

[In]: df2
[Out]:
    misspelled
0        apple
1        phone
2   clockclean
3          NaN
4        apple
5        phone

At this time of writing I have not figured out why some of the words are combined. I suspect it has something to do with difflib.get_close_matches matching different words that are similar in length and/or lettering. So far I get aroun ~10% - 15% of the words combined like this out of a whole column. Thanks in advance.

Stealing
  • 187
  • 1
  • 12
  • 1
    Probably a [dupe](https://stackoverflow.com/questions/13928155/spell-checker-for-python). – DYZ Jun 07 '19 at 05:27
  • Not really. Although I'll concede that the outcomes for both are similar, there are two main differences. First in your [question](https://stackoverflow.com/questions/13928155/spell-checker-for-python) its dealing with the actual spell checking as well as replacement. In this question I use `get_close_matches()` for spell checking. The main problem I have is actually replacing the words. Second, I'm using a pandas DataFrame, not a dictionary. – Stealing Jun 07 '19 at 18:38

2 Answers2

5

If want match first value returned by get_close_matches, the cutoff parameter can be adjusted based on your desired threshold, use next with iter for possible add value if no match - here np.nan:

x = [next(iter(x), np.nan) 
          for x in map(lambda x: difflib.get_close_matches(x, df1.words, cutoff = 0.6), df2.misspelled)]
df2['col1'] = x

print (df2)
  misspelled   col1
0       aple  apple
1        phn  phone
2       alok  clock
3    garbage    NaN
4       appl  apple
5        pho  phone
PV8
  • 5,799
  • 7
  • 43
  • 87
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

An alternative can be to use pandas-dedupe.
Since you have a messy dataset and a canonical one (i.e. gazette) you can perform gazetteer deduplication.

pandas-dedupe can be particularly powerful because it combines active learning with logistic regression and clustering. It gives users good control on how to perform deduplication while keeping the hard work to a minimum.

Example code

import pandas as pd
import pandas dedupe

import pandas as pd
import pandas_dedupe

clean_data = pd.DataFrame({'name': ['apple', 'phone', 'clock', 'table', 'clean']})
messy_data = pd.DataFrame({'name':['aple', 'phn', 'alok', 'garbage', 'appl', 'apple', 'clock', 'phone', 'phone']})


dd = pandas_dedupe.gazetteer_dataframe(
    clean_data, 
    messy_data, 
    field_properties = 'name', 
    canonicalize=True,
    )

# At this point, pandas-dedupe will ask you to label a few examples as duplicates or distinct.   
# Once done, you hit finish and the output will look like this:   

# name      cluster id  confidence  canonical_name
# 0 aple    0.0          0.636356   apple
# 1 phn     1.0          0.712090   phone
# 2 alok    2.0          0.492138   clock
# 3 garbage NaN          NaN        NaN
# 4 appl    0.0          0.906788   apple
# 5 apple   0.0          0.921466   apple
# 6 clock   2.0          0.921466   clock
# 7 phone   1.0          0.921466   phone
# 8 phone   1.0          0.921466   phone

I know the question is old, but I hope the example is going to be useful for someone in the future :)

iEriii
  • 403
  • 2
  • 7