Spelling Correction in a DataFrame Object, Python

Question

I need to create a function that takes a DataFrame column containing text, called new_data, and compares its words against a reference. The reference, ref_data, has two columns: the first holds the misspelled word (in the same form as it appears in new_data) and the second holds its corrected version.

Put simply, I need to compare each word of new_data with the first column of ref_data; if it matches, the function should return the word from the second column in the same row.

For example, if a word in new_data matches the word in the third row of ref_data, the word in column 2 of that row replaces it. I can provide more clarification if needed. Here is what I tried:

x = [line for line in ref_data['word']] #x is a list of all incorrect words
y = [line for line in ref_data['final']] #y is a list of all correct words
def replace_words(x): #function
for line in x: #iterate over lines in list
    for word in line.split(): #iterate over words in list
        if word == x:   #i dont know the syntax to compare with it.problem here
           return (word = y)  #i need to return y of the same index.

Please note: each row in new_data consists of a sentence. I could convert it to a list at any time. – Hypothetical Ninja, Feb 13 '14 at 11:29
Why the downvote? Sword's function is not the best approach, but he's shown effort and the problem is answerable. Ambiguous downvotes should come with constructive criticism. – Dan Allan, Feb 13 '14 at 17:46
Thanks for being supportive. It's a big struggle as a beginner. Many of my questions aren't answered anywhere on the internet, and there aren't many people in my city who know Python. If you look, I have a DataFrame in which one column contains addresses. I need to iterate over each word of each address and replace it with the right one. – Hypothetical Ninja, Feb 14 '14 at 04:01

score 2 · Accepted Answer · answered Feb 13 '14 at 17:46

The replace method is a good fit for this. Instead of putting the incorrect/correct mapping into two columns of a DataFrame, use a Series whose index holds the misspelled words and whose values hold the corrections.

from pandas import Series
# index = misspelled words, values = their corrections
corrections = Series(correct_spellings, index=incorrect_spellings)
new_data_corrected = new_data.replace(corrections)

Here's a simple example. I'm using letters for simplicity; of course it would work the same with words.

In [10]: new_data
Out[10]: 
0    a
1    b
2    c
dtype: object

In [11]: corrections
Out[11]: 
c    C
b    B
dtype: object

In [12]: new_data.replace(corrections)
Out[12]: 
0    a
1    B
2    C
dtype: object
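
For a version that can be run end to end, here is a minimal sketch that builds the mapping from a reference table shaped like the one in the question. The column names 'word' and 'final' come from the question; the sample rows and the one-word-per-row new_data are made up for illustration. It uses a plain dict as the mapping, which replace accepts in the same way as a Series.

```python
import pandas as pd

# Hypothetical reference table: column names follow the question,
# the rows themselves are invented for this example.
ref_data = pd.DataFrame({
    "word": ["adress", "stret", "citty"],
    "final": ["address", "street", "city"],
})

# Hypothetical input with one word per row.
new_data = pd.Series(["adress", "road", "citty"])

# Build a misspelling -> correction mapping from the two columns.
corrections = dict(zip(ref_data["word"], ref_data["final"]))

# Whole-value replacement: only cells equal to a key are changed.
new_data_corrected = new_data.replace(corrections)
print(new_data_corrected.tolist())
```

Note that this replaces whole cell values only; a cell holding a full sentence is left alone unless the entire sentence equals one of the keys.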

answered Feb 13 '14 at 17:46 by Dan Allan

Will it work with sentences? My data contains sentences, so I will need to iterate over each word of a sentence to replace it. Your approach is awesome. Can it be extended to handle more complexity? I'll upload an image of my data if you want. – Hypothetical Ninja Feb 14 '14 at 03:58
That's a little too complex to discuss without something concrete to work with. I suggest you make some short toy examples and open a new question. Or see if [this old answer of mine](http://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-columns/17116976#17116976) gets you close enough. You'll have to split your sentences into columns of words and then apply. – Dan Allan Feb 14 '14 at 19:27
My approach was wrong. I used it as a function instead, to process in real time. Thanks, bro. – Hypothetical Ninja Feb 15 '14 at 04:13
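
For the sentence case raised in these comments, one possible sketch (not the approach from the linked answer) is to split each sentence on whitespace, look every word up in the mapping, and join the result back together. The helper name correct_sentence, the sample mapping, and the sample sentences are all hypothetical.

```python
import pandas as pd

# Hypothetical misspelling -> correction mapping.
corrections = {"stret": "street", "citty": "city"}


def correct_sentence(sentence, mapping):
    """Replace each whitespace-separated word found in mapping."""
    # dict.get falls back to the original word when there is no entry.
    return " ".join(mapping.get(word, word) for word in sentence.split())


# Hypothetical column with one sentence per row.
new_data = pd.Series(["12 main stret", "old citty road", "no errors here"])

# Apply the helper to every row of the column.
new_data_corrected = new_data.apply(lambda s: correct_sentence(s, corrections))
print(new_data_corrected.tolist())
```

This matches words exactly, so a word with attached punctuation or different capitalisation (for example "Stret,") would not be corrected, and runs of whitespace are collapsed to single spaces by split and join.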
