
I have a data frame as below.

ID  Word       Synonyms
------------------------
1   drove      drive
2   office     downtown
3   everyday   daily
4   day        daily
5   work       downtown

I'm reading a sentence and would like to replace words in that sentence with their synonyms as defined above. Here is my code:

import nltk
import pandas as pd
import string

sdf = pd.read_excel(r'C:\synonyms.xlsx')  # raw string so the backslash is not read as an escape
sd = sdf.apply(lambda x: x.astype(str).str.lower())
words = 'i drove to office everyday in my car'

#######

def tokenize(text):
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    synonym = synonyms(tokens)
    return synonym

def synonyms(words):
    for word in words:
        if(sd[sd['Word'] == word].index.tolist()):
            idx = sd[sd['Word'] == word].index.tolist()
            word = sd.loc[idx]['Synonyms'].item()
        else:
            word
    return word

print(tokenize(words))

The code above tokenizes the input sentence. I would like to achieve the following output:

In: i drove to office everyday in my car
Out: i drive to downtown daily in my car

But the output I get is:

Out: car

If I skip the synonyms function, then my output has no issues and is split into individual words. I am trying to understand what I'm doing wrong in the synonyms function. Also, please advise if there is a better solution to this problem.
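
A note on the symptom: `synonyms` reassigns the loop variable `word` on every pass and returns it only after the loop has finished, so only the last token (`car`) survives. A minimal sketch of a fix that collects every token instead, assuming the lookup is held in a plain dict and that `str.split` stands in for `nltk.word_tokenize` (`SYNONYMS` and `replace_words` are illustrative names, not part of the code above):

```python
import string

# Illustrative lookup table; in the question it is loaded from the spreadsheet.
SYNONYMS = {
    "drove": "drive",
    "office": "downtown",
    "everyday": "daily",
    "day": "daily",
    "work": "downtown",
}


def replace_words(text, mapping):
    """Return text with every token that has a synonym replaced by it."""
    # Drop punctuation, as the original tokenize() does.
    cleaned = "".join(ch for ch in text if ch not in string.punctuation)
    # str.split() stands in for nltk.word_tokenize (an assumption of this sketch).
    tokens = cleaned.lower().split()
    # Build a new list instead of overwriting a single variable inside the loop.
    replaced = [mapping.get(token, token) for token in tokens]
    return " ".join(replaced)


result = replace_words("i drove to office everyday in my car", SYNONYMS)
print(result)
```

The dict lookup with `mapping.get(token, token)` keeps tokens that have no synonym unchanged, which is the fallback the empty `else` branch was meant to provide.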

RData

2 Answers


I would take advantage of Pandas/NumPy indexing. Since your synonym mapping is many-to-one, you can re-index using the Word column.

sd = sd.applymap(str.strip).applymap(str.lower).set_index('Word').Synonyms
print(sd)
Word
drove          drive
office      downtown
everyday       daily
day            daily
Name: Synonyms, dtype: object

Then, you can easily align a list of tokens to their respective synonyms.

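words = nltk.word_tokenize(u'i drove to office everyday in my car')
# reindex yields NaN for tokens missing from the index; sd[words] raises KeyError in recent pandas
sentence = sd.reindex(words).reset_index()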
print(sentence)
       Word  Synonyms
0         i       NaN
1     drove     drive
2        to       NaN
3    office  downtown
4  everyday     daily
5        in       NaN
6        my       NaN
7       car       NaN

What remains is to take the token from Synonyms where one exists and fall back to Word otherwise:

sentence = sentence.Synonyms.fillna(sentence.Word)
print(sentence.values)
[u'i' 'drive' u'to' 'downtown' 'daily' u'in' u'my' u'car']
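
A self-contained sketch of this approach, assuming an inline frame in place of the spreadsheet and `str.split` in place of `nltk.word_tokenize`. It uses `Series.reindex`, which yields NaN for tokens missing from the index; recent pandas versions raise `KeyError` when a list containing missing labels is passed to `[]`. The names `lookup`, `tokens`, and `replaced` are illustrative.

```python
import pandas as pd

# Inline stand-in for the spreadsheet (rows copied from the question).
sdf = pd.DataFrame({
    "Word": ["drove", "office", "everyday", "day", "work"],
    "Synonyms": ["drive", "downtown", "daily", "daily", "downtown"],
})

# Word -> synonym lookup; the Word index must be unique for reindex to work.
lookup = sdf.set_index("Word")["Synonyms"]

# str.split() stands in for nltk.word_tokenize (an assumption of this sketch).
tokens = "i drove to office everyday in my car to work".split()

# Align the tokens to the lookup; tokens without a synonym become NaN.
aligned = lookup.reindex(tokens).reset_index(drop=True)

# Fall back to the original token wherever no synonym was found.
replaced = aligned.fillna(pd.Series(tokens))

sentence = " ".join(replaced)
print(sentence)
```

Dropping the index before `fillna` keeps both series on the same positional index, so repeated tokens such as `to` do not cause alignment problems.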
Igor Raush
  • This seems to have an issue with a many-to-one mapping. I have updated the synonym table in the question and changed my input to "i drove to office everyday in my car to work"; the last word, "work", is not replaced. – RData Jan 24 '17 at 20:02
  • @Rohit are you sure? I just checked, it works for me. – Igor Raush Jan 24 '17 at 20:09
  • yea , here is my print Word drove drive office downtown everyday daily rattle rat rattles rat mold molding bumps bump bumpy bump work downtown --------- Word Synonyms 0 i NaN 1 drove drive 2 to NaN 3 office downtown 4 everyday daily 5 in NaN 6 my NaN 7 car NaN 8 to NaN 9 work NaN ['i' 'drive' 'to' 'downtown' 'daily' 'in' 'my' 'car' 'to' 'work'] – RData Jan 24 '17 at 20:10
  • here is output i see : ['i' 'drive' 'to' 'downtown' 'daily' 'in' 'my' 'car' 'to' 'work'] – RData Jan 24 '17 at 20:12
  • What's your version of Pandas/Python? also, please post a full, runnable example on [pastebin](http://pastebin.com/), the output in your comment is illegible. – Igor Raush Jan 24 '17 at 20:35
  • The problem could lie in the way you are loading your synonyms. I am creating the data frame inline using `pd.read_csv`. – Igor Raush Jan 24 '17 at 20:38
0
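import re
import pandas as pd

sdf = pd.read_excel(r'C:\synonyms.xlsx')  # raw string keeps the backslash literal
rep = dict(zip(sdf.Word, sdf.Synonyms))  # map each word to its synonym

words = "i drove to office everyday in my car"
# Escape the keys so they are matched literally, then join them into one alternation.
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
# Replace every match with its synonym; the lookup key is escaped the same way.
result = pattern.sub(lambda m: rep[re.escape(m.group(0))], words)

print(result)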

output

i drive to downtown daily in my car
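
One caveat with a plain alternation pattern: it matches substrings, so a key such as `day` would also be replaced inside `today`. A hedged sketch of a variant that anchors each key on word boundaries (`\b`), assuming the keys are plain words and using an inline dict in place of the spreadsheet (`SYNONYMS` and `replace_whole_words` are illustrative names):

```python
import re

# Illustrative lookup table standing in for the spreadsheet contents.
SYNONYMS = {
    "drove": "drive",
    "office": "downtown",
    "everyday": "daily",
    "day": "daily",
    "work": "downtown",
}


def replace_whole_words(text, mapping):
    """Replace only whole-word occurrences of the mapping's keys."""
    # Longest keys first, so a longer key is tried before a shorter key it starts with.
    keys = sorted(mapping, key=len, reverse=True)
    # \b on both sides restricts matches to whole words.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


result = replace_whole_words("i drove to office everyday in my car today", SYNONYMS)
print(result)
```

Here `today` is left alone because there is no word boundary before its `day`.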

Courtesy: https://stackoverflow.com/a/6117124/6626530

Shijo