How to check if words are in dictionary in pandas dataframe

Question

Let's say I have a dataframe with a column of sentences:

     data['sentence']

0    i like to move it move it
1    i like to move ir move it
2    you like to move it
3    i liketo move it move it
4    i like to moveit move it
5    ye like to move it

And I want to check which sentences have words outside of a dictionary, like

     data['sentence']                OOV

0    i like to move it move it      False
1    i like to move ir move it      False
2    you like to move it            False
3    i liketo move it move it       True
4    i like to moveit move it       True
5    ye like to move it             True

Right now I have to iterate over every row doing:


data['OOV'] = False  # out of vocabulary

for i, row in data.iterrows():
    # split the current row's sentence, not the whole column
    words = set(row['sentence'].split())
    for word in words:
        if word not in dictionary:
            data.at[i, 'OOV'] = True
            break

Is there a way to vectorize (or speed up) this task?

check this question out: https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c — Kusal Hettiarachchi, Aug 27 '21 at 17:45
Does this answer your question? [Performance of Pandas apply vs np.vectorize to create new column from existing columns](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c) — Kusal Hettiarachchi, Aug 27 '21 at 17:46
The dictionary is just a list of valid words. ```dictionary = ['i','like','move',....]``` — Bernardo Henz, Aug 27 '21 at 19:45

score 2 · Accepted Answer · answered Aug 27 '21 at 18:01

Your requirement is not clear without knowing the contents of the dictionary (which I imagine is really a list in the Python sense).

Still, assuming the reference words are those in "i like to move it", here is how to flag rows whose sentence contains words outside the dictionary:

dictionary = set(['i', 'like', 'to', 'move', 'it'])
df['OOV'] = df['data'].str.split(' ').apply(lambda x: not set(x).issubset(dictionary))

# only for illustration:
df['words'] = df['data'].str.split(' ').apply(set)
df['words_outside'] = df['data'].str.split(' ').apply(lambda x: set(x).difference(dictionary))

output:

                        data    OOV                            words words_outside
0  i like to move it move it  False          {like, to, it, i, move}            {}
1  i like to move ir move it   True      {like, to, it, i, move, ir}          {ir}
2        you like to move it   True        {move, like, to, it, you}         {you}
3   i liketo move it move it   True            {liketo, it, move, i}      {liketo}
4   i like to moveit move it   True  {like, to, it, i, move, moveit}      {moveit}
5         ye like to move it   True         {like, to, it, move, ye}          {ye}
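
For reference, a self-contained version of the same idea might look like the sketch below. It assumes the five-word vocabulary used in this answer and the column name `data`, and it assumes the column has no missing values; the real dictionary from the question is not shown, so treat the word list as a placeholder.

```python
import pandas as pd

# Sample sentences copied from the question.
df = pd.DataFrame({
    "data": [
        "i like to move it move it",
        "i like to move ir move it",
        "you like to move it",
        "i liketo move it move it",
        "i like to moveit move it",
        "ye like to move it",
    ]
})

# Assumed vocabulary (placeholder for the asker's real word list).
# A set gives constant-time membership tests, unlike a list.
dictionary = frozenset({"i", "like", "to", "move", "it"})

# Split once on whitespace and reuse the token lists.
tokens = df["data"].str.split()

# A row is out of vocabulary when the dictionary does not cover all its words.
df["OOV"] = tokens.apply(lambda words: not dictionary.issuperset(words))

# Only for illustration: the offending words per row, sorted for stable display.
df["words_outside"] = tokens.apply(lambda words: sorted(set(words) - dictionary))

print(df)
```

Calling `str.split()` without an argument splits on any run of whitespace, so doubled spaces do not produce empty tokens that would be flagged as unknown words.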

score 1 · Answer 2 · answered Aug 27 '21 at 17:47


Since I do not have the complete context of the dictionary and other details, I would suggest using df.apply(operation); it is often faster than iterating over the rows.

pandas.DataFrame.apply
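
A minimal sketch of what that could look like, assuming the column is named `sentence` as in the question and using a placeholder five-word vocabulary (the real dictionary is not shown). The helper `has_oov` is a hypothetical name introduced here for illustration.

```python
import pandas as pd


def has_oov(sentence, vocabulary):
    """Return True if any whitespace-separated word is missing from vocabulary."""
    return any(word not in vocabulary for word in sentence.split())


data = pd.DataFrame({
    "sentence": [
        "i like to move it move it",
        "i like to move ir move it",
        "you like to move it",
        "i liketo move it move it",
        "i like to moveit move it",
        "ye like to move it",
    ]
})

# Assumed vocabulary; converting a list of valid words to a set
# makes each membership test constant time.
dictionary = set(["i", "like", "to", "move", "it"])

# Series.apply calls has_oov once per sentence; extra positional
# arguments are forwarded through args.
data["OOV"] = data["sentence"].apply(has_oov, args=(dictionary,))

# Equivalent row-wise form with DataFrame.apply (axis=1 passes each row).
row_wise = data.apply(lambda row: has_oov(row["sentence"], dictionary), axis=1)

print(data)
```

Both forms still run a Python function per row, so this is not true vectorization, but it avoids the overhead of `iterrows` and of writing values back one cell at a time.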

Agrover112

461
5
9

How to check if words are in dictionary in pandas dataframe

2 Answers2