Let's say I have a dataframe with a column of sentences:

     data['sentence']

0    i like to move it move it
1    i like to move ir move it
2    you like to move it
3    i liketo move it move it
4    i like to moveit move it
5    ye like to move it

And I want to check which sentences have words outside of a dictionary, like

     data['sentence']                OOV

0    i like to move it move it      False
1    i like to move ir move it      False
2    you like to move it            False
3    i liketo move it move it       True
4    i like to moveit move it       True
5    ye like to move it             True

Right now I have to iterate over every row doing:


data['OOV'] = False  # out of vocabulary

for i, row in data.iterrows():
    # split the current row's sentence, not the whole column
    words = set(row['sentence'].split())
    for word in words:
        if word not in dictionary:
            data.at[i, 'OOV'] = True
            break

Is there a way to vectorize (or speed up) this task?

  • what is the dictionary? – mozway Aug 27 '21 at 17:38
  • check this question out: https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c – Kusal Hettiarachchi Aug 27 '21 at 17:45
  • Does this answer your question? [Performance of Pandas apply vs np.vectorize to create new column from existing columns](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c) – Kusal Hettiarachchi Aug 27 '21 at 17:46
  • The dictionary is just a list of valid words. ```dictionary = ['i','like','move',....]``` – Bernardo Henz Aug 27 '21 at 19:45

2 Answers

Your requirement is not clear without knowing the content of the dictionary (which I imagine is a list, in the Python sense, rather than a dict).

Still, assuming the reference words are "i like to move it", here is how to flag rows whose sentence contains words outside the dictionary:

# here the sentences live in a column named 'data'
dictionary = {'i', 'like', 'to', 'move', 'it'}
df['OOV'] = df['data'].str.split(' ').apply(lambda x: not set(x).issubset(dictionary))

# only for illustration:
df['words'] = df['data'].str.split(' ').apply(set)
df['words_outside'] = df['data'].str.split(' ').apply(lambda x: set(x).difference(dictionary))

output:

                        data    OOV                            words words_outside
0  i like to move it move it  False          {like, to, it, i, move}            {}
1  i like to move ir move it   True      {like, to, it, i, move, ir}          {ir}
2        you like to move it   True        {move, like, to, it, you}         {you}
3   i liketo move it move it   True            {liketo, it, move, i}      {liketo}
4   i like to moveit move it   True  {like, to, it, i, move, moveit}      {moveit}
5         ye like to move it   True         {like, to, it, move, ye}          {ye}
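
For reference, here is a minimal self-contained sketch of the same check on the question's sample sentences. It assumes the same five-word vocabulary as above (the asker's real word list is not shown) and the question's column name `sentence`. Since the dictionary is apparently a plain list, it is converted to a `frozenset` once, so each membership test does not scan the list. The name `vocab` is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({"sentence": [
    "i like to move it move it",
    "i like to move ir move it",
    "you like to move it",
    "i liketo move it move it",
    "i like to moveit move it",
    "ye like to move it",
]})

# Assumed vocabulary; the real list from the question is unknown.
dictionary = ["i", "like", "to", "move", "it"]
vocab = frozenset(dictionary)  # build the hash set once, outside the loop

# A sentence is OOV when at least one of its words is missing from vocab.
data["OOV"] = [not vocab.issuperset(s.split()) for s in data["sentence"]]

print(data)
```

A plain list comprehension like this still loops in Python, as `.apply` does, so it is not true vectorization; whether it is faster on a given dataset is worth timing rather than assuming.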
mozway

Since I do not have the complete context of the dictionary and other details, I would suggest using df.apply(operation); it is often faster than iterating over rows with iterrows.

pandas.DataFrame.apply
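
As a rough sketch of that suggestion applied to the question, the check can be written as a small function and passed to `Series.apply` on the sentence column. The helper name `has_oov`, the two sample rows, and the vocabulary below are assumptions made for illustration, not details from the question.

```python
import pandas as pd

def has_oov(sentence, vocab):
    """Return True if any word of the sentence is missing from vocab."""
    return any(word not in vocab for word in sentence.split())

data = pd.DataFrame({"sentence": [
    "i like to move it",
    "ye like to move it",
]})

# Assumed vocabulary, stored as a set for fast membership tests.
vocab = {"i", "like", "to", "move", "it"}

# Extra positional arguments for the function are passed through args.
data["OOV"] = data["sentence"].apply(has_oov, args=(vocab,))

print(data)
```

Note that `apply` still calls the function once per row in Python, so the gain over `iterrows` comes mostly from avoiding per-row Series construction and `data.at` writes.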

Agrover112