I have 2 dataframes:
One that acts as a dictionary with the columns:
- "Score"
- "Translation"
- A number of columns with different variations of the word
Another one with a single column: "Sentence"
The goal is to:
- split the sentences into words
- look up each word in the dictionary (it can appear in any of the variant columns) and return its score
- use the score of the highest-scoring word as the "sentence score"
import pandas as pd

# One sentence per row
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
# One row per dictionary entry: the score, then the variations of the word
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
Desired output:
   Sentence            Score
0  "I run"             20
1  "he walks"          30
2  "we run and walk"   "error: 'we', 'and' not found"
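For reference, the three steps listed above can be sketched without an explicit Python loop. This is a minimal sketch, not a definitive solution: it assumes words are separated by whitespace, matching is case-sensitive, and a word that appears more than once in the dictionary takes the score of its first occurrence. The names `word_scores`, `words`, `scores` and the `NotFound` column are made up for illustration. Unlike the desired output, it keeps the score and the missing words in separate columns.

```python
import pandas as pd

df_sentences = pd.DataFrame(
    [["I run"], ["he walks"], ["we run and walk"]],
    columns=["Sentence"],
)
df_dictionary = pd.DataFrame(
    [[10, "I", "you", "he"],
     [20, "running", "runs", "run"],
     [30, "walking", "walk", "walks"]],
    columns=["score", "variantA", "variantB", "variantC"],
)

# Long format: one row per (score, word), turned into a word -> score Series.
word_scores = (
    df_dictionary.melt(id_vars="score", value_name="word")
    .drop_duplicates("word")  # assumption: the first occurrence of a word wins
    .set_index("word")["score"]
)

# One row per word; explode() repeats the index of the originating sentence.
words = df_sentences["Sentence"].str.split().explode()
scores = words.map(word_scores)  # NaN where the word is not in the dictionary

# Highest word score per sentence (max() skips the NaN entries).
df_sentences["Score"] = scores.groupby(level=0).max()

# Words without a dictionary entry, collected per sentence (NaN if none).
df_sentences["NotFound"] = words[scores.isna()].groupby(level=0).agg(list)

print(df_sentences)
```

Because `explode()` preserves the sentence index, `groupby(level=0)` is enough to bring the per-word results back to one row per sentence.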
I got quite far using for loops and lists, but that is quite slow, so I am looking for an approach that lets me do all or most of this within the pandas dataframe.
This is how I did it with a for loop:
not_found = []  # words that have no dictionary entry
for sentence in textaslist[:1]:
    split = split_into_words(sentence)  # call once instead of twice
    words = split[0]   # list of words
    length = split[1]  # number of words
    if minsentencelength <= length <= maxsentencelength:  # filter out short and long sentences
        for word in words:
            score = LookupInDictionary.lookup(word, mydictionary)
            if score is not None:
                do_something()
            else:
                print(word, "not found in dictionary list")
                not_found.append(word)  # add word to the not-found list
print("The following words were not found in the dictionary:", not_found)
using this lookup function:
def lookup(word, df):
    # Rows in which any column contains the word
    matches = (df == word).any(axis=1)
    if matches.any():  # check whether the dictionary contains the word
        print(word, "was found in the dictionary")
        # take the score of the first matching row only
        return df.loc[matches, 'score'].iloc[0]
    return None  # word is not in the dictionary
The problem is that the pandas "merge" function requires me to specify which columns to match on with the left_on and right_on parameters. I cannot find an efficient way to search the whole dictionary dataframe and return the score from the first column.
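One possible way around the single-column restriction, shown here only as a sketch: reshape the dictionary with `melt` so that every variant ends up in one column, and merge on that column. The `words` frame and the names `long_dict`, `merged` and `not_found` are hypothetical and only serve the example. `indicator=True` adds a `_merge` column that marks the words without a match.

```python
import pandas as pd

df_dictionary = pd.DataFrame(
    [[10, "I", "you", "he"],
     [20, "running", "runs", "run"],
     [30, "walking", "walk", "walks"]],
    columns=["score", "variantA", "variantB", "variantC"],
)

# Hypothetical input: one word per row, e.g. the result of splitting sentences.
words = pd.DataFrame({"word": ["he", "walks", "and"]})

# Collapse variantA/B/C into a single "word" column next to its score.
long_dict = df_dictionary.melt(id_vars="score", value_name="word")[["word", "score"]]

# A left merge keeps every input word; unmatched words get a NaN score.
merged = words.merge(long_dict, on="word", how="left", indicator=True)

# Words that exist only on the left side were not found in the dictionary.
not_found = merged.loc[merged["_merge"] == "left_only", "word"].tolist()

print(merged)
print("Not found:", not_found)
```

If a word occurs in several dictionary rows, this merge returns one row per match, so duplicates in `long_dict` would need to be dropped first to keep only the first occurrence.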