1

I have 2 dataframes:

One that acts as a dictionary with the columns:

  • "Score"
  • "Translation"
  • A number of columns with different variations of the word

Another one with one column: "sentences"

The goal is to:

  • split the sentences into words
  • lookup the words in the dictionary (in different columns) and return the score
  • give the score of the word with the highest score as a "sentence score"
df_sentences = pd.DataFrame([["I run"], 
    ["he walks"], 
    ["we run and walk"]], 
    columns=['Sentence'])

df_dictionary = pd.DataFrame([[10, "I", "you", "he"], 
    [20, "running", "runs", "run"], 
    [30, "walking", "walk", "walks"]], 
    columns=['score', 'variantA', 'variantB', 'variantC'])

Out[1]: 
   Sentence           Score
0  "I run"             30
1  "he walks"          40
2  "we run and walk"   "error 'and' not found"

I got quite far using for loops and lists, but that is quite slow and so I am looking for a way of working that let me do all/most of this within the pandas dataframe.

This is how I did it with a for loop:

for sentence in textaslist[:1]:
words = split_into_words(sentence)[0] # returns list of words
length = split_into_words(sentence)[1] #returns number of words
if minsentencelength <= length <= maxsentencelength: # filter out short and long sentences                                                     
    for word in words:
        score = LookupInDictionary.lookup(word, mydictionary)
        if str(score) != "None":
            do_something()
        else:
            print(word, " not found in dictionary list")
            not_found.append(word)      # Add word to not found list     
                                                   
print("The following words were not found in the dictionary: ", not_found)

using

def lookup(word, df):
if word in df.values:                                                       # Check if the dictionary contains the word
    print(word,"was found in the dictionary")
    lookupreturn = df.loc[df.values == word,'score']                         # find the score of each word (first column)
    score = lookupreturn.values[0]                                           # take only the first instance of the word in the dictionary
    return(bare)   

The problem is that when I use the pandas "merge" function, I need to specify in which column to look with the right_on left_on parameters and I can not seem to find how to search in the whole dictionary dataframe and return the first column with the score in an efficient way

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • 1
    Please provide a small set of sample data as text that we can copy and paste. Include the corresponding desired result. Check out the guide on [how to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/3620003). – timgeb Aug 07 '20 at 15:17
  • 1
    I added some sample data, hope it is more clear now :-) – the other guy Aug 07 '20 at 15:47

3 Answers3

2

if you modify your dictionary in this format [word, score] then you can split the the sentences by word, merge with the dictionary then groupby and sum the score.
Since this uses pandas functions it should be fast enough for your dataset, not sure if it's possible to make it faster than this.

tl;dr

df_sentences = pd.DataFrame([["I run"], 
    ["he walks"], 
    ["we run and walk"]], 
    columns=['Sentence'])

df_dictionary = pd.DataFrame([[10, "I", "you", "he"], 
    [20, "running", "runs", "run"], 
    [30, "walking", "walk", "walks"]], 
    columns=['score', 'variantA', 'variantB', 'variantC'])

df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]

df_sentences['words'] = df_sentences['Sentence'].str.split()
df_sentences = df_sentences.explode('words')

sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]

sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()
# or
sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()

Break down

to modify the dictionary into [word, score] format you can use melt like so

df_dictionary = pd.DataFrame([[10, "I", "you", "he"], 
    [20, "running", "runs", "run"], 
    [30, "walking", "walk", "walks"]], 
    columns=['score', 'variantA', 'variantB', 'variantC'])
df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]

which will give you

     value  score
0        I     10
1  running     20
2  walking     30
3      you     10
4     runs     20
5     walk     30
6       he     10
7      run     20
8    walks     30

Now to work with Sentences, we want to be able to extract each word on it's own while keeping track of the main sentence. let's add a new column that contains the words as a list

df_sentences = pd.DataFrame([["I run"], 
    ["he walks"], 
    ["we run and walk"]], 
    columns=['Sentence'])

df_sentences['words'] = df_sentences['Sentence'].str.split()

which would give us

          Sentence                 words
0            I run              [I, run]
1         he walks           [he, walks]
2  we run and walk  [we, run, and, walk]

Then explode the words

df_sentences = df_sentences.explode('words')

which give you

          Sentence  words
0            I run      I
0            I run    run
1         he walks     he
1         he walks  walks
2  we run and walk     we
2  we run and walk    run
2  we run and walk    and
2  we run and walk   walk

now we merge it all together

sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]

giving us

          Sentence  score
0            I run   10.0
1            I run   20.0
2         he walks   10.0
3         he walks   30.0
4  we run and walk    NaN
5  we run and walk   20.0
6  we run and walk    NaN
7  we run and walk   30.0

Now we can combine groupby with sum to sum by score for each sentence

Note that pandas would treat NaN as 0.0 which we don't want, thus we are filling na using fillna to the string of "NaN".

sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()

giving you

                 score
Sentence
I run             30.0
he walks          40.0
we run and walk    NaN

Your question said you wanted to give the highest word score to the sentence but your expected output was showing the sum, if you need the max score then it's a simple

sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()

giving you

                 score
Sentence
I run             20.0
he walks          30.0
we run and walk    NaN

Note: This solution relies on having UNIQUE sentences, if you have duplicate sentences you would either drop_duplicates before starting or you could apply reset_index(drop=False) before the merge to keep the index and then groupby on the old index instead of the Sentence.

Jimmar
  • 4,194
  • 2
  • 28
  • 43
  • are you talking about filtering the dataframe and then appending the columns at the bottom so that I essentially get 1 huge list with 2 columns like in @kentaro s example? That might be a good approach indeed. Some example code how to do the split and append would be nice :) – the other guy Aug 07 '20 at 18:40
  • 1
    yes, look into the `melt` method it might help. also if you want an example I can probably come up with something – Jimmar Aug 07 '20 at 18:44
  • 1
    @theotherguy I've updated my answer with code and a breakdown of how it works. Note that this would give you a `NaN` for when a word is missing and wouldn't tell you which words were missing, but you can get those missing words from the intermediate steps right after the merge – Jimmar Aug 08 '20 at 18:19
  • 1
    Wow, thanks a lot for the breakdown! This is exactly what I was looking for. My hat off to you good sir – the other guy Aug 08 '20 at 21:27
1

I would use the following regex approach:

# Store scores using index
score_dict = {x:0 for x in df_sentences.index}

# Loop through each row in the score df (df_dictionary):
for row in df_dictionary.values:
  # Access the score
  score = row[0]
  # Access the words & convert to a pattern
  words = "|".join([re.escape(x) for x in row[1:]])
  pattern = re.compile(r"\b(" + words + r")\b", re.I|re.M)
  
  # Loop through each row in the main df (df_sentences):
  for idx, row in df_sentences.iterrows():
    # Find the number of matches in the sentence
    matches = pattern.findall(row["Sentence"])
    # Multiply to get the score
    n_score = len(matches) * score
    # Store it using the index as key
    score_dict[idx] += n_score

# Now, add the dict as a column (or map it back to the main df) 
df_sentences["score"] = df_sentences.index.map(score_dict)

    Sentence    score
0   I run   30
1   he walks    40
2   we run and walk 50
Yaakov Bressler
  • 9,056
  • 2
  • 45
  • 69
  • Thanks for the help, but my data is huge and with for loops it takes ages to run over all the iterations. – the other guy Aug 07 '20 at 18:44
  • You could try to convert to convert to a function and map `df_dictionary` to the function while applying to `df_sentences`. Frankly, what you're trying to do is a compute intensive problem. – Yaakov Bressler Aug 07 '20 at 19:26
  • Unless you don't care about multiple words, in which case, you can do `df_sentences.Sentences.str.replace(word, n)` replacing `word` and `n` for words and scores.... – Yaakov Bressler Aug 07 '20 at 19:28
1

Do you care about duplicates? If I had a string like "I I I," that'd technically be 30 points.

Also, is there a particular reason you're using a dataframe to store your scored words?

Quick & dirty using set intersections to efficiently remove duplicates:

dictionary ={
"I": 10, "you": 10, "he": 10, 
"running": 20, "runs": 20, "run": 20, 
"walking": 30, "walk": 30, "walks": 30
}

df = pd.DataFrame({
    "Sentences":[
        "I run and he walks",
        "We walk and he runs",
        "I Run you run he runs",
        "I run he runs",
        "I I I I I"
    ]})

def split_score(sentence):
    x = sentence.split(' ')
    x = set(x) # remove duplicate words
    y = x.intersection(set(dictionary.keys())) # find matches in the dictionary
    z = x.difference(set(dictionary.keys())) # find words outside the dictionary
    if len(z) > 0:
        score = -1 # If non-dictionary words are found, fail
    elif len(z) == 0:
        score = sum([dictionary[word] for word in y])
    return score

df['Points'] = df['Sentences'].apply(lambda x: split_score(x))
df

enter image description here

Kentaro
  • 145
  • 8
  • Also, do you WANT it to fail if there's a word that isn't in the dictionary? – Kentaro Aug 07 '20 at 17:09
  • The reason for using dataframes is that the dictionary is a huge csv file. But the way you do it and like @Jimmar also noted, combining the columns with the variations in one column/dictionary/list might be a good approach. Now I just need to find out how to get that done :-D And yes, I want it to fail if a word is not found in the dictionary. It should then be added to the "not found" list – the other guy Aug 07 '20 at 18:38
  • 1
    note that `apply` is a bit slow, look into using pandarallel to do it in parallel and speed it up. – Jimmar Aug 07 '20 at 18:46
  • Will "help" will trigger "he" in this solution? – Yaakov Bressler Aug 07 '20 at 19:24
  • This isn't using regex, it's comparing entire strings to one another. One confounding factor would be punctuation, though. If there are any commas or periods, this will fail. I updated my answer to include the failure case. Also I set the failure case to -1 to ensure you don't convert the whole column into pandas objects or strings. This will make it easier to handle the result later on. – Kentaro Aug 07 '20 at 20:22