
I am trying to select segments (clauses) of sentences based on the word pairs with which the segments should start. For example, I am interested in sentence segments that start with "what does" or "what is", etc.

To do this, I loop over two DataFrames, using an if statement inside a for loop as shown below. The first, df1['Sentence'], contains the sentences. The other, df2['First2'], contains the pairs of starting words. However, the function seems to test only the first word pair: after the first item, it does not return to the for loop. My code seems to work when I pass lists to it, but not when I pass DataFrames. I have tried the solutions mentioned in "Pythonic way to combine FOR loop and IF statement", but they do not work for my DataFrames. I would love to know how to solve this.

DataFrames:

   'Sentence'                                   'First2'     
0  If this is a string what does it say?      0  what does    
1  And this is a string, should it say more?  1  should it    
2  This is yet another string.                2

My code looks as follows:

import pandas as pd    
a = df1['Sentence']
b = df2['First2'] 

#The function seems to loop over all r's but not over all b's:
def func(r): 
    for i in b:
        if i in r:
            # The following line selects the sentence segment that starts with 
            # the words in `First2`, up to the end of the sentence.
            q = r[r.index(i):] 
            return q
        else:
            return ''

df1['Clauses'] = a.apply(func)

This is the result:

what does it say?

This is correct but incomplete. The code seems to loop over all r's but not over all b's. How can I get the desired result shown below?

what does it say?
should it say more?
twhale
  • Using `if i in r:` inside the `for i in b:` sets `i` to one value then changes it to another - try different variable names? – doctorlove Apr 30 '18 at 09:18
  • You always return in the first iteration of the loop, either with `q` or the empty string. You never see the second element of `b`. –  Apr 30 '18 at 09:19
  • @doctorlove: But what I want to know is whether `i` is in both `b` AND `r` ... Would that work if I change `i` to, for example, `x` in one of the statements? – twhale Apr 30 '18 at 09:23
  • @Evert: Yes, that seems right. But how to change that? – twhale Apr 30 '18 at 09:23
  • @Evert: Have corrected your point in the question. – twhale Apr 30 '18 at 09:30
  • As far as I can see, you still exit within the first iteration: the if-else statement returns on either branch, thus never letting the loop go to its next iteration. –  Apr 30 '18 at 10:07
  • @Evert: Yes, but what you point out is the key problem. I do not know how to solve that. What I meant by 'corrected your point' is that I have updated the question to say that the function loops over all `r`'s but not all `b`'s (I had written that incorrectly). – twhale Apr 30 '18 at 10:11
  • @Goyo: The contents of both DataFrames, `df1['Sentence']` and `df2['First2']`, are shown at the top of my question. There are no other DataFrames. – twhale Apr 30 '18 at 10:13
  • @twhale To solve that do not return until the `for` loop is completed. – Stop harming Monica Apr 30 '18 at 10:20
  • @Goyo I am relatively new to programming and understand what you say in principle. But can you show what to change in my code to implement your suggestion? – twhale Apr 30 '18 at 10:29
  • I have found the error in my code and posted the solution below. It implements your suggestions. – twhale May 01 '18 at 04:50
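The point made in the comments above (both branches of the if-else return, so the loop never reaches its second item) can be reproduced without pandas. The following is a minimal sketch; the function names `first_only` and `any_pair` and the plain lists are illustrative assumptions, not part of the original code, and the empty third entry of `First2` is assumed to have been dropped.

```python
# Hypothetical stand-ins for df1['Sentence'] and df2['First2'] (empty entry dropped).
sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
pairs = ["what does", "should it"]


def first_only(sentence, pairs):
    """Mirrors the question's logic: both branches return inside the loop."""
    for pair in pairs:
        if pair in sentence:
            return sentence[sentence.index(pair):]
        else:
            # Reached on the first miss, so pairs[1:] are never tested.
            return ""


def any_pair(sentence, pairs):
    """Returns the empty string only after every pair has been tried."""
    for pair in pairs:
        if pair in sentence:
            return sentence[sentence.index(pair):]
    return ""


before = [first_only(s, pairs) for s in sentences]
after = [any_pair(s, pairs) for s in sentences]
print(before)
print(after)
```

The only difference between the two functions is where the fallback `return ""` sits: inside the loop it fires on the first non-matching pair, after the loop it fires only when no pair matched.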

2 Answers


I'm not sure if I'm getting this right, but it looks like you want to store all the phrases from 'First2' (in say a variable s), and have a column 'Clauses' that is the remainder of the string after any match with any of the phrases contained in s.

There's probably a more efficient method, but here's a hacky way to do this with regular expressions:

# build the capturing string
s = '(' + '|'.join(df.First2[df.First2 != ''].values + '.*') + ')'
# use the pandas Series.str method to extract, and assign to new column
df['Clauses'] = df.Sentence.str.extract(s, expand = False)
Ken Wei
  • I would like to have a column `Clauses` that contains the sentence segments that begin with the First2 words, all the way to the end of the sentence. The segment needs to include the First2 words also. I have updated the comments in the code to hopefully make this more clear. – twhale Apr 30 '18 at 09:51
  • I think what your answer does is select any sentence segment that contains the words in `First2`, leaving the words themselves out. But I want them in and also, I want the selection to be case sensitive. So if `First2` is in lowercase, only segments should be selected in which the words are lowercase. – twhale Apr 30 '18 at 09:58
  • Have you actually tried running the code in my answer? – Ken Wei May 01 '18 at 04:43
  • Yes of course. But it did not give the desired result. I have posted the solution below. There was a mistake in my original code. – twhale May 01 '18 at 04:48
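For readers following the exchange above, here is a self-contained sketch of the regex approach, assuming the two-DataFrame layout from the question rather than the single `df` used in this answer. The sample data, the use of `re.escape`, and the non-capturing inner group are assumptions added for illustration. It suggests that the extracted segment includes the `First2` words and that matching is case sensitive by default.

```python
import re

import pandas as pd

# Sample data reconstructed from the question; the third First2 entry is empty.
df1 = pd.DataFrame({"Sentence": [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]})
df2 = pd.DataFrame({"First2": ["what does", "should it", ""]})

# Drop empty entries (an empty alternative would match everywhere) and
# escape each pair so regex metacharacters are treated literally.
pairs = [p for p in df2["First2"] if p]
pattern = "((?:" + "|".join(re.escape(p) for p in pairs) + ").*)"

# One capture group plus expand=False yields a Series; rows without a match are NaN.
df1["Clauses"] = df1["Sentence"].str.extract(pattern, expand=False)
print(df1["Clauses"])
```

One difference from the loop-based approach: the regex returns the segment starting at the leftmost match in the sentence, whereas the loop returns the segment for the first pair in `First2` order that occurs anywhere in the sentence.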

This code answers my question:

import pandas as pd    
a = df1['Sentence']
b = df2['First2'] 

def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''

df1['Segments'] = a.apply(func)

It was pointed out by Daming Lu here: "How to select sub-strings based on the presence of word pairs? Python". Hope this helps others.
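To try this end to end, the following sketch builds the two DataFrames from the sample data in the question and applies the same idea. The function name `segment_from_pair`, the use of `str.find`, and the filtering of empty `First2` entries are assumptions added here; an empty string is contained in every sentence, so leaving it in would return the whole sentence for rows that match nothing else.

```python
import pandas as pd

# Sample data reconstructed from the question.
df1 = pd.DataFrame({"Sentence": [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]})
df2 = pd.DataFrame({"First2": ["what does", "should it", ""]})

# Keep only non-empty word pairs.
pairs = [p for p in df2["First2"] if p]


def segment_from_pair(sentence):
    """Return the part of `sentence` starting at the first listed pair it contains."""
    for pair in pairs:
        start = sentence.find(pair)  # -1 when the pair is absent
        if start != -1:
            return sentence[start:]
    return ""  # no pair matched after trying all of them


df1["Segments"] = df1["Sentence"].apply(segment_from_pair)
print(df1["Segments"].tolist())
```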

twhale