I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially dropping every word that appears before the word pair). Both the sentences and the word pairs are stored in a DataFrame:

   'Sentence'                                   'First2'
0  If this is a string what does it say?        can I
1  And this is a string, should it say more?    should it
2  This is yet another string.                  what does
3  etc. etc.                                    etc. etc.

The result I want from the above example would be:

0 what does it say?
1 should it say more?
2

The most obvious solution (at least to me) below does not work. It only uses the first word-pair b to go over all the sentences r, but not the other b's.

a = df['Sentence']
b = df['First2'] 

# The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''

df['Segments'] = a.apply(func)

It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?

twhale
  • Accumulate `s` in a container and don't `return` till the loop completes. There is a duplicate Q&A for this somewhere. – wwii Apr 30 '18 at 18:44
  • Possible duplicate of [python for-loop only executes once?](https://stackoverflow.com/questions/41933378/python-for-loop-only-executes-once) – wwii Apr 30 '18 at 18:48
  • Adding a container and not doing `return` until the loop completes works. But the results look strange: `[, , , , , , , , what does that say?,...` It seems the deleted words are replaced by (empty) elements in a list, while the selected text as a whole becomes an element in the list. – twhale Apr 30 '18 at 19:16
  • One question at a time. Refactor and repost if there is a problem you can't solve in the *new* code.. – wwii Apr 30 '18 at 19:17

3 Answers

You can loop over two iterables in parallel with `zip(iterable_a, iterable_b)`.
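
A minimal sketch of what `zip` does here, using plain lists holding the sample data from the question (the names `sentences`, `pairs` and `segments` are illustrative, not from the original code). Note that `zip` pairs items position by position, so each sentence is tested only against the word pair in the same row, which is not the same as testing every sentence against every pair:

```python
sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
pairs = ["can I", "should it", "what does"]

segments = []
for sentence, pair in zip(sentences, pairs):
    # Only the pair in the same position is tested against each sentence.
    if pair in sentence:
        segments.append(sentence[sentence.index(pair):])
    else:
        segments.append("")

print(segments)
```

With this data only the second sentence produces a segment, because "what does" sits in row 2 while the sentence that contains it sits in row 0.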

J. Doe

I believe there is a bug in your code.

else:
    return ''

This means that if the first word pair is not a match, `func` returns immediately without trying the remaining pairs. That is probably why the code only seems to use the first `b`.

A sample working code is below:

# Return the part of the sentence starting at the first word pair found, or '' if none match:
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    return ''

df['Segments'] = a.apply(func)

And the output:

df:   
{   
'First2': ['can I', 'should it', 'what does'],   
'Segments': ['what does it say? ', 'should it say more?', ''],   
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string.  '  ]  
} 
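
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample DataFrame from the question and applies the same fix. It assumes pandas is installed; the name `first_segment` and the helper list `pairs` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": [
        "If this is a string what does it say?",
        "And this is a string, should it say more?",
        "This is yet another string.",
    ],
    "First2": ["can I", "should it", "what does"],
})

pairs = df["First2"].tolist()

def first_segment(sentence):
    # Try every word pair; return only once a match is found.
    for pair in pairs:
        if pair in sentence:
            return sentence[sentence.index(pair):]
    # Reached only after all pairs have been tried without a match.
    return ""

df["Segments"] = df["Sentence"].apply(first_segment)
print(df["Segments"].tolist())
```

The key point is that the fallback `return ""` sits after the loop rather than inside an `else` branch, so the loop gets to try every pair.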
Daming Lu
  • You are right about the bug in my code! When I replace it by the last two lines of your answer (`return s` and `return ''`) I get the desired result. However, `.apply` works in my Python code. I only replace the two return lines, nothing else, and now it works. Your suggested last line does not work in my code. If you could replace it by: `df['Segments'] = a.apply(func)` then I can accept your answer. Thanks! – twhale Apr 30 '18 at 19:35
  • @twhale: modified. Thanks :). Also the `apply` comes from `pandas.DataFrame.apply`. – Daming Lu Apr 30 '18 at 20:47

My question was answered by the following code:

def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''

df['Segments'] = a.apply(func)

The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:

else:
    return ''  

This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question "python for-loop only executes once?", which created other problems, as explained in my response to wwii. (So I am not sure mine really is a duplicate.)
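
On the efficiency part of the question, one possible alternative (not measured here) is to combine all word pairs into a single regular expression, so each sentence is scanned once instead of once per pair. This is a sketch under that assumption; `pairs`, `pattern` and `first_segment` are illustrative names. It also behaves slightly differently from the loop: it cuts at the earliest position where any pair occurs, whereas the loop cuts at the first pair in list order that occurs anywhere in the sentence.

```python
import re

pairs = ["can I", "should it", "what does"]

# re.escape guards against word pairs that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(pair) for pair in pairs))

def first_segment(sentence):
    # Find the earliest occurrence of any word pair in the sentence.
    match = pattern.search(sentence)
    return sentence[match.start():] if match else ""

sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
segments = [first_segment(s) for s in sentences]
print(segments)
```

The same function can be passed to `a.apply(first_segment)` in place of `func`.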

twhale