Extracting phrases from a sentence

Question

On the one hand there are phrases; on the other hand there are plenty of sentences that should be checked for containing such a phrase, reporting the position of each matched word as (index_start, index_end).

For example,

phrase: "red moon rises"
sentence: "red moon and purple moon are rises"
result: 

1) ["red" (0, 3), "moon" (4, 8), "rises" (29,34)] 
2) ["red" (0, 3), "moon" (20, 24), "rises" (29,34)]

Here, "moon" occurs twice in the sentence, so there are two results.

Another example,

phrase: "Sonic collect rings"
sentence: "Not only Sonic likes to collect rings, Tails likes to collect rings too"
result:

1) ["Sonic" (9, 14), "collect" (24, 31), "rings" (32,37)] 
2) ["Sonic" (9, 14), "collect" (24, 31), "rings" (62,67)]
3) ["Sonic" (9, 14), "collect" (54, 61), "rings" (62,67)]

The last example,

phrase: "be smart"
sentence: "Donald always wanted to be clever and to be smart"
result: 

1) ["be" (24, 26), "smart" (44, 49)]
2) ["be" (41, 43), "smart" (44, 49)]

I tried to build a regex for it, something like 'Sonic.*collect.*rings' or the non-greedy variant 'Sonic.*?collect.*?rings', but such solutions give only results 1) and 3).

I also tried the third-party regex module with a positive look-behind, '(?<=(Sonic.*collect.*rings))', but it gives only 2 of the 3 captures.

Some code for the Sonic example:

import re

# sonic example, extracting all results
text = ['Sonic', 'collect', 'rings']
builded_regex = '.*'.join([r'\b({})\b'.format(word) for word in text])
for result in re.finditer(builded_regex, 'Not only Sonic likes to collect rings, Tails likes to collect rings too'):
    for i, word in enumerate(text):
        print('"{}" {}'.format(word, result.span(i + 1)), end=' ')  # span of the i-th captured word
    print('')

Output:

"Sonic" (9, 14) "collect" (54, 61) "rings" (62, 67)

What's the best solution for this task, and is there a way to solve it using regex?

What does these results mean? `["be" (24, 2), "smart" (44, 5)]` what is the relation between those coordinates and the words? — Lucas Wieloch, Jul 25 '19 at 06:52
Besides... `"red" (0, 3)` what is this? This is not python. What kind of structure is that — Lucas Wieloch, Jul 25 '19 at 06:54
Do the words have to be found in order, or is out of order fine? — Zachary822, Jul 25 '19 at 06:58

score 0 · Accepted Answer · answered Jul 25 '19 at 07:20

import re
from itertools import product
from operator import itemgetter

phrase = "red moon rises".split()  # split into words

search_space = "red moon and purple moon are rises"

all_word_locs = []

for word in phrase:
    word_locs = []
    # find *all* whole-word occurrences of word in the string (escaped, so it is matched literally)
    for match in re.finditer(r'\b{}\b'.format(re.escape(word)), search_space):
        s, e = match.span()
        word_locs.append((word, s, e))  # save the word with its start and end index

    all_word_locs.append(word_locs)  # gather all the found locations of each word

cart_prod = product(*all_word_locs)  # use the cartesian product to find all combinations

for found in cart_prod:
    locs = list(map(itemgetter(1), found))  # get the location of each found word
    if all(x < y for x, y in zip(locs, locs[1:])):
        print(found)  # only print if the words are found in order

The `all(x < y for x, y in zip(locs, locs[1:]))` check keeps only the combinations in which the word locations are in increasing order.
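The Cartesian product builds every combination and then discards the out-of-order ones. As a possible variation (a minimal sketch, not part of the answer above), the same results can be built word by word, extending only candidates that already are in order. The function name `find_phrase` and the whole-word matching via `\b` are assumptions made for illustration.

```python
import re


def find_phrase(phrase, sentence):
    """Yield every in-order combination of matches as a list of (word, start, end)."""
    words = phrase.split()
    # All whole-word occurrences of each phrase word, in sentence order.
    spans = [
        [m.span() for m in re.finditer(r"\b{}\b".format(re.escape(w)), sentence)]
        for w in words
    ]

    def extend(i, min_start, chosen):
        if i == len(words):  # every word has been placed
            yield chosen
            return
        for start, end in spans[i]:
            if start >= min_start:  # the next word must begin after the previous one ends
                yield from extend(i + 1, end, chosen + [(words[i], start, end)])

    yield from extend(0, 0, [])


sentence = "Not only Sonic likes to collect rings, Tails likes to collect rings too"
for combo in find_phrase("Sonic collect rings", sentence):
    print(combo)
```

For the Sonic sentence this prints the three combinations listed in the question, each word with its (start, end) indices.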

score 0 · Answer 2 · answered Jul 25 '19 at 07:22

Try something like this (not written in Python):

regex reg = "/(Sonic).*(collect).*(rings)/i"
if(reg.match(myString).success)
    myString.find("Sonic")....

First, check whether the phrase exists in the sentence, with the words in the right order.

Then, find all occurrences of every word.
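In Python, those two steps might look roughly like the sketch below. The helper name `word_positions` is hypothetical, and the case-insensitive, whole-word matching is an assumption based on the `/i` flag in the pseudocode.

```python
import re


def word_positions(phrase, sentence):
    """Return {word: [(start, end), ...]} if the phrase occurs in order, else {}."""
    words = phrase.split()
    patterns = [r"\b{}\b".format(re.escape(w)) for w in words]

    # Step 1: check that the words appear in the sentence in the right order.
    if not re.search(".*".join(patterns), sentence, flags=re.IGNORECASE):
        return {}

    # Step 2: collect every occurrence of every word.
    return {
        w: [m.span() for m in re.finditer(p, sentence, flags=re.IGNORECASE)]
        for w, p in zip(words, patterns)
    }


sentence = "Not only Sonic likes to collect rings, Tails likes to collect rings too"
print(word_positions("Sonic collect rings", sentence))
```

Note that this only yields the positions per word. Turning them into the individual results from the question still requires combining the positions and keeping the combinations that are in order.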
