I have a set of phrases on the one hand and plenty of sentences on the other. Each sentence should be checked for whether it contains the words of a phrase in order (not necessarily adjacent), reporting the position of each word as (index_start, index_end).

For example,

phrase: "red moon rises"
sentence: "red moon and purple moon are rises"
result: 

1) ["red" (0, 3), "moon" (4, 8), "rises" (29, 34)]
2) ["red" (0, 3), "moon" (20, 24), "rises" (29, 34)]

Here the word "moon" occurs twice in the sentence, so there are two results.

Another example,

phrase: "Sonic collect rings"
sentence: "Not only Sonic likes to collect rings, Tails likes to collect rings too"
result:

1) ["Sonic" (9, 14), "collect" (24, 31), "rings" (32, 37)]
2) ["Sonic" (9, 14), "collect" (24, 31), "rings" (62, 67)]
3) ["Sonic" (9, 14), "collect" (54, 61), "rings" (62, 67)]

The last example,

phrase: "be smart"
sentence: "Donald always wanted to be clever and to be smart"
result: 

1) ["be" (24, 26), "smart" (44, 49)]
2) ["be" (41, 43), "smart" (44, 49)]

I tried to solve it with a regex, something like 'sonic.*collects.*rings' or the non-greedy variant 'sonic.*?collects.*?rings', but such patterns give only results 1) and 3).
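To make the greedy versus non-greedy behaviour concrete, here is a minimal sketch using only the standard re module. The helper name word_spans is hypothetical, and the patterns are spelled to match the phrase exactly (capitalised "Sonic", "collect" without the trailing "s"), which is an assumption about what the attempts above were meant to be.

```python
import re

sentence = "Not only Sonic likes to collect rings, Tails likes to collect rings too"


def word_spans(pattern, text):
    """Return the span of every capture group of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return [match.span(i) for i in range(1, match.re.groups + 1)]


# Greedy '.*' pushes each following word as far right as possible.
greedy = word_spans(r"\b(Sonic)\b.*\b(collect)\b.*\b(rings)\b", sentence)
# Non-greedy '.*?' takes the nearest following word instead.
lazy = word_spans(r"\b(Sonic)\b.*?\b(collect)\b.*?\b(rings)\b", sentence)

print(greedy)  # [(9, 14), (54, 61), (62, 67)]  -> result 3)
print(lazy)    # [(9, 14), (24, 31), (32, 37)]  -> result 1)
```

Each pattern yields a single match, so the mixed combination (result 2) is never reported by either variant.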

I also tried the third-party regex module with a positive look-behind, '(?<=(Sonic.*collect.*rings))', but it gives only 2 of the 3 captures.

Some code for the Sonic example:

import re

# sonic example, extracting all results
text = ['Sonic', 'collect', 'rings']
builded_regex = '.*'.join([r'\b({})\b'.format(word) for word in text])
for result in re.finditer(builded_regex, 'Not only Sonic likes to collect rings, Tails likes to collect rings too'):
    for i, word in enumerate(text):
        print('"{}" {}'.format(word, result.regs[i + 1]), end=' ')
    print('')

Output:

"Sonic" (9, 14) "collect" (54, 61) "rings" (62, 67) 

What is the best way to solve this task, and is there a way to solve it using regex?

2 Answers

import re
from itertools import product
from operator import itemgetter

phrase = "red moon rises".split()  # split the phrase into words

search_space = "red moon and purple moon are rises"

all_word_locs = []

for word in phrase:
    word_locs = []
    # find *all* whole-word occurrences of the word in the whole string
    for match in re.finditer(r"\b{}\b".format(re.escape(word)), search_space):
        s, e = match.span()
        word_locs.append((word, s, e))  # save the word with its (start, end) indices

    all_word_locs.append(word_locs)  # gather all the found locations of each word

cart_prod = product(*all_word_locs)  # Cartesian product: every combination of occurrences

for found in cart_prod:
    locs = list(map(itemgetter(1), found))  # start index of each chosen word
    if all(x < y for x, y in zip(locs, locs[1:])):  # start indices must be strictly increasing
        print(found)  # only print if the words are found in order

The all(x < y ...) comparison of consecutive start positions is what checks that the words were found in order.
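If the Cartesian product gets large, one alternative is to build the combinations left to right and only extend those that are still in order. The following is a minimal sketch of that idea, assuming whole-word, case-sensitive matching and phrase words made of ordinary word characters. The names find_ordered_matches and extend are hypothetical and not part of the code above.

```python
import re
from typing import Iterator, List, Tuple

Span = Tuple[str, int, int]  # (word, index_start, index_end)


def find_ordered_matches(phrase: str, sentence: str) -> Iterator[List[Span]]:
    """Yield every in-order combination of the phrase's words found in sentence."""
    words = phrase.split()
    # Every whole-word occurrence of each phrase word, left to right.
    occurrences = [
        [(w, m.start(), m.end())
         for m in re.finditer(r"\b{}\b".format(re.escape(w)), sentence)]
        for w in words
    ]

    def extend(index: int, min_start: int, chosen: List[Span]) -> Iterator[List[Span]]:
        if index == len(words):
            yield list(chosen)  # a complete combination
            return
        for occ in occurrences[index]:
            if occ[1] >= min_start:  # must start after the previous word ends
                chosen.append(occ)
                yield from extend(index + 1, occ[2], chosen)
                chosen.pop()

    yield from extend(0, 0, [])


sentence = "Not only Sonic likes to collect rings, Tails likes to collect rings too"
results = list(find_ordered_matches("Sonic collect rings", sentence))
for result in results:
    print(result)
```

For the Sonic sentence this prints the three combinations from the question, in the same order. Combinations that are already out of order are never extended, so they are not generated and filtered afterwards.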

Zachary822

Try something like this (not written in Python):

regex reg = "/(Sonic).*(collect).*(rings)/i"
if(reg.match(myString).success)
    myString.find("Sonic")....

First, check whether the phrase exists in the sentence, with its words in the right order.

Then, collect all the occurrences of every word.
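The two steps could be sketched in Python roughly as follows. This is only an illustration under assumptions: locate_phrase_words is a hypothetical name, matching is whole-word and case-insensitive (mirroring the /i flag), and a phrase that repeats a word would collapse into a single dictionary key.

```python
import re
from typing import Dict, List, Optional, Tuple


def locate_phrase_words(phrase: str, sentence: str) -> Optional[Dict[str, List[Tuple[int, int]]]]:
    """Return {word: [(start, end), ...]} if the phrase occurs in order, else None."""
    words = phrase.split()
    escaped = [re.escape(w) for w in words]

    # Step 1: a single pattern checks that the words occur in the right order.
    in_order = re.compile(
        ".*".join(r"\b{}\b".format(e) for e in escaped), re.IGNORECASE
    )
    if in_order.search(sentence) is None:
        return None

    # Step 2: record the span of every occurrence of every word.
    return {
        word: [m.span() for m in re.finditer(r"\b{}\b".format(e), sentence, re.IGNORECASE)]
        for word, e in zip(words, escaped)
    }


donald = "Donald always wanted to be clever and to be smart"
print(locate_phrase_words("be smart", donald))
# {'be': [(24, 26), (41, 43)], 'smart': [(44, 49)]}
print(locate_phrase_words("smart be", donald))
# None
```

Note that this returns the positions per word, not the individual combinations. Listing results 1), 2), 3) as in the question still requires pairing the positions up and keeping only the combinations whose positions are in increasing order.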