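I have a set of phrases on the one hand and plenty of sentences on the other. Each sentence should be checked for whether it contains the words of a phrase in order (not necessarily adjacent), and every such combination should be reported with the position of each word (index_start, index_end).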
For example,
phrase: "red moon rises"
sentence: "red moon and purple moon are rises"
result:
1) ["red" (0, 3), "moon" (4, 8), "rises" (29,34)]
2) ["red" (0, 3), "moon" (20, 24), "rises" (29,34)]
Here, there are two different occurrences of the word "moon", so there are two results.
Another example,
phrase: "Sonic collect rings"
sentence: "Not only Sonic likes to collect rings, Tails likes to collect rings too"
result:
1) ["Sonic" (9, 14), "collect" (24, 31), "rings" (32,37)]
2) ["Sonic" (9, 14), "collect" (24, 31), "rings" (62,67)]
3) ["Sonic" (9, 14), "collect" (54, 61), "rings" (62,67)]
The last example,
phrase: "be smart"
sentence: "Donald always wanted to be clever and to be smart"
result:
1) ["be" (24, 26), "smart" (44, 49)]
2) ["be" (41, 43), "smart" (44, 49)]
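To pin down the expected output precisely, here is a minimal brute-force sketch that reproduces the results listed above. It is not a regex-only solution, just a reference for what "all results" means. The helper name `find_phrase_combinations` is hypothetical, and the sketch assumes words are matched as whole words (`\b`) and case-sensitively.

```python
import re
from itertools import product


def find_phrase_combinations(phrase, sentence):
    """Return every in-order combination of word spans (hypothetical helper)."""
    words = phrase.split()
    # Collect all (start, end) spans of each phrase word, matched as a whole word.
    spans_per_word = [
        [m.span() for m in re.finditer(r'\b{}\b'.format(re.escape(word)), sentence)]
        for word in words
    ]
    results = []
    # Pick one span per word in every possible way and keep only the picks
    # whose spans run strictly left to right without overlapping.
    for combo in product(*spans_per_word):
        if all(prev[1] <= cur[0] for prev, cur in zip(combo, combo[1:])):
            results.append(list(zip(words, combo)))
    return results


sentence = 'Not only Sonic likes to collect rings, Tails likes to collect rings too'
for combo in find_phrase_combinations('Sonic collect rings', sentence):
    print(' '.join('"{}" {}'.format(word, span) for word, span in combo))
```

Note that this enumerates the full Cartesian product of occurrences, so it can get slow for long phrases with many repeated words; it is meant only to define the desired result set.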
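I tried to build a regex for it, something like 'Sonic.*collect.*rings' or the non-greedy variant 'Sonic.*?collect.*?rings'. But each of these finds a single match, so together they give only results 1) and 3): the non-greedy pattern yields 1) and the greedy one yields 3).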
I also tried the third-party regex module with a positive look-behind, '(?<=(Sonic.*collect.*rings))', but it gives only 2 of the 3 captures.
Here is some code for the Sonic example:
import re

# Sonic example: try to extract all results
text = ['Sonic', 'collect', 'rings']
# Join the whole-word groups with a greedy '.*' between them
builded_regex = '.*'.join([r'\b({})\b'.format(word) for word in text])
sentence = 'Not only Sonic likes to collect rings, Tails likes to collect rings too'
for result in re.finditer(builded_regex, sentence):
    # Group i + 1 holds the span of the i-th phrase word
    for i, word in enumerate(text):
        print('"{}" {}'.format(word, result.span(i + 1)), end=' ')
    print()
Output:
"Sonic" (9, 14) "collect" (54, 61) "rings" (62, 67)
What is the best approach to this task, and is there a way to solve it using a regex alone?