This uses Python's list slicing. phrase[::2] creates a slice consisting of the 0th, 2nd, 4th, 6th... elements of a list. This is the basis of the following solution.
For each phrase, a | symbol is put on either side of every occurrence found. The following shows 'this is' being marked in 'hello this is me':
'hello this is me' -> 'hello|this is|me'
When the text is split on |:
['hello', 'this is', 'me']
the even-numbered elements [::2] are non-matches, the odd elements [1::2] are the matched phrases:
               0         1       2
unmatched: ['hello',            'me']
matched:             'this is',
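A minimal sketch of the split-and-slice step on the example above. The variable names are illustrative and do not appear in the solution code.

```python
# Mark the phrase with '|' delimiters, then split on them.
marked = 'hello this is me'.replace(' this is ', '|this is|')
segments = marked.split('|')

unmatched = segments[::2]   # even indexes: text outside any phrase
matched = segments[1::2]    # odd indexes: the delimited phrases

print(segments)   # ['hello', 'this is', 'me']
print(unmatched)  # ['hello', 'me']
print(matched)    # ['this is']
```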
If there are different numbers of matched and unmatched elements in the segment, the gaps are filled with empty strings using zip_longest so that there is always a balanced pair of unmatched and matched text:
               0         1       2    3
unmatched: ['hello',            'me',    ]
matched:             'this is',       ''
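The padding can be checked in isolation. This short sketch pairs the two lists from the diagram; the list names are illustrative.

```python
from itertools import zip_longest

unmatched = ['hello', 'me']
matched = ['this is']

# zip_longest pads the shorter list with fillvalue, so every
# unmatched segment gets a matched partner (possibly empty).
pairs = list(zip_longest(unmatched, matched, fillvalue=''))
print(pairs)  # [('hello', 'this is'), ('me', '')]
```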
For each phrase, only the previously unmatched (even-numbered) elements of the text are scanned, and the phrase, if found, is delimited with |.
The matched and unmatched segments are then merged back into the segmented text using zip_longest() followed by flatten(), taking care to maintain the even (unmatched) and odd (matched) indexes of new and existing text segments. The newly-matched phrases are merged back in as odd-numbered elements, so they will not be scanned again for embedded phrases. This prevents conflict between phrases with similar wording like "this is" and "this".
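The even/odd bookkeeping can also be looked at on its own. The following is a sketch and not the code of this solution: it swaps the replace-and-split delimiting for re.split() with a capturing group, and mark_phrase is a hypothetical helper name. It assumes phrases start and end with word characters, so that \b marks their boundaries.

```python
import re

def mark_phrase(segments, phrase):
    """Scan only even-indexed (unmatched) segments for phrase.

    In the returned list, even indexes still hold unmatched text and
    odd indexes hold matched phrases.
    """
    pattern = rf'\b({re.escape(phrase)})\b'
    result = []
    for index, segment in enumerate(segments):
        if index % 2:
            # Odd index: already matched, copy without rescanning.
            result.append(segment)
        else:
            # With one capturing group, re.split returns an odd-length
            # list alternating unmatched, matched, unmatched, ...
            result.extend(re.split(pattern, segment))
    return result

segments = mark_phrase(['hello this is me'], 'this is')
print(segments)  # ['hello ', 'this is', ' me']

# 'this' is not found again inside the already matched 'this is'.
print(mark_phrase(segments, 'this'))  # ['hello ', 'this is', ' me']
```

Because every even segment expands into an odd number of pieces, the segment that follows it always lands back on an odd index.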
flatten() is used throughout. It finds sub-lists (or tuples) embedded one level down in a larger list and flattens their contents into the main list:
['outer list 1', ['inner list 1', 'inner list 2'], 'outer list 2']
becomes:
['outer list 1', 'inner list 1', 'inner list 2', 'outer list 2']
This is useful for collecting phrases from multiple embedded lists, as well as merging split or zipped sublists back into the segmented text:
[['the quick brown fox says', ''], ['hello', 'this is', 'me', '']] ->
['the quick brown fox says', '', 'hello', 'this is', 'me', ''] ->
                        0               1      2         3       4    5
unmatched: ['the quick brown fox says',     'hello',            'me',     ]
matched:                                '',          'this is',       '',
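As a quick check of the merge shown above, this sketch repeats the flatten() logic described in this answer and applies it to the same nested list. The names nested and merged are illustrative.

```python
def flatten(items):
    """Flatten lists and tuples exactly one level deep."""
    flat = []
    for el in items:
        if isinstance(el, (list, tuple)):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

nested = [['the quick brown fox says', ''], ['hello', 'this is', 'me', '']]
merged = flatten(nested)

print(merged[::2])   # ['the quick brown fox says', 'hello', 'me']
print(merged[1::2])  # ['', 'this is', '']

# Only one level is flattened; deeper nesting is left untouched.
print(flatten([1, [2, [3]]]))  # [1, 2, [3]]
```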
At the very end, the elements that are empty strings, which were just for even-odd alignment, can be removed:
['the quick brown fox says', '', 'hello', 'this is', '', 'me', ''] ->
['the quick brown fox says', 'hello', 'this is', 'me']
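The clean-up is a plain filter. A small sketch on the list above, with illustrative variable names:

```python
segments = ['the quick brown fox says', '', 'hello', 'this is', '', 'me', '']

# Drop the padding. After this the even/odd alignment no longer holds,
# so it must only be done once all phrases have been processed.
cleaned = [segment for segment in segments if segment.strip()]
print(cleaned)  # ['the quick brown fox says', 'hello', 'this is', 'me']
```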
from itertools import zip_longest

texts = [['hello this is me'], ['oh you know u'],
         ['the quick brown fox says hello this is me']]
phrases_to_match = [['this is', 'u'], ['oh you', 'you', 'me']]


def flatten(string_list):
    flat = []
    for el in string_list:
        if isinstance(el, (list, tuple)):
            flat.extend(el)
        else:
            flat.append(el)
    return flat


phrases_to_match = flatten(phrases_to_match)
# longer phrases are given priority to avoid problems with overlapping
phrases_to_match.sort(key=lambda phrase: -len(phrase.split()))

segmented_texts = []
for text in flatten(texts):
    segmented_text = text.split('|')
    for phrase in phrases_to_match:
        new_segments = segmented_text[::2]
        delimited_phrase = f'|{phrase}|'
        for match in [f' {phrase} ', f' {phrase}', f'{phrase} ']:
            new_segments = [
                segment.replace(match, delimited_phrase)
                for segment
                in new_segments
            ]
        new_segments = flatten([segment.split('|') for segment in new_segments])
        segmented_text = new_segments if len(segmented_text) == 1 else \
            flatten(zip_longest(new_segments, segmented_text[1::2], fillvalue=''))
    segmented_text = [segment for segment in segmented_text if segment.strip()]
    # option 1: unmatched text is split into words
    segmented_text = flatten([
        segment if segment in phrases_to_match else segment.split()
        for segment
        in segmented_text
    ])
    segmented_texts.append(segmented_text)

print(segmented_texts)
Results:
[['hello', 'this is', 'me'], ['oh you', 'know', 'u'],
['the', 'quick', 'brown', 'fox', 'says', 'hello', 'this is', 'me']]
Notice that the phrase 'oh you' has taken precedence over the subset phrase 'you' and there is no conflict.
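That precedence comes from sorting the phrases by word count before matching. A short sketch of the resulting order, using the flattened phrase list from this answer (the name phrases is illustrative):

```python
phrases = ['this is', 'u', 'oh you', 'you', 'me']

# Most words first. Python's sort is stable, so phrases with the same
# word count keep their original relative order.
phrases.sort(key=lambda phrase: -len(phrase.split()))
print(phrases)  # ['this is', 'oh you', 'u', 'you', 'me']
```

'oh you' is therefore tried before 'you', and once it has been moved to an odd index it is no longer scanned.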