6

I have two lists of lists:

text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]

I need to split the text making word combinations present in phrases a single string:

result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

I tried using zip(), but it iterates through the lists consecutively, while I need to check each and every list. I also tried the find() method, but in this example it would also find the letter 'u' inside words and make that a string (e.g. in the word 'you' it produces 'yo', 'u'). I wish replace() worked when replacing a string with a list too, because it would let me do something like:

for line in text:
    line = line.replace('this is', ['this is'])

But trying everything, I still haven't found anything that works for me in this situation. Can you help me with that?

kat678
  • I don't understand why there is a list wrapping each individual string in the input, instead of just having `['hello this is me', 'oh you know u']`. – Karl Knechtel Jan 24 '21 at 05:31
  • Should longer phrases override shorter ones? – Nic Jan 24 '21 at 05:44
  • @KarlKnechtel it doesn't have to be there if it helps to solve the problem but initially text is divided into multiple lists by paragraphs – kat678 Jan 24 '21 at 13:02
  • @Nick if you're talking about repetitions of phrases, like 'this is' and 'this', it shouldn't be the case because phrases are unique. But if that happens, yes, longer phrases override shorter ones – kat678 Jan 24 '21 at 13:06

4 Answers

3

Clarified with original poster:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen',

the result should be:

['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']

not:

['pack my box with', 'five dozen', 'liquor jugs']

Each text and phrase is converted to a Python list of words ['this', 'is', 'an', 'example'] which prevents 'u' being matched inside a word.
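This distinction is easy to see in isolation (a quick illustration, not part of the answer's code): substring search has no notion of word boundaries, while membership in a word list matches only whole words.

```python
text = 'oh you know'

# substring search: 'u' is found inside the word 'you'
print('u' in text)          # True

# word-list search: only whole words can match
print('u' in text.split())  # False
```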

All possible subphrases of the text are generated by compile_subphrases(). Longer phrases (more words) are generated first so they are matched before shorter ones. 'five dozen jugs' would always be matched in preference to 'five dozen' or 'five'.

Phrase and subphrase are compared using list slices, roughly like this:

    text = ['five', 'dozen', 'liquor', 'jugs']
    phrase = ['liquor', 'jugs']
    if text[2:4] == phrase:
        print('matched')

Using this method for comparing phrases, the script walks through the original text, rewriting it with the phrases picked out.

texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
from itertools import chain

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Results:

$ python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

For testing this and the other answers with larger datasets, try this at the start of the code. It generates hundreds of variations on a single long sentence to simulate hundreds of texts.

from itertools import chain, combinations
import random

#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
])
variations = [
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
]
texts = random.choices(variations, k=500)
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]
Nic
  • It should have given the first output but anyway, your answers helped me a lot with my problem, so thank you! – kat678 Feb 09 '21 at 23:17
  • Great, I'm glad they helped. Option 1 (split unmatched text into separate words) is now implemented in both answers. – Nic Feb 10 '21 at 20:50
2

Try this out.

import re

def filter_phrases(phrases):
    phrase_l = sorted(phrases, key=len)
    
    for i, v in enumerate(phrase_l):
        for j in phrase_l[i + 1:]:
            if re.search(rf'\b{v}\b', j):
                phrases.remove(v)
    
    return phrases


text = [
    ['hello this is me'], 
    ['oh you know u'],
    ['a quick brown fox jumps over the lazy dog']
]
phrases = [
    ['this is', 'u'], 
    ['oh you', 'me'],
    ['fox', 'brown fox']
]

# Flatten the `text` and `phrases` list
text = [
    line for l in text 
    for line in l
]
phrases = {
    phrase for l in phrases 
    for phrase in l
}

# If you're quite sure that your phrase
# list doesn't have any overlapping 
# zones, then I strongly recommend 
# against using this `filter_phrases()` 
# function.
phrases = filter_phrases(phrases)

result = []

for line in text:
    # This is the pattern to match the
    # 'space' before the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_1 = '|'.join([
        f'(?={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    # This is the pattern to match the
    # 'space' after the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_2 = '|'.join([
        f'(?<={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    
    # Now, we combine the both patterns
    # `l_phrase_1` and `l_phrase_2` to
    # create our master regex. 
    result.append(re.split(
        rf'\s(?:{l_phrase_1})|(?:{l_phrase_2})\s', 
        line
    ))
    
print(result)

# OUTPUT (PRETTY FORM)
#
# [
#     ['hello', 'this is', 'me'], 
#     ['oh you', 'know', 'u'], 
#     ['a quick', 'brown fox', 'jumps over the lazy dog']
# ]

Here, I've used re.split to split before or after phrase in the text.
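For illustration, here is a stripped-down sketch of the same idea with a single hard-coded phrase, showing how the lookahead/lookbehind alternation splits on the space around the phrase without consuming the phrase itself:

```python
import re

line = 'hello this is me'
# split on the space *before* the phrase (lookahead)
# or the space *after* it (lookbehind)
parts = re.split(r'\s(?=this is)|(?<=this is)\s', line)
print(parts)  # ['hello', 'this is', 'me']
```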

Tsubasa
  • if one of the texts is `hi how are you doing good you know` and with this `phrases = [['this is', 'u'], ['oh you', 'me'], ['hi how', 'are', 'good you']]` the result is weird, like `['hi how', 'are', 'you', 'doing', 'good you', 'know']` ... it splits on `you` – Naveen Jan 24 '21 at 09:13
  • Intriguing approach. Nice! – Nic Jan 25 '21 at 03:37
  • I'm not OP I'm afraid. Also there are some not quite right results when I run my own data through it-- I think because the order of substitution of phrases is not defined in the regex. But I liked the way you built the regex in the first place :) – Nic Jan 25 '21 at 06:25
  • phrases_to_match = [['crazy aunt', 'quick brown', 'the quick', 'brown fox'], ['jumps over', 'over the', 'the lazy', 'lazy dog'], ['whirls and whorls', 'whirls', 'whorls']] -> results = ..., ['the jumps aunt and', 'whorls'], ... ?? – Nic Jan 25 '21 at 06:27
  • @Nick What's the text list? `['A quick brown fox jumps over the lazy dog']`? – Tsubasa Jan 25 '21 at 07:09
  • Please see large dataset generator at end of my answer – Nic Jan 25 '21 at 07:11
  • Using your dataset in my solution, it is creating an overlapping region. For example for the sake of simplicity, if the text is `'A quick brown fox jumps over the lazy dog'` and the phrase list is `['brown fox', 'fox']` (order doesn't matter), it will match the spaces before and after `'brown fox'` and `'fox'` and the resultant split will be `['A quick', 'brown', 'fox', 'jumps over the lazy dog']`. This needs to be clarified by OP. – Tsubasa Jan 25 '21 at 08:02
  • One more thing that I noticed in your solution is that it's giving some unexpected results if we take the above example. I explained [here](https://mystb.in/ExtractSurfCommons.properties.py) in the comments below. Please check it. – Tsubasa Jan 25 '21 at 08:02
  • Thanks Xua, agree it needs clarifying. In your example my results are as I expect. – Nic Jan 25 '21 at 18:11
  • @Xua given `['fox', 'brown fox']` phrases, `'brown fox'` should be prioritised. See comments under the OP post. – Nic Jan 29 '21 at 03:47
2

This uses Python's best-in-class list slicing. phrase[::2] creates a list slice consisting of the 0th, 2nd, 4th, 6th... elements of a list. This is the basis of the following solution.

For each phrase, a | symbol is put on either side of each found phrase. The following shows 'this is' being marked in 'hello this is me':

'hello this is me' -> 'hello|this is|me'

When the text is split on |:

['hello', 'this is', 'me']

the even-numbered elements [::2] are non-matches, the odd elements [1::2] are the matched phrases:

                   0         1       2
unmatched:     ['hello',            'me']
matched:                 'this is',       

If there are different numbers of matched and unmatched elements in the segment, the gaps are filled with empty strings using zip_longest so that there is always a balanced pair of unmatched and matched text:

                   0         1       2     3
unmatched:     ['hello',            'me',     ]
matched:                 'this is',        ''  
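The even/odd bookkeeping can be seen in isolation (a minimal sketch assuming the `|`-delimited form described above):

```python
from itertools import zip_longest

segments = 'hello|this is|me'.split('|')  # ['hello', 'this is', 'me']
unmatched = segments[::2]                 # even elements: ['hello', 'me']
matched = segments[1::2]                  # odd elements:  ['this is']
# pad the shorter stream with '' so the pairs stay aligned
pairs = list(zip_longest(unmatched, matched, fillvalue=''))
print(pairs)  # [('hello', 'this is'), ('me', '')]
```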

For each phrase, the previously unmatched (even-numbered) elements of the text are scanned, the phrase (if found) delimited with | and the results merged back into the segmented text.

The matched and unmatched segments are merged back into the segmented text using zip() followed by flatten(), taking care to maintain the even (unmatched) and odd (matched) indexes of new and existing text segments. The newly-matched phrases are merged back in as odd-numbered elements, so they will not be scanned again for embedded phrases. This prevents conflict between phrases with similar wording like "this is" and "this".

flatten() is used everywhere. It finds sub-lists embedded in a larger list and flattens their contents down into the main list:

['outer list 1', ['inner list 1', 'inner list 2'], 'outer list 2']

becomes:

['outer list 1', 'inner list 1', 'inner list 2', 'outer list 2']

This is useful for collecting phrases from multiple embedded lists, as well as merging split or zipped sublists back into the segmented text:

[['the quick brown fox says', ''], ['hello', 'this is', 'me', '']] ->

['the quick brown fox says', '', 'hello', 'this is', 'me', ''] ->

                   0                        1       2        3          4     5
unmatched:     ['the quick brown fox says',         'hello',            'me',    ]
matched:                                    '',              'this is',       '',

At the very end, the elements that are empty strings, which were just for even-odd alignment, can be removed:

['the quick brown fox says', '', 'hello', 'this is', '', 'me', ''] ->
['the quick brown fox says', 'hello', 'this is', 'me']
texts = [['hello this is me'], ['oh you know u'],
         ['the quick brown fox says hello this is me']]
phrases_to_match = [['this is', 'u'], ['oh you', 'you', 'me']]
from itertools import zip_longest

def flatten(string_list):
    flat = []
    for el in string_list:
        if isinstance(el, list) or isinstance(el, tuple):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

phrases_to_match = flatten(phrases_to_match)
# longer phrases are given priority to avoid problems with overlapping
phrases_to_match.sort(key=lambda phrase: -len(phrase.split()))
segmented_texts = []
for text in flatten(texts):
    segmented_text = text.split('|')
    for phrase in phrases_to_match:
        new_segments = segmented_text[::2]
        delimited_phrase = f'|{phrase}|'
        for match in [f' {phrase} ', f' {phrase}', f'{phrase} ']:
            new_segments = [
                segment.replace(match, delimited_phrase)
                for segment
                in new_segments
            ]
        new_segments = flatten([segment.split('|') for segment in new_segments])
        segmented_text = new_segments if len(segmented_text) == 1 else \
            flatten(zip_longest(new_segments, segmented_text[1::2], fillvalue=''))
    segmented_text = [segment for segment in segmented_text if segment.strip()]
    # option 1: unmatched text is split into words
    segmented_text = flatten([
        segment if segment in phrases_to_match else segment.split()
        for segment
        in segmented_text
    ])
    segmented_texts.append(segmented_text)
print(segmented_texts)

Results:

[['hello', 'this is', 'me'], ['oh you', 'know', 'u'],
 ['the', 'quick', 'brown', 'fox', 'says', 'hello', 'this is', 'me']]

Notice that the phrase 'oh you' has taken precedence over the subset phrase 'you' and there is no conflict.
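That precedence comes from the word-count sort near the top of the script; the same key in isolation:

```python
phrases = ['you', 'oh you', 'this is']
# phrases with more words sort first, so 'oh you' is tried before 'you'
phrases.sort(key=lambda phrase: -len(phrase.split()))
print(phrases)  # ['oh you', 'this is', 'you']
```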

Nic
0

This is a quasi-complete answer, something to get you started:

ASSUMPTIONS: looking at your example, I see no reason why the phrases must remain split, since your 2nd text is splitting on "u", which is in the first list item in "phrases".

Prep

Flatten the phrases "list-of-lists" into a single list. I've seen this elsewhere; an example:

flatten = lambda t: [item for sublist in t for item in sublist if item != '']

main code:

My strategy is to look at each item in the texts list (at the beginning it will just be a single item) and attempt a split on each phrase in phrases. If a split is found, a change occurs (which I mark with a flag to keep track); I substitute that item for its split-up counterpart, then flatten (so it's all a single list). Then I start looping over from the beginning IF a change occurred (starting over because there's no way to tell whether something later in the phrases list could also be split earlier).

import re

# flatten one level of nesting; plain strings pass through unchanged,
# and the empty strings that re.split leaves behind are dropped
flatten = lambda t: [item for sub in t
                     for item in (sub if isinstance(sub, list) else [sub])
                     if item != '']

text = [['hello this is me'], ['oh you know u']]
phrases = ['this is', 'u', 'oh you', 'me']

output = []
for t in text:
    t_copy = t[:]
    changed = True
    while changed:  # rescan from the top whenever a split occurred
        changed = False
        for i, tc in enumerate(t_copy):
            for p in phrases:
                before = [tc]  # each item is a string; wrap it in a list to compare like with like
                found = [f.strip() for f in re.split(f'({p})', tc) if f.strip()]
                if found != before:
                    t_copy[i] = found
                    t_copy = flatten(t_copy)  # flatten so t_copy stays a single flat list
                    changed = True
                    break
            if changed:
                break
    output.append(t_copy)
print(output)  # note: 'u' also matches inside 'you', giving 'oh yo', 'u' — the known limitation

comments:

I modified the flatten function to remove empty entries. I've noticed if you're splitting on something that occurs at an endpoint, an empty entry is added: ("I love u" split on "u" > ["I love", "u", ''])
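That endpoint behaviour is easy to reproduce directly (a quick illustration of the point above):

```python
import re

# splitting on a phrase that sits at the end of the string
# leaves a trailing empty entry, which flatten() filters out
print(re.split('(u)', 'I love u'))  # ['I love ', 'u', '']
```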

Dharman
Tclack88