Find sequences of words in sublists in Python

Question

In some NLP task I have a nested list of strings:

    [['Start', 'двигаться', 'другая', 'сторона', 'света', 'надолго', 'скоро'], 
     ['Start', 'двигаться', 'другая', 'сторона', 'света', 'чтобы', 'посмотреть'],
     ['Start', 'двигаться', 'новая', 'планета'],
     ['Start', 'двигаться', 'сторона', 'признание', 'суверенитет', 'израильский'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'на'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'оккупировать'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'Голанский'],
     ['Start', 'двигаться', 'сторона', 'признание', 'и']]

I need an algorithm to find two or more elements, which are common for two or more sublists and make a single element from them. in my example, 'Start', 'двигаться' is common for all elements, so it should become single string. 'сторона', 'света', 'надолго' is common for two elements, so it become single string. 'сторона', 'признание' is common for 5 elements, so it become single string. If there are no common elements left, just add the rest elements as a single string. Desired output:

    [['Start двигаться', 'другая сторона света', 'надолго скоро'], 
     ['Start двигаться', 'другая сторона света', 'чтобы посмотреть'],
     ['Start двигаться', 'новая планета'],
     ['Start двигаться', 'сторона признание', 'суверенитет израильский'],
     ['Start двигаться', 'сторона признание', 'высот на'],
     ['Start двигаться', 'сторона признание', 'высот оккупировать'],
     ['Start двигаться', 'сторона признание', 'высот Голанский'],
     ['Start двигаться', 'сторона признание', 'и']]

So far I tried some loops and element comparison:

for elem,next_elem in zip(lst, lst[1:]+[lst[0]]):
    if elem[0] == next_elem[0] and elem[1] == next_elem[1] and elem[2] == next_elem[2]:
        elem[0:3] = [' '.join(elem[0:3])]

    if elem[0] == next_elem[0] and elem[1] == next_elem[1]:
        elem[0:2] = [' '.join(elem[0:2])]

But I don't think that's the right way. Sets are also not an option since there can be multiple occurrences of one element in the sublist. I checked other LCS topics but didn't find a solution. Any working algorithm that does the job will be great, efficiency is unimportant at the moment. Some more examples:

[[a,b,c,d],
 [a,b,d,e,f]]

Should become:

[[ab,cd],
 [ab,def]]

Since a,b are common element, and cd, def just become single element.

[[a,b,c,d,e,g],
[a,b,c,d,g,h],
[a,b,h,h,i]]

Should become:

[[ab,cd,eg],
 [ab,cd,gh],
 [ab,hhi]]

Since ab and cd are cannon for two or more sublists

And:

[[a,b,c],
 [a,b,d]]

Becomes:

[[ab, c],
 [ab, d]]

Since c, d are not common elements

Look [here](https://stackoverflow.com/questions/1388818/how-can-i-compare-two-lists-in-python-and-return-matches). — Tristhal, Jun 04 '18 at 13:13
Waht if you have `[[a, b], [a, b, c], [b, c]]`? Also, why is `'и'` joined to the strings before, but `'высот'` is not? — tobias_k, Jun 04 '18 at 14:11
@tobias_k Thanks for your reply! In that case output will be `[[a b], [a b c], [b c]]` since 'c' will be single left element and should be added to previous sequence and 'b c' is unique element. — Alex Nikitin, Jun 04 '18 at 14:15

score 1 · Answer 1 · answered Jun 04 '18 at 13:52

I suggest you use hashmaps key:word, value: Integer as counter, starts as 0. (This is dictionary in python). For every line, hash each values and increase the counter. At the end, for every word that has a counter of 2 or more, you concatenate them.

I've left out code and the part where you concatenate only strings with same counter, and repitition since this seems like homework.

tobias_k · Answer 2 · 2018-06-05T11:16:27.740

You could start with creating a prefix-tree representing your lists:

lists = [['Start', 'двигаться', 'другая', 'сторона', 'света', 'надолго', 'скоро'], 
         ['Start', 'двигаться', 'другая', 'сторона', 'света', 'чтобы', 'посмотреть'],
         ['Start', 'двигаться', 'новая', 'планета'],
         ['Start', 'двигаться', 'сторона', 'признание'],
         ['Start', 'двигаться', 'сторона', 'признание', 'суверенитет', 'израильский'],
         ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'на'],
         ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'оккупировать'],
         ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'Голанский'],
         ['Start', 'двигаться', 'сторона', 'признание', 'и']]

tree = {}
end = "END"
for lst in lists:
    d = tree
    for x in lst:
        d = d.setdefault(x, {})
    d[end] = {}

Result (here, END marks where a sentence has ended):

{'Start': {'двигаться': {'другая': {'сторона': {'света': {'надолго': {'скоро': {'END': {}}},
                                                          'чтобы': {'посмотреть': {'END': {}}}}}},
                         'новая': {'планета': {'END': {}}},
                         'сторона': {'признание': {'END': {},
                                                   'высот': {'Голанский': {'END': {}},
                                                             'на': {'END': {}},
                                                             'оккупировать': {'END': {}}},
                                                   'и': {'END': {}},
                                                   'суверенитет': {'израильский': {'END': {}}}}}}}}

Now, you can recursively traverse that tree, and whenever a node has only a single child (a sub-dict with just a single element), join those nodes.

def join(d, pref=[]):
    if end in d:
        yield [' '.join(pref)] if pref else []
    for k, v in d.items():
        if len(v) == 1:
            for x in join(v, pref + [k]): # add node to prefix
                yield x                   # yield next segment
        else:
            for x in join(v, []):         # reset prefix
                yield [' '.join(pref + [k])] + x # yield node + prefix and next

Output is not exactly as in your question, but very close. It will join all the parts that have only a single child in the tree, i.e. afterwards segments should be maximal while no segment is part of a longer segment.

>>> for x in join(tree):
...     print(x)
...
['Start двигаться', 'другая сторона света', 'надолго скоро']
['Start двигаться', 'другая сторона света', 'чтобы посмотреть']
['Start двигаться', 'новая планета']
['Start двигаться', 'сторона признание']
['Start двигаться', 'сторона признание', 'суверенитет израильский']
['Start двигаться', 'сторона признание', 'высот', 'на']
['Start двигаться', 'сторона признание', 'высот', 'оккупировать']
['Start двигаться', 'сторона признание', 'высот', 'Голанский']
['Start двигаться', 'сторона признание', 'и']

Here's an illustration of the tree-based approach. Colors indicate parts without any branching that will be merged; end-nodes are bold (those do not have to be leaf-nodes).

@AlexNikitin Thanks, although I have the feeling that `join` function is more complicated than necessary. Will revisit later... — tobias_k, Jun 04 '18 at 14:43
Broke the algorithm by changing `новая планета` to `сторона планета`: `[['Start двигаться', 'другая сторона света', 'надолго скоро'], ['Start двигаться', 'другая сторона света', 'чтобы посмотреть'], ['Start двигаться', 'сторона планета'], ['Start двигаться', 'сторона', 'признание', 'суверенитет израильский'], ['Start двигаться', 'сторона', 'признание', 'высот на'], ['Start двигаться', 'сторона', 'признание', 'высот оккупировать'], ['Start двигаться', 'сторона', 'признание', 'высот Голанский'], ['Start двигаться', 'сторона', 'признание и']]` — Alex Nikitin, Jun 05 '18 at 06:46
@AlexNikitin Can you check again? At least I can't reproduce that problem any longer. — tobias_k, Jun 05 '18 at 11:17
Thanks again, visualisation is really great! It seems that there is some problem in tree building. if I change `'новая', 'планета'` to `'сторона', 'планета'` in input, `'сторона', 'признание'` become separated elements in the output. can reproduce with similar cases, I double-checked. — Alex Nikitin, Jun 05 '18 at 11:30
@AlexNikitin Ah, okay, then I misread your previous comment (or rather the modification you made). But that would be to be expected using this approach, as then the `сторона` node has more than one children. TBH, I think I still haven't entirely understood your rules for joining segments. E.g., why to join `'высот на'`? Could you elaborate again? (preferrably in the question itself, not in this comment thread) — tobias_k, Jun 05 '18 at 11:42
Thanks for your time! In my post there is: algorithm to find **two or more elements**, which are common for two or more sublists. so single element can't become node. and about `'высот на'`: If there are no common elements left, just add the rest elements as a single string. since `высот` is a single common element, it can't form node, so it just become one string with the rest words in sequence. Thanks again, you really help our little project! — Alex Nikitin, Jun 05 '18 at 12:28
@AlexNikitin I still don't understand all of this, and to be honest, I am not sure that you yourself really understand what you want _exactly_ -- if I'm wrong, please elaborate further _in the question itself_, preferrably with some more examples. For instance, how should `[[a, b, c], [a, b, d],[ a, b, e]]` be grouped? As [[ab, c], [ab, d], [ab,e]]`, or `[[abc], [abd], [abe]]`, as there should be no single elements? — tobias_k, Jun 05 '18 at 18:37
Edited the part with single elements. It will be more easy to add them after building the tree, anyway. Added more examples, hope it helps! — Alex Nikitin, Jun 06 '18 at 06:46
In simple words I can say that we need a tree as you created, but with minimal length of node of 2 elements. If there are no 2 or more common elements left, add the rest as single element. — Alex Nikitin, Jun 06 '18 at 09:45

Find sequences of words in sublists in Python

2 Answers2