In an NLP task I have a nested list of strings:
[['Start', 'двигаться', 'другая', 'сторона', 'света', 'надолго', 'скоро'],
['Start', 'двигаться', 'другая', 'сторона', 'света', 'чтобы', 'посмотреть'],
['Start', 'двигаться', 'новая', 'планета'],
['Start', 'двигаться', 'сторона', 'признание', 'суверенитет', 'израильский'],
['Start', 'двигаться', 'сторона', 'признание', 'высот', 'на'],
['Start', 'двигаться', 'сторона', 'признание', 'высот', 'оккупировать'],
['Start', 'двигаться', 'сторона', 'признание', 'высот', 'Голанский'],
['Start', 'двигаться', 'сторона', 'признание', 'и']]
I need an algorithm that finds runs of two or more elements that are common to two or more sublists and merges each run into a single element. In my example, 'Start', 'двигаться' is common to all sublists, so it should become a single string. 'другая', 'сторона', 'света' is common to two sublists, so it becomes a single string. 'сторона', 'признание' is common to 5 sublists, so it becomes a single string. If there are no common elements left, the remaining elements are just joined into a single string.
Desired output:
[['Start двигаться', 'другая сторона света', 'надолго скоро'],
['Start двигаться', 'другая сторона света', 'чтобы посмотреть'],
['Start двигаться', 'новая планета'],
['Start двигаться', 'сторона признание', 'суверенитет израильский'],
['Start двигаться', 'сторона признание', 'высот на'],
['Start двигаться', 'сторона признание', 'высот оккупировать'],
['Start двигаться', 'сторона признание', 'высот Голанский'],
['Start двигаться', 'сторона признание', 'и']]
So far I tried some loops and element comparison:
for elem, next_elem in zip(lst, lst[1:] + [lst[0]]):
    if elem[0] == next_elem[0] and elem[1] == next_elem[1] and elem[2] == next_elem[2]:
        elem[0:3] = [' '.join(elem[0:3])]
    if elem[0] == next_elem[0] and elem[1] == next_elem[1]:
        elem[0:2] = [' '.join(elem[0:2])]
But I don't think that's the right way. Sets are not an option either, since an element can occur multiple times in a sublist. I checked other LCS topics but didn't find a solution. Any working algorithm that does the job would be great; efficiency is unimportant at the moment. Some more examples:
[[a,b,c,d],
[a,b,d,e,f]]
Should become:
[[ab,cd],
[ab,def]]
Since a,b are common elements, and cd and def each just become a single element.
[[a,b,c,d,e,g],
[a,b,c,d,g,h],
[a,b,h,h,i]]
Should become:
[[ab,cd,eg],
[ab,cd,gh],
[ab,hhi]]
Since ab and cd are common to two or more sublists.
And:
[[a,b,c],
[a,b,d]]
Becomes:
[[ab, c],
[ab, d]]
Since c and d are not common elements.
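
To pin down the rule these examples seem to imply, here is a minimal sketch of one possible reading, not a definitive solution. It groups sublists by the element at the current position, measures how long each group keeps agreeing, and merges that run only if the group has at least two sublists and the run has at least two elements; otherwise the rest of each sublist is joined as-is. The names `merge_common_runs` and `min_run` are hypothetical, and the handling of a shared run of length one (such as 'высот' above) is an assumption taken from the desired output. Elements are joined with a space, so 'a b' corresponds to ab in the shorthand examples.

```python
from collections import defaultdict


def merge_common_runs(rows, min_run=2):
    """Merge runs shared by 2+ sublists into single strings (hypothetical helper)."""
    result = [[] for _ in rows]

    def process(indices, pos):
        # Group the sublists by the element they have at position `pos`.
        groups = defaultdict(list)
        for i in indices:
            if pos < len(rows[i]):
                groups[rows[i][pos]].append(i)

        for members in groups.values():
            # Advance `end` while every sublist in the group has the same element.
            end = pos
            while (all(end < len(rows[i]) for i in members)
                   and len({rows[i][end] for i in members}) == 1):
                end += 1

            if len(members) >= 2 and end - pos >= min_run:
                # Shared run that is long enough: emit it once per sublist,
                # then continue with what follows the run.
                chunk = ' '.join(rows[members[0]][pos:end])
                for i in members:
                    result[i].append(chunk)
                process(members, end)
            else:
                # Assumption: a lone sublist, or a shared run shorter than
                # `min_run`, just has its remaining elements joined.
                for i in members:
                    result[i].append(' '.join(rows[i][pos:]))

    process(range(len(rows)), 0)
    return result


example = [['a', 'b', 'c', 'd', 'e', 'g'],
           ['a', 'b', 'c', 'd', 'g', 'h'],
           ['a', 'b', 'h', 'h', 'i']]
print(merge_common_runs(example))
# [['a b', 'c d', 'e g'], ['a b', 'c d', 'g h'], ['a b', 'h h i']]
```

Because sublists are compared position by position rather than through sets, repeated elements inside a sublist (the h, h case) are handled naturally. Cases the examples do not cover, such as a one-element shared run followed by a longer shared run, may need a different rule.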