4

I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).

Example:

list = ['commercial van for movers', 'partial van for movers', 'commercial van for moving' ]
sents = pd.Series(list)

In this answer, the solution works fine but it captures part words and it returns the follow:

'ial van for mov'

The output should be

'van for'

I couldn't find a way to modify it to return the desired output

mallet
  • 2,454
  • 3
  • 37
  • 64
  • 1
    try splitting each string into a list of words and then using the solution you provided – bunji Nov 03 '17 at 16:00
  • 1
    @bunji It's not quite as easy. The solution uses `in` to test for substrings, but you can't use `in` to test for sub-lists. – tobias_k Nov 03 '17 at 16:18
  • You must precise what is longest now : max of word counts or longest sub-chain ? – B. M. Nov 03 '17 at 16:45

2 Answers2

6

The key is to modify to search by whole-word subsequences.

from itertools import islice

def is_sublist(source, target):
    slen = len(source)
    return any(all(item1 == item2 for (item1, item2) in zip(source, islice(target, i, i+slen))) for i in range(len(target) - slen + 1))

def long_substr_by_word(data):
    subseq = []
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0])-i+1):
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)

Demo:

>>> data = ['commercial van for movers',
...         'partial van for movers',
...         'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
Steven Rumbalski
  • 44,786
  • 9
  • 89
  • 119
  • Just joining the substring does not work. You could still find partial words. – tobias_k Nov 03 '17 at 16:42
  • Thanks for correcting me, i deleted my answer since your answer was accepted :) – Transhuman Nov 03 '17 at 17:00
  • Can you modify this to give back the longest sequence of whole words that might not be present in all of the input strings? I posted a question for this here: https://stackoverflow.com/questions/62149929/find-longest-sequence-of-common-words-from-list-of-words-in-python – Stelios M Jun 02 '20 at 10:30
0

You can created an ordered powerset of all of the subsequences of the first sentence, then search for each of those strings in the other sentences, removing the substrings are not found.

Lastly, you select the candidate substring with most spaces, and in the event of a tie, select the longest substring.

from itertools import combinations

mylist = ['commercial van for movers', 
          'partial van for movers', 
          'commercial van for moving' ]

s0 = mylist[0].split()

candidates = [' '.join(s0[slice(*c)]) for c in combinations(list(range(len(s0)+1)), 2)]
for s in mylist:
    for i,c in reversed(list(enumerate(candidates.copy()))):
        if not c in s:
            candidates.pop(i)

max(candidates, key=lambda x: (x.count(' '), len(x)))
# returns:
'van for'
James
  • 32,991
  • 4
  • 47
  • 70