Match string with a list to retrieve iteratively

Question

There are two lists, one consists of a character sequence of a sentence, and another consisting of words.

The objective is to match the items of a first list with the second for length of li_a (len(li_a)) times. The same matched words will be temporarily saved as candidates. After the final iteration process, the longest word will be chosen as our expected result, and appended into a new list.

Since there are 18 characters are in li_a, let assume the literation time is 18.

li_a = ['T','h','o','m','a','s','h','a','d','a','h','a','r','d','t','i','m','e']
li_words = ['a','The','Thomas','have','had','has','hard','hot','time','tea']

First, the first item in li_a is matched against li_words.

1. 'T' => li_words || li_a[0] => li_words
2. 'Th' => li_words || li_a[0]+li_a[1] => li_words
3. 'Tho' => li_words || li_a[0]+li_a[1]+li_a[2] => li_words
...
6. 'Thomas' => li_words || li_a[0]+..+li_a[5] => li_words (marks as candidate when the match is found)
...
18. 'Thomashadahardtime' => li_words || li_a[0]..li_a[17] => li_words

The above example shows how the first iterative process should be done. And it gives us with one candidate result which is Thomas. But then, the items of li_a from first 'T' to 's' (Thomas) will be deducted,

li_a = ['h','a','d','a','h','a','r','d','t','i','m','e']

and second iterative process like previous should be performed to retrieve next word.

Finally, our final result of a list should be like this:

final_li = ['Thomas','had','a','hard','time']

Attempt

Below attempt works for finding the longest match, but not for the iterative work and doesn't give the accurate result when there is a missing word in li_words

def matched_substring(li1, li2):
    new_li = []
    tmp = ''
    for a in li1:
        tmp += a
        count = 0
        for b in li2:
            if tmp == b:
                count += 1
        if count == 0:
            tmp1 = tmp.replace(a, '')
            new_li.append(tmp1)
            tmp = a
    if li2.__contains__(tmp):
        new_li.append(tmp) 
    return new_li

It returns as:['Thomas', 'h', 'a', 'd', 'a', 'h', 'a', 'r', 'd', 't', 'i', 'm']

THE CHARACTERS IN UNICODE

string_a = "['ဒီ|စစ်|ဆေး|မှု|ကို|သီး|ခြား|လွတ်|လပ်|တဲ့|ပု|ဂ္ဂို|လ်|တ|ဦး|က|ဦး|ဆောင်|ခိုင်း|တာ|ဟာ|လူ|ထု|အ|ကျိုး|အ|တွက်|ဖြစ်|တယ်|လို့|တ|ရား|ရေး|ဝန်|ကြီး|ဌာ|န|က|ထုတ်|ပြန်|တဲ့|ကြေ|ညာ|ချက်|ထဲ|မှာ|ဖေါ်|ပြ|ထား|ပါ|တယ်']"

To convert above String to List:

##Get rid of brackets & punctuation marks
strp_str = string_a.strip("[]")
strp_str = strp_str.strip("'")
##Now we achieve *li_a* 
li_a = strp_str.split('|')

Link to clipboard for li_words list: mm-words.txt

##Get all the words in List
read_words = open('mm-words.txt','r')
##Achieve them in List    
li_words = read_words.read().split('\n')

##Now run into function
print analyze(li_a, li_words)

Since the problem has equivalent subproblems, this is a great opportunity to explore recursion or dynamic-programming approaches. — wim, May 23 '17 at 07:09
I think it will be easy to do loop item in `li_words` and match item in `li_a` — Cheney, May 23 '17 at 07:11

score 2 · Answer 1 · answered May 23 '17 at 07:21

2

Maybe you can try to match the string generated by li_a, e.g.

>>> li_a = ['T','h','o','m','a','s','h','a','d','a','h','a','r','d','t','i','m','e']
>>> li_words = ['a','The','Thomas','have','had','has','hard','hot','time','tea']
>>>
>>> s = "".join(li_a)
>>> for i in sorted(li_words,key=lambda x:-len(x)):
...     if i in s:
...         s=s.replace(i,str(li_words.index(i))+",")
...
>>> [li_words[int(i)] for i in s[:-1].split(",")]
['Thomas', 'had', 'a', 'hard', 'time']

Hope this helps.

answered May 23 '17 at 07:21

McGrady

10,869
13
47
69

Thanks for your quick response, but there's an error stated as `ValueError: invalid literal for int() with base 10:` – Htet May 23 '17 at 08:26
My code works well, when you use `int(i)` make sure `i` is not empty string. – McGrady May 23 '17 at 09:53
I've tried your code. It works well for English alphabets, but it says `ValueError: invalid literal for int() with base 10:` for unicode characters. – Htet May 26 '17 at 07:40

ewcz · Accepted Answer · 2017-05-23T19:55:12.960

In order to obtain all possible solutions (in case there are also some "overlapping" words), one might first for each word find all start positions where this particular word might appear. Then the strategy would be to start at the beginning of the string, test all candidates which can appear here and move the corresponding number of characters forward in the string. At this new position, the process is repeated (thanks to the recursion, all combinations are explored). Once we reached the end of the string, a successful solution has been found.

li_a = ['T', 'h','o','m','a','s','h','a','d','a','h','a','r','d','t','i','m','e','a']
li_words = ['a','The','Thomas','have','had','has','hard','hot','time','tea','timea']


def analyze(li_a, li_words):

    s = ''.join(li_a)

    #for each word, find all its positions in s
    stat = {}
    for word in li_words:
        pos = s.find(word)
        while pos != -1:
            if not pos in stat:
                stat[pos] = []
            stat[pos].append(word)
            pos = s.find(word, pos + 1)

    solutions = []

    def check(idx, solution):
        if idx == len(s):
            solutions.append(solution)
            return

        #at position idx, test all candidates and call
        #itself recursively
        words = stat.get(idx, [])
        for word in words:
            check(idx + len(word), solution + [word])

    #start at the beginning
    check(0, [])

    return solutions

print(analyze(li_a, li_words))

EDIT:

I tested your Unicode input, it seems that string_a contains a word ဖေါ် which is missing in mm-words.txt. On the other hand, the proposed solution would fail in this case anyway. First, there was a mistake in pos = s.find(word, pos + len(word)), it's supposed to be pos = s.find(word, pos + 1) in order to find all overlapping occurrences of a particular word. More importantly, the abhorrent complexity demands (essentially exponential scaling) makes it unusable on such large input. The way to go is to adopt dynamic programming approach. In the example below, I take only first 10 words from string_a (saved into str.txt) in order to avoid the missing one:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def read_data(fname, delim, uniq = False):
    with open(fname, 'r') as F:
        data = F.read().split(delim)
        data = map(lambda s: s.decode('utf-8').strip(), data)
        data = filter(lambda s: s, data)
        if uniq:
            data = list(set(data))
    return data

li_a = read_data('str.txt', '|')
li_words = read_data('mm-words.txt', '\n', uniq = True)

def analyze(li_a, li_words):
    words = set(li_words)

    s = ''.join(li_a[0:10])
    N = len(s)

    solutions = [ [] for idx in range(0, N) ]
    S = [False for idx in range(0, N)]

    for i in range(0, N):
        flag = False
        if (s[0:i+1] in words):
            flag = True
            solutions[i].append([s[0:i+1], -1])

        else:
            for j in range(1, i+1):
                if S[j-1] and (s[j:i+1] in words):
                    #save the newly identified word and reference to solution
                    #previously found at location j-1
                    solutions[i].append([s[j:i+1], j-1])

                    #break #find only one solution
                    flag = True
        S[i] = flag

    splittings = []
    def assemble(pos, L):
        if pos == -1:
            splittings.append(L)
            return
        for w,idx in solutions[pos]:
            assemble(idx, [w] + L)

    assemble(N-1, [])
    return splittings

splittings = analyze(li_a, li_words)
for splitting in splittings:
    print(' | '.join(splitting).encode('utf-8'))

this produces:

ဒီ | စစ်ဆေး | မှု | ကို | သီးခြား | လွတ်လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီးခြား | လွတ်လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီးခြား | လွတ်လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီး | ခြား | လွတ်လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ်လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ်လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီးခြား | လွတ် | လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီးခြား | လွတ် | လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီးခြား | လွတ် | လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီးခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီးခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီးခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီး | ခြား | လွတ် | လ | ပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လ | ပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လ | ပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီးခြား | လွ | တ် | လ | ပ် | တဲ့
ဒီ | စစ်ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လ | ပ် | တဲ့
ဒီ | စစ် | ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လ | ပ် | တဲ့
ဒီ | စ | စ် | ဆေး | မှု | ကို | သီး | ခြား | လွ | တ် | လ | ပ် | တဲ့

I've tried the code and it works like a charm, but it returns empty list when I tried on Unicode characters. Any suggestions such situation? And thank you for your excellent work. — Htet, May 23 '17 at 15:31
my pleasure :) you are using Python2, I suppose? could you post the particular `li_a`/`li_words` that produce empty output? it's most likely related to how `len` is interpreted in such situations (i.e., number of bytes vs number of characters)... — ewcz, May 23 '17 at 15:41
Thanks for bearing with me. I extended the post with **THE CHARACTERS IN UNICODE** for detail. Thanks in advance :) — Htet, May 23 '17 at 17:09
@htetmyet I have updated the answer to address the test data, the original method was too slow for this so it was necessary to resort to slightly different strategy... — ewcz, May 23 '17 at 19:59
No words to express my feelings towards your effort. Couldn't imagine the struggle I stuck in without your help :D Much appreciated!! You just saved my day. Thank you :) — Htet, May 24 '17 at 05:30
I guess the reason it doesn't reply the full string is the length you set here `s = ''.join(li_a[0:10])`. Though I tried to set `10` to the length of `li_a` or just `s = ''.join(li_a)`, it returns invalid result. Any reason behind you set it to `0:10`? — Htet, May 24 '17 at 07:42
@htetmyet yes, I want to avoid the `ဖေါ်` which does not appear in the list of words you posted (so in that case it found no solution) — ewcz, May 24 '17 at 08:09

score 0 · Answer 3 · answered May 23 '17 at 07:17

Here you have a functional approach using itertools:

import itertools
li_a = ['T','h','o','m','a','s','h','a','d','a','h','a','r','d','t','i','m','e']
li_words = ['a','The','Thomas','have','had','has','hard','hot','time','tea']
res = list(itertools.imap(lambda x: max(itertools.ifilter(lambda y: x in y, li_words), key=len), li_a))
res
['Thomas', 'Thomas', 'Thomas', 'Thomas', 'Thomas', 'Thomas', 'Thomas', 'Thomas', 'hard', 'Thomas', 'Thomas', 'Thomas', 'hard', 'hard', 'time', 'time', 'Thomas', 'have']

The idea is that for each letter in li_a we filter in which words fo li_words that letter is, then just use max upon that collection using len for taking the largest one.

Here the matching zipped per letter:

zip(li_a, res)
[('T', 'Thomas'), ('h', 'Thomas'), ('o', 'Thomas'), ('m', 'Thomas'), ('a', 'Thomas'), ('s', 'Thomas'), ('h', 'Thomas'), ('a', 'Thomas'), ('d', 'hard'), ('a', 'Thomas'), ('h', 'Thomas'), ('a', 'Thomas'), ('r', 'hard'), ('d', 'hard'), ('t', 'time'), ('i', 'time'), ('m', 'Thomas'), ('e', 'have')]

Match string with a list to retrieve iteratively

3 Answers3