1

I need to split a string up into all possible ways without changing the order of the characters. I understand this task can be seen as tokenization or lemmatization in NLP but i am attempting it from a pure string search perspective which is simpler and more robust. Given,

dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"

Task 1: How do i generate all possible substrings such that i get:

all_possible_substrings = [['f','iretrainstation'],
['fo','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']

Task 2: Then from all_possible_substring, how could i check through to see and say that the substring set that contains all elements from the dictionary is the correct output. The desired output would be a list of substrings from the dictionary that matches the most number of characters from left to right. Desired output is where:

"".join(desire_substring_list) == str1 and \
[i for i desire_substring_list if in dictionary] == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language 
#just written without spaces)

Desired output:

'fire','train','station'

What have I done?

For task 1, i have done this but i know it wouldn't give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1))

I've done this but this only does task 2:

import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fo','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for i in all_possible_string:
  x = pattern.findall(" ".join(i))
  if "".join(x) == str1 and len([i for i in x if i in seed]) == len(x):
    print " ".join(x)
alvas
  • 115,346
  • 109
  • 446
  • 738
  • Note that your dictionary is actually a `list`. – mgilson Mar 19 '13 at 01:24
  • Also, I'm pretty sure you need to do a little more explanation. Why is `'foo','bar','bar','str' the desired output? – mgilson Mar 19 '13 at 01:25
  • updated on the desired output. – alvas Mar 19 '13 at 01:35
  • is it clearer in this context? – alvas Mar 19 '13 at 01:53
  • How do you get `str1` from `dictionary`? And I might be misunderstanding, but wouldn't the "list of substrings from the dictionary that matches the most number of characters from left to right" always be `str1` minus the last letter? (Assuming you don't want the whole string.) – toxotes Mar 19 '13 at 01:55
  • @toxotes, str1 is just a concated string without whitespaces, it happens because some languages just hates to use whitespaces. dictionary comes from a real dictionary of human language word, not to be confused with python's dictionary. No, i dont think the matches will always be `len(str1[:-1])` – alvas Mar 19 '13 at 01:58
  • Oh, I think I see. You want all possible partitions of `str1`, and then find the partition with the best match to `dictionary`, is that right? – toxotes Mar 19 '13 at 01:59
  • yep. in english context, it looks clearer. – alvas Mar 19 '13 at 01:59
  • Well, I think this might get tricky. If I remember my combinatorics right, there are 2^n-1 ways to partition a set of n objects (letters, here). Going by http://stackoverflow.com/questions/855191/how-big-can-a-python-array-get a list can hold about 530 million elements, and you'll hit that when `str1` is only 30 letters. – toxotes Mar 19 '13 at 02:22
  • @toxotes yes 2^n-1 is right. But using a generator (see my answer below) there is no need to store all values in a list. Every combination can be matched against the desired output when looping over the generator. – jurgenreza Mar 19 '13 at 15:05

1 Answers1

3

For the first part you could write a recursive generator similar to this:

>>> def all_substr(string):
    for i in range(len(string)):

        if i == len(string) - 1:
            yield string

        first_part = string[0:i+1]
        second_part = string[i+1:]

        for j in all_substr(second_part):
            yield ','.join([first_part, j])


>>> for x in all_substr('apple'):
    print(x)


a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple
jurgenreza
  • 5,856
  • 2
  • 25
  • 37