
Given a string, e.g. 'i am a string'.

I can generate the n-grams of this string like so, using the nltk package, where n varies over a specified range.

from nltk import ngrams 

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)

Gives the output:

('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')

Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the commenter below, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2)?

[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]

The question is similar to this one, though with an additional layer of complexity.

Working solution - adapted from here.

I have a working solution, but it's really slow for longer strings.

import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    # collect every ngram of length min_ .. max_-1 as a space-joined string
    token_lst = []
    for n in range(min_, max_):
        for grams in ngrams(s.split(), n):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    # brute force: try every permutation of every length and keep those that rebuild s
    for len_ in range(1, len(s.split()) + 1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
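
For reference, to_sum_k('a b c') should print the four valid segmentations, each as a tuple of space-joined grams:

('a b c',)
('a', 'b c')
('a b', 'c')
('a', 'b', 'c')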
Ian
  • Does this answer your question? [Find all combinations of \`n\` positive numbers adding up to \`k\` in Python?](https://stackoverflow.com/questions/62344469/find-all-combinations-of-n-positive-numbers-adding-up-to-k-in-python) – norok2 Oct 03 '20 at 23:34
  • Basically you use that and a cumulative sum to find out all possible splits of your input, and that is equivalent to what you are looking for. – norok2 Oct 03 '20 at 23:36
  • "akin to the below" is not very precise; you should try to make your problem descriptions as explicit as possible. Do you mean "divide the sentence into consecutive word sequences where each sequence has a maximum length of `k` (in this case `k` is 2)"? Or did you specifically mean divide the sentence into unigrams qnd bigrams in all possible ways? – rici Oct 04 '20 at 01:16
  • Hi @rici, you've hit the nail on the head. I've updated the question to be more specific. – Ian Oct 04 '20 at 08:53
  • @norok2 while this solution would work, I'm trying to avoid brute forcing this – Ian Oct 04 '20 at 08:53
  • @Andy what do you mean by "brute forcing"? The accepted solution from the question I suggested does not filter out anything. It produces only the "correct" ones. Of course, you do need to modify it to include the additional constraint you have on the maximum value. – norok2 Oct 04 '20 at 10:09
  • Ah I misinterpreted the solution. – Ian Oct 04 '20 at 16:12
  • Though I'm struggling to adapt this to my use case – Ian Oct 04 '20 at 16:18
  • In your sample-code it looks like the original string is already known and you're simply looking for a way to check whether there is a subset of ngrams that form the original string. Is that actually the question or are only the ngrams given and the question is to determine whether there exists a unique possible sentence formed of these? The former can be solved fairly easily with a range-tree. The latter is... well, you've seen my answer. –  Oct 07 '20 at 11:59
  • Hi Paul, sorry for not making this clear, and thanks for your answer below. The original string is known. I'm looking to find all the combinations of n-grams (for say, n∈{1,2,3}) which, when put back together can reconstruct the original string. For example, given the string `'hello world'`, the output combinations I want would be `['hello', 'world']`, and `['hello world']`, but *not* `['world', 'hello']`, `['hello']`, or `['world']`. – Ian Oct 07 '20 at 12:07
  • Can you expand, or point me somewhere, for how I'd use a range tree to solve this problem? – Ian Oct 07 '20 at 12:11
  • @Andy my original answer assumed that we don't know the original string. I'll add a new answer –  Oct 07 '20 at 14:38
  • Correction for my previous comment: I meant a segment-tree, not a range-tree. –  Oct 07 '20 at 14:40
  • I just noticed another missing detail: are repetitions allowed? E.g. is `[("A"), ("A")]` a valid output for `"A A"` or only `[("A", "A")]`? –  Oct 08 '20 at 11:49
  • Thanks for the question - both your suggested outputs should be valid. i.e. in the case of the string `"A A"`, the expected outputs would be `[("A", "A")]` *and* `[("A"), ("A")]` – Ian Oct 08 '20 at 12:54

2 Answers


EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown, unique string from its ngrams. I'll leave it active for anyone interested in that problem. The answer to the problem as actually clarified in the comments is my second answer below.
EDIT END

In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be

[("a"), ("b"), ("a", "b"), ("b", "a")]

In this case, however, the set of strings that generate this set of ngrams is the set of all strings generated by

(ab(a|(ab)*a?))|(ba(b|(ba)*b?))

Or take n = 2, s = "a b c a b d a", where "c" and "d" may be ordered arbitrarily within the generating strings: e.g. "a b d a b c a" would also be a valid string. In addition, the same issue as above arises, and an arbitrary number of strings can generate the same set of ngrams.
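
To make the ambiguity concrete, here is a quick check (reusing nltk's ngrams, as in the question; the gram_set helper is only for this illustration) that two different strings from the first example produce exactly the same set of ngrams:

from nltk import ngrams

def gram_set(s, n):
    # all ngrams of s up to length n, as a set (order and multiplicity ignored)
    words = s.split()
    return {g for length in range(1, n + 1) for g in ngrams(words, length)}

# True: "a b a b" and "a b a b a" cannot be told apart from their ngrams alone
print(gram_set('a b a b', 2) == gram_set('a b a b a', 2))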

That being said, there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider the set of ngrams as a description of a non-deterministic state-machine. Each ngram defines a chain of states where the individual words are the transitions. As an example, for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state-machine:

0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6

Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the individual state-transitions of this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.
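
As a minimal sketch of just this construction step (determinization and the acyclic-path check are not shown; the helper name build_nfa and the exact state numbering are my own):

def build_nfa(ngram_list):
    # one chain of states per ngram; transitions that already exist are shared
    transitions = {}   # (state, word) -> next state; state 0 is the common start
    next_state = 1
    for ng in ngram_list:
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

print(build_nfa([("a", "b", "c"), ("c", "d"), ("a", "d", "b")]))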


While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer deals with the problem of finding all ways to construct a given string using its ngrams.

Assuming repetitions are allowed, the solution is fairly simple: generate all number sequences that sum to the length of the original string, with no number larger than n, and use these to cut the string into ngram combinations:

import numpy

def generate_sums(l, n, intermediate):
    # yield all sequences of numbers in 1..n that sum to l, appended to `intermediate`
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)  # cut positions: leading 0 plus running totals
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
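
For example, for the question's string and n = 2 this yields exactly the five splits listed in the question:

for split in combinations('i am a string', 2):
    print(split)

Which prints:

[['i'], ['am'], ['a'], ['string']]
[['i'], ['am'], ['a', 'string']]
[['i'], ['am', 'a'], ['string']]
[['i', 'am'], ['a'], ['string']]
[['i', 'am'], ['a', 'string']]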

EDIT:
As pointed out by @norok2 in the comments (thanks for the work), for this use case it seems to be faster to use an alternative cumsum implementation instead of the one provided by numpy.
END EDIT
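
For reference, a numpy-free variant along those lines could use itertools.accumulate; this is one possible replacement, not necessarily the exact version benchmarked in the comments:

from itertools import accumulate

def combinations_no_numpy(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = list(accumulate(c))  # plain-Python cumulative sum
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]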

If repetitions are not allowed, things get a bit trickier. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences from traversals of the automaton:

from nltk import ngrams

def build_state_machine(s, n):
    # trie-like NFA over all ngrams of the string, for every length 1..n
    # (the shorter grams are needed too, e.g. for words that only appear near the end)
    next_state = 1
    transitions = {}
    for length in range(1, n + 1):
        for ng in ngrams(s.split(' '), length):
            state = 0
            for word in ng:
                if (state, word) not in transitions:
                    transitions[(state, word)] = next_state
                    next_state += 1

                state = transitions[(state, word)]

    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    # search states: (automaton state, terminal states already used,
    #                 completed ngrams so far, words of the current partial ngram)
    states = [(0, set(), [], [])]

    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            # extend the current ngram with this word
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))
            # close the current ngram here and start a new one with this word (a red connection);
            # each terminal state may be used only once, so no ngram is repeated
            if state != 0 and state not in term_visited and (0, word) in transitions:
                new_states.append((transitions[(0, word)], term_visited | {state},
                                   path + [tuple(cur_elem)], [word]))

        states = new_states

    # close the final ngram; drop branches that would reuse an already used ngram
    return [path + [tuple(cur_elem)]
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]

As an example the following state machine would be generated for the string "a b a":

[Figure: state machine built for "a b a"]

Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.
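
For instance, with the code above, list(combinations("a b a", 2)) returns the two splits in which no ngram is used twice:

[[('a', 'b'), ('a',)], [('a',), ('b', 'a')]]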

  • Great and insightful answer, thanks kindly! Only change I'd make is, I think `yield [s[l:u] for (l, u) in zip(cs, cs[1:])]` should be changed to `yield [words[l:u] for (l, u) in zip(cs, cs[1:])]` :D – Ian Oct 08 '20 at 21:55
  • I think that the explicit computation of the cumulative sum is less efficient (and requires importing numpy) than a manual looping solution. My [quick tests](https://colab.research.google.com/drive/1xJGihx6eq36MIiy5SuwQC2MsgCm9u83F?usp=sharing) indicate a factor 3 speed-up. – norok2 Oct 09 '20 at 11:21
  • @Andy thanks for pointing that out. I'll correct it. –  Oct 09 '20 at 14:23
  • @norok2 thanks for doing that work, I'll add it to my answer. I didn't expect that actually; seems rather odd. Sidenote: In general these snippets are rather intended to show the basic algorithms in a (hopefully) understandable way than to be performant. There's plenty of room left for optimization in both pieces of code. –  Oct 09 '20 at 14:36
  • @norok2 kind thanks for prepping and running those tests - really useful and interesting – Ian Oct 09 '20 at 15:35