
I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and a list of rules:

  1. a b -> b
  2. a b -> c
  3. b a -> a
  4. c c -> b

The purpose is to get all single characters that can be obtained from the given string using these rules. For this example the result is b, c (two possible derivations are shown below). The length of the given string can be up to 200 symbols. Could you suggest an efficient algorithm?
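
For instance, both results can be reached like this (the applied rule is shown in parentheses):

  a b a b -> (1) b a b -> (3) a b -> (1) b
  a b a b -> (1) b a b -> (3) a b -> (2) c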

The rules are always 2 -> 1. My idea was to build a tree whose root is the given string and whose children are the strings obtained after one transformation, but I'm not sure that's the best approach.

  • does the rule always have a 2 to 1 mapping? what is your best solution? the shortest string? the string with the most identical letters? – cerkiewny May 08 '15 at 20:31
  • You need to show us some effort. Have you considered how you would approach this problem? Give us your thoughts. *Try something.* When you get stuck, post showing us what you've tried and explain what the problem is. – Jim Mischel May 08 '15 at 20:32
  • And what do you mean with [DP](http://www.acronymfinder.com/Information-Technology/DP.html)? – Bas Swinckels May 08 '15 at 20:35
  • @Bas probably dynamic programming – Dialecticus May 08 '15 at 20:38
  • @Dialecticus I thought so, but I hate to be guessing. There is even a [tag](http://stackoverflow.com/questions/tagged/dynamic-programming) for that ... – Bas Swinckels May 08 '15 at 20:39
  • @Bas DP is dynamic programming. Sorry for mistake, I've already added the tag. – user2875945 May 08 '15 at 20:55
  • @cerkiewny yes, the rule is always 2 to 1. There is no best solution, purpose is to get all possible solutions. – user2875945 May 08 '15 at 20:56
  • Why is `a b -> ..` twice in your list? Also: is every output character written directly, i.e., not processed anymore, or do you keep it as the next 'first' character and restart at the top? – Jongware May 08 '15 at 21:09
  • @Jongware 'a b' can be transformed to 'b' or to 'c', so there are two rules. I'm not sure I've understood your second question right, but it doesn't matter if 'b' can be received in 2 different ways, only thing that matters is that it can be received. – user2875945 May 08 '15 at 21:17
  • In your example `a b a b`, the first transformation is Rule 1: `a b -> b` (or Rule 2: `a b - > c` - but how would you know which one to use?). Does that lead to the new string `b a b` ( and use Rule 3) or is `b` immediately "consumed", leaving `a b` as next input (and use Rule 1 again)? – Jongware May 08 '15 at 21:24
  • @Jongware Using Rule 1 on first letters of `a b a b` leads to string `b a b` – user2875945 May 08 '15 at 21:28
  • Wouldn't CKY work for this essentially unmodified? – harold May 08 '15 at 22:22

3 Answers


For a DP problem, you always need to understand how to construct the answer for a big problem from the answers to smaller sub-problems. Assume you have a function simplify which is called with an input of length n. There are n-1 ways to split the input into a first and a last part. For each of these splits, recursively call simplify on both the first part and the last part. The answer for the input of length n is then the set of all characters that some rule produces from a combination of an answer for the first part and an answer for the last part.

In Python, this can be implemented like so:

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@memoize  # requires a memoization decorator, e.g. functools.lru_cache (see sketch below)
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)

    possible_chars = set()

    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]

        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1+c2, set()))

                # speed hack
                if possible_chars == all_chars: #  won't get any bigger
                    return all_chars

    return possible_chars

Quick check:

In [53]: simplify('abab')
Out[53]: {'b', 'c'}

To make this fast enough for large strings (and to avoid exponential behavior), you should use a memoize decorator. This is a critical step in solving DP problems, otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == all_chars, since at that point you are already sure that you can generate all possible outcomes.
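
The memoize decorator is not defined above; a minimal dictionary-based version (roughly equivalent to functools.lru_cache(maxsize=None) in Python 3) could look like this:

from functools import wraps

def memoize(func):
    """Cache results keyed by the function's (hashable) positional arguments."""
    cache = {}
    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper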

Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
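
The same recurrence can also be written as an explicit bottom-up table (essentially the CKY-style DP mentioned in the comments). Here is only a sketch of that alternative formulation, reusing the rules dictionary defined above; the name simplify_bottom_up is just for illustration:

def simplify_bottom_up(s, rules):
    """reach[i][j] = set of single characters that s[i:j] can be reduced to."""
    n = len(s)
    reach = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n):
        reach[i][i + 1] = {s[i]}  # a single character reduces to itself
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # substring start
            j = i + length
            for k in range(i + 1, j):       # split point
                for c1 in reach[i][k]:
                    for c2 in reach[k][j]:
                        reach[i][j] |= rules.get(c1 + c2, set())
    return reach[0][n]

simplify_bottom_up('abab', rules) again gives {'b', 'c'}, and the three nested index loops make the O(n^3) bound explicit.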

Bas Swinckels
  • I'm not good at Python and have a couple of questions. Does `head = s[:i]; tail = s[i:]` split the string into two parts at index i, or take i chars from both ends of the string? And what does `possible_chars.update(rules.get(c1+c2, set()))` do? – user2875945 May 09 '15 at 00:37
  • 1) that is slice notation, `s[:i] == s[0:i]`, which corresponds to the first `i` characters, and `s[i:] == s[i:len(s)]` takes all but the first `i` characters. So for an input `abcd`, the loop over `i` would split this into `head, tail = 'a','bcd'`, `head, tail = 'ab', 'cd'` and `head, tail = 'abc', 'd'`. – Bas Swinckels May 09 '15 at 06:30
  • 2) That is a compact way for: concatenate chars `c1` and `c2`, look up the possible set of answers for this combination in the rules dictionary, and finally make `possible_chars` the [union](http://en.wikipedia.org/wiki/Union_%28set_theory%29) of itself and possible set of answers from the rules. The `get(key, default)` method on a dictionary looks up a key in a dictionary, and if it is not found returns a default value. I use this to return an empty set `set()` in case the 2-letter combination is not in the rules, so the union with that does nothing. – Bas Swinckels May 09 '15 at 06:38
  • A more explicit way would be to write `if c1+c2 in rules: possible_chars = possible_chars.union(rules[c1+c2])`. – Bas Swinckels May 09 '15 at 06:45
  • For the completeness of the code I think it's good to add a definition of `memoize`. Or it's taken from some Python library that I'm not aware of? – Adam Stelmaszczyk May 09 '15 at 22:11

If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and they have essentially the same meaning. You could apply a chart-parsing algorithm such as the Earley algorithm to your data, along with a suitable starting rule; something like

start <- start a
       | start b
       | start c

and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley parsers are fairly efficient in practice.
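
For example, reading each rule right to left turns the produced character into the left-hand side of a production (each character doubles as both a terminal and a nonterminal here):

b <- a b
c <- a b
a <- b a
b <- c c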

You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.

Jay Kominek

Let N be the length of the given string and R the number of rules.

Expanding a tree in a top-down manner is exponential in the worst case, with roughly O((NR)^N) nodes (consider an input of the form aaa... together with the rule aa -> a).

Proof:

The root of the tree has at most (N-1)R children (at most N-1 positions where a rule can apply, at most R rules per position), each of those has at most (N-2)R children, and so on, so level k contains at most ((N-1)R)^k nodes. Summing over at most N levels gives O(((N-1)R)^(N-1)) = O((NR)^N) nodes in total, since a geometric series is dominated by its largest term. So the naive approach is exponential in N.

Recursive Java implementation of this naive approach:

public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);  // fully reduced to a single character
    }
    // try every adjacent pair and every rule that applies to it
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                // replace the pair with the produced character and recurse
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}

A Java port of Bas Swinckels's O(RN^3) approach (with a HashMap as the memoization cache):

public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    // try every split into a non-empty head and tail
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        // combine reachable characters of head and tail and look up the applicable rules
        for (String c1 : simplify2(head, rules, cache)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}

Output in both approaches:

[b, c]
Adam Stelmaszczyk
  • I am not very fluent in Java, but does your second implementation use memoization (i.e. caching the result for fixed inputs)? If not, it is not `O(n^3)`, but exponential. – Bas Swinckels May 09 '15 at 15:04
  • Thank you, both algorithms work perfectly, but I need a little help with optimization. The second algorithm is fast, but I need it to be a little faster. Maybe you could give me some advice on optimization, for example using a faster data structure or a more efficient loop? – user2875945 May 11 '15 at 04:27
  • @user2875945 I didn't implement Bas Swinckels's "speed hack" (look at his code); adding it should help with performance. – Adam Stelmaszczyk May 11 '15 at 11:15