Compressing string using ASCII char codes using backtracking in php

Question

I want to can compress a string using the chars ASCII codes. I want to compress them using number patterns. Because ASCII codes are numbers, I want to find sub-patterns in the list of ASCII char codes.

Theory

This will be the format for every pattern I found:

[nnn][n][nn], where:
- [nnn] is the ASCII code for first char, from group numbers with same pattern.
- [n] is a custom number for a certain pattern/rule (I will explain more below).
- [nn] shows how many times this rule happens.

The number patterns are not concretely established. But let me give you some examples:

same char
linear growth (every number/ascii is greater, with one, than previous)
linear decrease (every number/ascii is smaller, with one, than previous)

Now let's see some situations:

"adeflk" becomes "097.1.01-100.2.03-108.3.02"
- same char ones, linear growth three times, linear decrease twice.
"rrrrrrrrrrr" becomes "114.1.11"
- same char eleven times.
"tsrqpozh" becomes "116.3.06-122.1.01-104.1.01"
- linear decrease six times, same char ones, same char ones.

I added dots ('.') and dashes ('-') so you can see them easily.

Indeed, we don't see good results (compression). I want to use this algorithm for large strings. And adding more rules (number patterns) we increase changes for making shorter result than original. I know the existent compressing solutions. I want this solution because the result have only digits, and it helps me.

What I've tried

// recursive function
function run (string $data, array &$rules): string {
    if (strlen($data) == 1) {
        // str_pad for having always ASCII code with 3 digits
        return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
    }

    $ord = ord($data); // first char
    $strlen = strlen($data);
    $nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
    $result = '';

    // compares every rule
    foreach ($rules as $key => $rule) {
        for ($i = 1; $i < $strlen; $i++) {
            // check for how many times this rule matches
            if (!$rule($ord, $data, $i)) {
                // save the shortest result (so we can compress)
                if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
                || !$result) {
                    $result = $r;
                }
                continue 2; // we are going to next rule
            }
        }

        // if comes here, it means entire $data follow this rule ($key)
        if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
        || !$result) {
            $result = $r; // entire data follow this $rule
        }
    }

    return $result; // it will return the shortest result it got
}

// ASCII compressor
function compress (string $data): string {
    $rules = array( // ASCII rules
        1 => function (int $ord, string $data, int $i): bool { // same char
            return ($ord == ord($data[$i]));
        },
        2 => function (int $ord, string $data, int $i): bool { // linear growth
            return (($ord+$i) == ord($data[$i]));
        },
        3 => function (int $ord, string $data, int $i): bool { // progressive growth
            return ((ord($data[$i-1])+$i) == ord($data[$i]));
        },
        4 => function (int $ord, string $data, int $i): bool { // linear decrease
            return (($ord-$i) == ord($data[$i]));
        },
        5 => function (int $ord, string $data, int $i): bool { // progressive decrease
            return ((ord($data[$i-1])-$i) == ord($data[$i]));
        }
    );

    // we use base64_encode because we want only ASCII chars
    return run(base64_encode($data), $rules);
}

I added dots ('.') and dashes ('-') only for testing easily.

Results

compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"

Which is ok. And it runs fast. Without a problem.

compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded

If string is a bit longer, it gets toooo much. It works fast and normal for 1-7 chars. But when there are more chars in string, that happens.

The algorithm doesn't run perfect and doesn't return the perfect 6-digit pattern, indeed. Before getting there, I'm stucked with that.

Question

How I can increase performance of this backtracking for running ok now and also with more rules?

What's your goal? Finding minimal possible compression is NP-hard. Without any restrictions your code can't be optimized. If I were you I'd zip initial string, and then converted it to digits the same way it's converted to `base64`, or, if it looks simpler to you, converted it to `base64` and then converted it to digits **without any extra compression**. Making your own compression algorithms is hard and I doubt you'll be able to beat this method easily. Also may be interesting to you: https://stackoverflow.com/questions/3261685/what-is-the-maximum-theoretically-possible-compression-rate — x00, Jul 11 '20 at 10:40
My goal is to have a only digits result. What does NP-hard means? What restrictions I could apply to optimize my code? What do you mean by converting it to digits, with what method? — Valentin Tanasescu, Jul 11 '20 at 16:51
By the goal I mean: what is the reason to "have a only digits result" and to write your own compression algorithm? http://en.wikipedia.org/wiki/NP-hardness Some say "the hardest, most complex problems in computer science" Haven't checked it, but this looks like a function suitable for conversion https://stackoverflow.com/a/4848526/12610347 — x00, Jul 11 '20 at 18:30
@ValentinTanasescu, how is the result of running `compress("ana ar")` okay? What is the meaning of the results? It has 7 patterns, while the input is only 6 characters. — noam, Jul 12 '20 at 10:22
"The algorithm [doesn't run perfect] and doesn't return the perfect 6-digit pattern, indeed. Before getting there, I'm stucked with that." I want to use it for large strings and with more rules. — Valentin Tanasescu, Jul 12 '20 at 10:24
I suggest you change your code to count in each iteration the longest matches thus far. This way you can later decide to take the longest match for the output. — noam, Jul 12 '20 at 10:32
@ValentinTanasescu, can you explain in human language what "progressive increase" and "progressive decrease" mean? — noam, Jul 12 '20 at 10:36
With 'progressive' I mean the increase/decrease isn't always with 1. First is with 1, second is with 2, etc... — Valentin Tanasescu, Jul 12 '20 at 10:39
@Noam, I don't know how to do your advice about changing the code. Can you help me please? — Valentin Tanasescu, Jul 12 '20 at 10:40
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217685/discussion-between-valentin-tanasescu-and-noam). — Valentin Tanasescu, Jul 12 '20 at 11:03
I just wanted to point out that using `base64_encode()` will pretty much guarantee that your results will always have the pattern `x.1.y` as your own results show. You aren't converting to ASCII, you are re-encoding in a more verbose plane and actually increasing the size of your strings, first. For instance, `ana ar` is 6 characters but in base64 it becomes `YW5hIGFy` which is 8 characters. You second example is 8 characters normally but 12 when base64 encoded. — Chris Haas, Jul 14 '20 at 13:24
Each ASCII char is 7bit so it means you have 1 bit free in each byte! did you think you can use this free bit ?! the idea is you can use the next char byte bit in previous free bit, so in next byte you have 2 free bit and it means in every 7 ASCII char you will save one char! for example in 200 char you will save 28 char! which is mean compressed string is only have length 172 char long — ermya, Jul 17 '20 at 02:54

Ext3h · Answer 1 · 2020-07-16T08:55:27.180

Searching for gradients / infix repetitions is not a good match for compressing a natural language. Natural language is significantly easier to compress using a dictionary based approach (both dynamic dictionaries bundled with the compressed data, as well as pre-compiled dictionaries trained on a reference set work), as even repeating sequences in ASCII encoding usually don't follow any trivial geometric pattern, but appear quite random when observing only the individual characters ordinal representations.

That said, the reason your algorithm is so slow, is because you are exploring all possible patterns, which results in a run time exponential in the input length, precisely O(5^n). For your self-set goal of finding the ideal compression in a set of 5 arbitrary rules, that's already as good as possible. If anything, you can only reduce the run time complexity by a constant factor, but you can't get rid of the exponential run time. In other terms, even if you apply perfect optimizations, that only makes the difference of increasing the maximum input length you can handle by maybe 30-50%, before you inevitably run into timeouts again.

@noam's solution doesn't even attempt to find the ideal pattern, but simply greedily uses the first matching pattern to consume the input. As a result it will incorrectly ignore better matches, but in return it also has only to look at each input character once only, resulting in a linear run time complexity O(n).

Of course there are some details in your current solution which make it a lot easier to solve, just based on simple observations about your rules. Be wary though that these assumptions will break when you try to add more rules.

Specifically, you can avoid most of the backtracking if you are smart about the order in which you try your rules:

Try to start a new geometric pattern ord(n[i])=ord(n[0])+i first, and accept as match only when it matched at least 3 characters ahead.
Try to continue current geometric pattern.
Try to continue current gradient pattern.
Try to start new gradient ord(n[i])=ord(n[0])+i, and accept as match only when it matched at least 2 characters ahead.
Try to start / continue simple repetition last, and always accept.

Once a character from input was accepted by any of these rules (meaning it has been consumed by a sequence), you will no longer need to backtrack from it or check any other rule for it, as you have already found the best possible representation for it. You still need to re-check the rules for every following character you add to to the sequence, as a suffix of the gradient rule may be required as the prefix for a geometric rule.

Generically speaking, the pattern in your rule set which allows this, is the fact that for every rule with a higher priority, no match for that rule can have a better match in any following rule. If you like, you can easily prove that formally for every pair of possible rules you have in your set.

If you want to test your implementation, you should specifically test patterns such as ABDHIK. Even though H is a match the currently running geometric sequence ABDH, using it as the starting point of the new geometric sequence HIK is unconditionally the better choice.

noam · Accepted Answer · 2020-07-13T16:15:35.653

I came up with a initial solution to your problem. Please note:

You will never get a sequence of just one letter, because each 2 consecutive letters are a "linear growth" with a certain difference.
My solution is not very clean. You can, for example combine $matches and $rules to a single array.
My solution is naive and greedy. For example, in the example adeflk, the sequence def is a sequence of 3, but because my solution is greedy, it will consider ad as a sequence of 2, and ef as another sequence of 2. That being said, you can still improve my code.
The code is hard to test. You should probably make use of OOP and divide the code to many small methods that are easy to test separately.

<?php

function compress($string, $rules, $matches) {
    if ($string === '') {
        return getBestMatch($matches);
    }
    $currentCharacter = $string[0];

    $matchFound = false;

    foreach ($rules as $index => &$rule) {
        if ($rule['active']) {
            $soFarLength = strlen($matches[$index]);
            if ($soFarLength === 0) {
                $matchFound = true;
                $matches[$index] = $currentCharacter;
            } elseif ($rule['callback']($currentCharacter, $matches[$index])) {
                $matches[$index] .= $currentCharacter;
                $matchFound = true;
            } else {
                $rule['active'] = false;
            }
        }
    }

    if ($matchFound) {
        return compress(substr($string, 1), $rules, $matches);
    } else {
        return getBestMatch($matches) . startNewSequence($string);
    }
}

function getBestMatch($matches) {

    $rule = -1;
    $length = -1;
    foreach ($matches as $index => $match) {
        if (strlen($match) > $length) {
            $length = strlen($match);
            $rule = $index;
        }
    }
    if ($length <= 0) {
        return '';
    }
    return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "\n";
}

function startNewSequence($string) {
    $rules = [
        // rule number 1 - all characters are the same
        1 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return $a === substr($b, -1);
            }
        ],
        // rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
        2 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return ord($a) === (1 + ord(substr($b, -1)));
            }
        ],
        // rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
        3 => [
            'active' => true,
            'callback' => function ($a, $b) {
                if (strlen($b) == 1) {
                    return ord($a) > ord($b);
                }
                $lastCharOrd = ord(substr($b, -1));
                $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                $currentOrd = ord($a);

                return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
            }
        ],
        // rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
        4 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return ord($a) === (ord(substr($b, -1)) - 1);
            }
        ],
        // rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
        // with each step.
        5 => [
            'active' => true,
            'callback' => function ($a, $b) {
                if (strlen($b) == 1) {
                    return ord($a) < ord($b);
                }
                $lastCharOrd = ord(substr($b, -1));
                $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                $currentOrd = ord($a);

                return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
            }
        ],
    ];

    $matches = [
        1 => '',
        2 => '',
        3 => '',
        4 => '',
        5 => '',
    ];

    return compress($string, $rules, $matches);
}

echo startNewSequence('tsrqpozh');

I tried hard to understand your solution and test it. Using echo in a function makes code hard to test. Can't you start from what I've tried and use it, please? — Valentin Tanasescu, Jul 13 '20 at 15:43
I edited my code to return the value instead of echoing it. I cannot really use your code. You said yourself that it is slow (I imagine an infinite loop or something) and does not yield correct results. — noam, Jul 13 '20 at 16:16

Compressing string using ASCII char codes using backtracking in php

Theory

What I've tried

Results

Question

2 Answers2