Modular run-length encoding

Question

Question

How to implement a run length encoding modulus n>=1? For n=4, considering the inputAAABBBBABCCCCBBBDAAA, we want an output of [('D', 1), ('A', 3)]. Note the long-distance merging due to the modulus operation. See Explanation.

Explanation

The first occurance of BBBB encodes to (B, 4) whose modulus 4 is (B, 0), thus canceling itself out. See the diagram (ignore spaces; they are simply for illustrative purposes):

AAABBBBABCCCCBBBDAAA
A3  B4 ABCCCCBBBDAAA
A3  B0 ABCCCCBBBDAAA
A3     ABCCCCBBBDAAA
A4      BCCCCBBBDAAA
A0      BCCCCBBBDAAA
        BCCCCBBBDAAA
        ... 
                DA3

A simpler example when no merging happens since none gets canceled by modulus 4: input AAABABBBC produces output [('A',3),('B',1),('A',1),('B',3),('C',1)].

Requirements

Haskell implementations are preferred but others are welcome too!
Prefer standard/common library functions over 3rd party libraries.
Prefer readable and succint programs utilizing higher-order functions.
Prefer efficiency (do not loop over the whole list whenever unnecessary)

My program

I implemented this in Haskell, but it looks too verbose and awful to read. The key idea is to check three tuples at a time, and only advance one tuple forward if we can neither cancel out 0 tuples nor merge a pair of tuples among the three tuples at hand.

import Data.List (group)

test = [('A', 1), ('A', 2), ('B', 2), ('B', 2), ('A', 1), ('B', 1), ('C', 1), ('C', 3), ('B', 3), ('D', 1), ('A', 3)] :: [(Char, Int)]
expected = [('D', 1), ('A', 3)] :: [(Char, Int)]


reduce' :: [(Char, Int)] -> [(Char, Int)]
reduce' [           ] = []                                           -- exit
reduce' (   (_,0):xs) = reduce' xs
reduce' (x1:(_,0):xs) = reduce' (x1:xs)
reduce' (   (x,n):[]) = (x,n):[]                                     -- exit

reduce' (        (x1,n1):(x2,n2):[])              -- [previous,current,NONE]
  | x1 == x2  = reduce' ((x1, d4 (n1+n2)):[])
  | otherwise = (x1,n1):(                                         -- advance
                reduce' ((x2, d4     n2 ):[]))

reduce' ((x1,n1):(x2,n2):(x3,n3):xs)              -- [previous,current,next]
  | n3 == 0   = reduce' ((x1, d4  n1    ):(x2, d4  n2    ):xs)
  | n2 == 0   = reduce' ((x1, d4  n1    ):(x3, d4     n3 ):xs)
  | x2 == x3  = reduce' ((x1, d4  n1    ):(x2, d4 (n2+n3)):xs)
  | x1 == x2  = reduce' ((x2, d4 (n1+n2)):(x3, d4     n3 ):xs)
  | otherwise = (x1,n1):(                                         -- advance
                reduce' ((x2, d4  n2    ):(x3, d4     n3 ):xs)
                )

-- Helpers
flatten :: [(Char, Int)] -> String
flatten nested = concat $ (\(x, n) -> replicate n x) <$> nested

nest :: String -> [(Char, Int)]
nest flat = zip (head <$> xg) (d4 .length <$> xg)
  where xg = group flat

reduce = reduce' . nest
d4 = (`rem` 4)

Thoughts

My inputs are like the test variable in the snipped above. We could keep doing flatten then nest until its result doesn't change, and would definitely look simpler. But it feels it is scanning the whole list many times, while my 3-pointer implementation scans the whole list only once. Maybe we can pop an element from left and add it to a new stack while merging identical consecutive items? Or maybe use Applicative Functors? E.g. this works but not sure about its efficiency/performance: reduce = (until =<< ((==) =<<)) (nest . flatten).

What is the question here? It seems you're inviting to a discussion, more than asking a question. — Lasse V. Karlsen, Dec 27 '19 at 21:41
i'm asking for a better implementation than the one i privided, and sharing some of my thoughts as some may find contexts like that helpful. — hyiltiz, Dec 27 '19 at 21:58
Shall I remove my implementation and thoughts and only leave the question? — hyiltiz, Dec 27 '19 at 22:23
FWIW I think this is a reasonable question. People ask all the time "How do I do x?", and we ask in return "What have you tried?". This question asks how to do x, and says "Here is what I tried: I got an implementation that is clearly awful, and am looking for a good one instead." — amalloy, Dec 27 '19 at 22:34
Personally, I would implement this by using `group` (from `Data.List`) to group the input (e.g. `group "AAABBBBAB = ["AAA","BBBB","A","B"]`), and then map `fst &&& ((\`mod\` 4) . length)` over that to get the output (using `&&&` from `Control.Arrow`). Full program: `fmap (fst &&& ((\`mod\` 4) . length)) . group`. — bradrn, Dec 27 '19 at 23:29
Explanation: `group` groups the characters to get a list of substrings. Then `fmap` applies a function to each element of that list (i.e. on each substring), that function being `fst &&& ((\`mod\` 4) . length)`. This gets the first character of the substring (`fst`), gets the length mod 4 `((\`mod\` 4) . length)`, and puts both in a tuple (`&&&`). — bradrn, Dec 27 '19 at 23:35
Just tried @bradrn's solution after `:module + Control.Arrow` in GHCi only to get a type error. Besides, can it handle the long distance merging (in above example, the first `A` doesn't get `group`ed until the next `B` is merged then canceled)? — hyiltiz, Dec 27 '19 at 23:39
Sorry @hyiltiz — I didn’t notice the long distance merging. That makes this problem a lot more interesting! And I don’t see any easy solution once that’s taken into account. — bradrn, Dec 27 '19 at 23:43
You should include an example where merging does not happen, e.g. `AAABABBBC` does not yield `[('C',1)]` (right?). Otherwise, we could simply disregard the order of the characters, sort the strings, and compute from there. — chi, Dec 28 '19 at 00:33
This is really similar to the [Zuma game](https://leetcode.com/problems/zuma-game/) problem. — MikaelF, Dec 28 '19 at 05:10

amalloy · Accepted Answer · 2019-12-28T08:17:34.660

Algorithm

I think you are making this problem much harder by thinking of it in terms of character strings at all. Instead, do a preliminary pass that just does the boring RLE part. This way, a second pass is comparatively easy, because you can work in "tokens" that represent a run of a certain length, instead of having to work one character at a time.

The only data structure we need to maintain as we do the second pass through the list is a stack, and we only ever need to look at its top element. We compare each token that we're examining with the top of the stack. If they're the same, we blend them into a single token representing their concatenation; otherwise, we simply push the next token onto the stack. In either case, we reduce token sizes mod N and drop tokens with size 0.

Performance

This algorithm runs in linear time: it processes each input token exactly once, and does a constant amount of work for each token.
It cannot produce output lazily. There is no prefix of the input that is sufficient to confidently produce a prefix of the output, so we have to wait till we have consumed the entire input to produce any output. Even something that "looks bad" like ABCABCABCABCABC can eventually be cancelled out if the rest of the string is CCCBBBAAA....
The reverse at the end is a bummer, but amortized over all the tokens it is quite cheap, and in any case does not worsen our linear-time guarantee. It likewise does not change our space requirements, since we already require O(N) space to buffer the output (since as the previous note says, it's never possible to emit a partial result).

Correctness

Writing down my remark about laziness made me think of your reduce solution, which appears to produce output lazily, which I thought was impossible. The explanation, it turns out, is that your implementation is not just inelegant, as you say, but also incorrect. It produces output too soon, missing chances to cancel with later elements. The simplest test case I can find that you fail is reduce "ABABBBBAAABBBAAA" == [('A',1),('A',3)]. We can confirm that this is due to yielding results too early, by noting that take 1 $ reduce ("ABAB" ++ undefined) yields [(1, 'A')] even though elements might come later that cancel with that first A.

Minutiae

Finally note that I use a custom data type Run just to give a name to the concept; of course you can convert this to a tuple cheaply, or rewrite the function to use tuples internally if you prefer.

Implementation

import Data.List (group)

data Run a = Run Int a deriving Show

modularRLE :: Eq a => Int -> [a] -> [Run a]
modularRLE groupSize = go [] . tokenize
  where go stack [] = reverse stack
        go stack (Run n x : remainder) = case stack of
          [] -> go (blend n []) remainder
          (Run m y : prev) | x == y -> go (blend (n + m) prev) remainder
                           | otherwise -> go (blend n stack) remainder
          where blend i s = case i `mod` groupSize of
                              0 -> s
                              j -> Run j x : s
        tokenize xs = [Run (length run) x | run@(x:_) <- group xs]
λ> modularRLE 4 "AAABBBBABCCCCBBBDAAA"
[Run 1 'D',Run 3 'A']
λ> modularRLE 4 "ABABBBBAAABBBAAA"
[]

score 2 · Answer 2 · answered Dec 28 '19 at 05:56

My first observation will be that you only need to code one step of the resolution, since you can pass that step to a function that will feed it its own output until it stabilizes. This function was discussed in this SO question and was given a clever answer by @galva:

--from https://stackoverflow.com/a/23924238/7096763
converge :: Eq a => (a -> a) -> a -> a
converge = until =<< ((==) =<<)

This is the entrypoint of the algorithm:

--               |-------------step----------------------|    |---------------process------|   
solve = converge (filter (not . isFullTuple) . collapseOne) . fmap (liftA2 (,)  head length) . group

Starting from the end of the line and moving backwards (following the order of execution), we first process a String into a [(Char, Int)] using fmap (liftA2 (,) head length) . group. Then we get to a bracketed block that contains our step function. The collapseOne takes a list of tuples and collapses at most one pair of tuples, deleting the resulting tuple if necessary (if mod 4 == 0) ([('A', 1), ('A', 2), ('B', 2)] ==> [('A', 3), ('B', 2)]):

collapseOne [x] = [x]
collapseOne [] = []
collapseOne (l:r:xs)
  | fst l == fst r = (fst l, (snd l + snd r) `mod` 4):xs
  | otherwise          = l:(collapseOne (r:xs))

You also need to know if tuples are "ripe" and need to be filtered out:

isFullTuple = (==0) . (`mod` 4) . snd

I would argue that these 8 lines of code are significantly easier to read.

Glad to see a similar solution to the one I included in **Thoughts**. — hyiltiz, Dec 28 '19 at 05:59