8

Background: I have an implementation of a generic LZSS backend on C++ (available here. The matching algorithm I use in this version is exceedingly simple, because it was originally meant to compress relatively small files (at most 64kB) for relatively ancient hardware (specifically, the Mega Drive/Sega Genesis, where 64kB is the entirety of main RAM).

Nevertheless, some files take far too long to compress on my implementation, on the order of minutes. The reason is twofold: the naïve matching algorithm takes most of the time, but this happens specifically because I construct a compression graph from the file to achieve optimal compression. Looking on the profiler, most of the time is spent looking for matches, dwarfing even the quadratic size of the resulting graph.

For some time, I have been studying several potential replacements; one that drew my attention was dictionary-symbolwise flexible parsing using multilayer suffix trees. The multilayer part is important because one of the variants of LZSS I am interested in uses variable size encodings for (position, length).

My current implementation allows matches in the sliding window to overlap the look-ahead buffer, so that this input:

aaaaaaaaaaaaaaaa

can be directly encoded as

(0,'a')(1,0,15)

instead of

(0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8)

Here, (0,'a') means encoding character 'a' as a literal, while (1,n,m) means 'copy m characters from position n'.

The question: Having said all that, here is my problem: Every resource I found on suffix trees seem to imply that they can't handle the overlapping case, and instead only allows you to find non-overlapping matches. When suffix trees were involved, research papers, books and even some implementations gave compression examples without the overlap as if they were perfect compression (I would link to some of these but my reputation does not allow it). Some them even mentioned that overlaps could be useful when describing the base compression schemes, but were strangely silent on the matter when discussing suffix trees.

Since the suffix tree needs to be augmented to store offset information anyway, this seems like a property that could be checked while looking for a match — you would filter out any matches that start on the look-ahead buffer. And the way the tree is constructed/updated would mean that if an edge takes you to a node that corresponds to a match starting on the look-ahead, you return the previous node instead as any further descendants will also be on the look-ahead buffer.

Is my approach wrong or incorrect? Is there an implementation or discussion of LZ77/LZSS with suffix trees that mentions matches overlapping the look-ahead buffer?

flamewing
  • 89
  • 7
  • I didn't understand the first part. If you have a large file you are supposed to cut it in to smaller chunks, so each chunk is no more than 64kb. Therefore your simple implementation should not slow down exponentially as the file gets bigger. – Barmak Shemirani Jul 10 '15 at 20:04
  • I don't think it is so much slowing down exponentially as it is triggering the worst case; the simple matching is *O* (*mn*) in the worst case (m = length of pattern, n = length of searched text), and this is executed for every byte of the file, so I end up with *O* (*mnp*) — this except on the start and end of the file. The LZSS variant in question has a search buffer of 8192 bytes (so *n* = 8192) and a look-ahead buffer of 256 bytes (so *m* = 256). For a file of length *p*, the worst case would be *O* (2097152 *p*), which is more-or-less what is observed. – flamewing Jul 10 '15 at 20:20
  • For fun ways of doing this from other sources: https://github.com/mist64/pucrunch is pretty dang good. More info here: http://koti.kapsi.fi/a1bert/Dev/pucrunch/ – Michael Dorgan Jul 10 '15 at 22:46
  • That is an interesting approach. The koti link is giving a 'not found' error, by the way (but I did find it on Achive.org). – flamewing Jul 11 '15 at 17:05

1 Answers1

3

As I understand it, given a suffix tree, we are interested (roughly) in computing, for each suffix S, which longer suffix has the longest common prefix with S.

Add a reference from each tree node to the descendant leaf with the longest suffix (linear time with DFS). From each leaf, walk rootward, examining the new references, stopping if a longer suffix is found. The running time of the latter step is linear, because each tree edge is examined exactly once.

Life with a bounded window is unfortunately more difficult. Instead of propagating one reference, we propagate several. To compute the set of suffixes referenced from a node, we merge them in sorted order by length. Then whenever we have suffixes of lengths x > y > z, if x - z < 8192 (e.g.), then we can drop y, because all three are equally good for all suffixes with which the current node is the leafmost common ancestor, and if y is in window, then either x or z is. Since the window is a large fraction of the total file, each node will have at most a handful of references.

If you want to look back the shortest distance possible, then there's an O(n log^2 n)-time algorithm (probably improvable to O(n log n) with various hard-to-implement magic). In the course of the algorithm, we construct, for each node, a binary search tree of the descendant suffixes by length, then do next-longest lookups. To construct the node's tree from its childrens', start with the largest child tree and insert the elements from the others. By a heavy path argument, each length is inserted O(log n) times.

David Eisenstat
  • 64,237
  • 7
  • 60
  • 120
  • The use of suffix trees for LZ77 and LZSS is well stabilished in the literature; fro example, [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.1447&rep=rep1&type=pdf) gives a procedure to delete the longest suffix, which is one of the steps needed to slide the window; using a slightly modified version of Ukkonen's algorithm then allows adding the next character and maintaining all in-tree data necessary to slide the window. That said, your idea is interesting and worth looking into as well. – flamewing Jul 11 '15 at 17:03
  • @flamewing Yeah, I didn't actually look, but I figured that they were doing something like that. That's why they don't talk about the overlapping case -- the data structure needed to do that hasn't been built yet if the mode of operation is online. – David Eisenstat Jul 11 '15 at 17:19