
I have two large text files (CSV, to be precise). Both have exactly the same content, except that the rows in one file are in a different order than the rows in the other.

When I compress these two files (programmatically, using DotNetZip), I notice that one of the resulting archives is always considerably bigger; for example, one is ~7 MB bigger than the other.

My questions are:

How does the order of data in a text file affect compression, and what measures can one take to achieve the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms, and I'd appreciate a quick explanation on this subject.

Which algorithm handles this sort of scenario better, in the sense that it would achieve the best average compression regardless of the order of the data?
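A rough sketch of the grouping idea mentioned above (Python here for brevity rather than the C#/DotNetZip code; the file names are placeholders, and it assumes the consumer of the file does not care about row order):

    # Sketch: sort the CSV rows before compressing so that similar rows sit
    # close together inside the compressor's window. Assumes row order does
    # not matter downstream; "input.csv" / "sorted.csv.gz" are placeholders.
    import gzip

    with open("input.csv", "r", encoding="utf-8") as f:
        header, *rows = f.readlines()

    rows.sort()  # group similar rows next to each other

    with gzip.open("sorted.csv.gz", "wt", encoding="utf-8") as out:
        out.write(header)
        out.writelines(rows)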

Icarus

4 Answers


"How" has already been answered. To answer your "which" question:

The larger the window for matching, the less sensitive the algorithm will be to the order. However, all compression algorithms will be sensitive to some degree.

gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
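A quick way to see the effect is a sketch like the following (Python; the standard zlib, bz2 and lzma modules correspond roughly to gzip/DEFLATE, bzip2 and xz, and the row format is made up):

    # Sketch: compare how much row order matters for three algorithms with
    # very different window/block sizes. Sorted rows share long prefixes with
    # their neighbors; shuffled rows do not.
    import bz2, lzma, random, zlib

    rows = [("user%06d,%d,example-record\n" % (i, i % 97)).encode()
            for i in range(100_000)]
    shuffled = rows[:]
    random.shuffle(shuffled)

    for name, compress in (("zlib", zlib.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        print(name,
              "ordered:", len(compress(b"".join(rows))),
              "shuffled:", len(compress(b"".join(shuffled))))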

Mark Adler

In some sense, it is the entropy of the file that defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it is exactly the same data (although with a different "meaning"), and it will not compress as well.

In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
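A sketch of that experiment using Python's zlib (exact byte counts will differ depending on the library and compression level used):

    # Sketch: compress 100,000 ordered a-z characters, then the same
    # characters shuffled. The ordered version compresses to well under 1 KB;
    # the shuffled version stays in the tens of kilobytes.
    import random, string, zlib

    ordered = (string.ascii_lowercase * (100_000 // 26 + 1))[:100_000].encode()
    shuffled = bytearray(ordered)
    random.shuffle(shuffled)

    print("ordered: ", len(zlib.compress(ordered)))
    print("shuffled:", len(zlib.compress(bytes(shuffled))))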

Mark Wilkins
  • Thanks. Are there any algorithms you know of that are more efficient in scenarios where the data is shuffled? – Icarus Feb 14 '13 at 18:44
  • I do not know for sure. The only one I have really played with much is zlib (which I think is similar to the deflate algorithm from the old pkzip days). My guess is that generic algorithms would in general be affected approximately the same by "shuffled" data. I suspect it would require a highly specific algorithm to recognize certain cases. For example, if you knew up front that your data would be shuffled in a certain way, you could preprocess it prior to zipping/compressing it, then post-process after unzipping it. – Mark Wilkins Feb 14 '13 at 18:54
  • In your example, given that the characters a-z appear in equal numbers, but in random order, you need 5 bits instead of 8 to encode which character appears next. (Indeed, you have 26 characters, and five bits is enough to encode 32 different values.) So for 100000 characters, you need 100000 * 5 / 8 = 62500 bytes. Your 63582 bytes file is probably just that, plus some metadata and a table to indicate what character is represented by each 5-bit value. For more details, look up "Huffman coding". (See the quick check after these comments.) – Kris Vandermotten May 07 '21 at 14:18
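A quick check of those numbers (a small Python sketch; log2(26) ≈ 4.7 bits per character is the theoretical minimum for 26 equally likely letters, so the fixed 5-bit code in the comment is close to that floor):

    # Back-of-the-envelope sizes for 100,000 equally likely a-z characters:
    # a fixed 5-bit code vs. the Shannon lower bound of log2(26) bits each.
    import math

    n = 100_000
    print("fixed 5-bit code:   ", n * 5 / 8, "bytes")               # 62500.0
    print("entropy lower bound:", round(n * math.log2(26) / 8), "bytes")  # ~58755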

A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally; output a reference to that earlier chunk instead.

It surely helps when similar chunks are close together. The algorithm only keeps a limited amount of look-back data, to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that earlier chunk is too old, it may already have been flushed away.
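A toy illustration of that flushing effect, sketched with zlib's wbits parameter, which controls the size of the DEFLATE look-back window:

    # Sketch: the same data compressed with a large vs. a small DEFLATE window
    # (wbits=15 -> 32 KB window, wbits=9 -> 512-byte window). The repeated
    # block sits ~24 KB back, so only the large window can reference it.
    import os, zlib

    chunk = os.urandom(4096)                   # incompressible on its own
    data = chunk + os.urandom(20_000) + chunk  # the repeat is ~24 KB away

    for wbits in (15, 9):
        c = zlib.compressobj(9, zlib.DEFLATED, wbits)
        out = c.compress(data) + c.flush()
        print("window 2**%d bytes -> %d compressed bytes" % (wbits, len(out)))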

n. m. could be an AI

Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.

At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
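To see both extremes concretely, a small sketch (zlib standing in for any general-purpose compressor):

    # Sketch of the two extremes: a perfectly predictable stream compresses to
    # almost nothing, while truly random data does not compress at all (it even
    # grows slightly from framing overhead).
    import os, zlib

    predictable = b"abcdefgh" * 12_500    # 100,000 bytes of a fixed pattern
    random_data = os.urandom(100_000)     # 100,000 bytes with no structure

    print("predictable:", len(zlib.compress(predictable)))  # well under 1 KB
    print("random:     ", len(zlib.compress(random_data)))  # slightly over 100,000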

Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.

Not sure if this answers your question, but it addresses things a bit more theoretically.

Patrick87