9

I'm using an event-loop-based server in Twisted (Python) that stores files, and I'd like to be able to classify the files according to their compressibility.

If the probability that they'd benefit from compression is high, they would go to a directory with btrfs compression switched on, otherwise they'd go elsewhere.

I do not need to be sure: 80% accuracy would be plenty and would save a lot of disk space. But since there is also the CPU and filesystem performance cost, I cannot just save everything compressed.

The files are in the low megabytes. I cannot test-compress them in full without using a huge chunk of CPU and unduly delaying the event loop, or without refactoring a compression algorithm to fit into the event loop.

Is there any best practice for getting a quick estimate of compressibility? What I came up with is to take a small chunk (a few kB) of data from the beginning of the file, test-compress it (with a presumably tolerable delay), and base my decision on that.
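
Roughly what I have in mind, as an untested sketch (the chunk size and threshold are placeholder guesses, and the function names are mine; the zlib call would run via deferToThread so the reactor isn't blocked):

    import zlib

    from twisted.internet import threads

    SAMPLE_SIZE = 16 * 1024   # a few kB from the start of the file (placeholder)
    RATIO_THRESHOLD = 0.9     # placeholder: "compressible" if the sample shrinks by >10%

    def _sample_ratio(path):
        """Compress a small chunk from the start of the file; return compressed/original size."""
        with open(path, 'rb') as f:
            chunk = f.read(SAMPLE_SIZE)
        if not chunk:
            return 1.0
        return len(zlib.compress(chunk, 6)) / float(len(chunk))

    def looks_compressible(path):
        """Return a Deferred firing True if the file probably benefits from compression."""
        d = threads.deferToThread(_sample_ratio, path)
        d.addCallback(lambda ratio: ratio < RATIO_THRESHOLD)
        return d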

Any suggestions? Hints? Flaws in my reasoning and/or problem?

elpollodiablo
  • 135
  • 1
  • 4
  • 2
    Just to state the obvious, you didn't mention the compression algorithm you are planning to use. Having said that, I don't think there's anything you can do without inspecting the file at least once – Alexander Oct 07 '12 at 15:07
  • Why can't you use progressive compression? – Ignacio Vazquez-Abrams Oct 07 '12 at 15:09
  • Compressing a small part won't help: if the rest of the file is just made from copies of this part it will be easy to compress. I'm afraid the only good solution is to try compressing the whole file. – Denys Séguret Oct 07 '12 at 15:09
  • 3
    How about saving everything to an uncompressed directory, then migrating things into compressed directories in a background thread? The migrator could check that acceptable compression has been achieved before publishing the migration back to the other threads. – Tom Anderson Oct 07 '12 at 15:26
  • 1
    Related: http://stackoverflow.com/questions/7027022/how-to-efficiently-predict-if-data-is-compressible – leonbloy Nov 11 '13 at 02:09

3 Answers

12

Just 10K from the middle of the file will do the trick. You don't want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.

As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png, .jpg, .mov, .pdf, .zip, etc.

Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won't take long.
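
A minimal sketch of this approach, assuming zlib as the compressor (the 10K sample size is as described above; the compression level and the eventual cutoff are placeholders to tune by experiment):

    import os
    import zlib

    SAMPLE_SIZE = 10 * 1024

    def middle_sample_ratio(path, level=6):
        """Compress 10K from the middle of the file; return compressed/original size."""
        size = os.path.getsize(path)
        with open(path, 'rb') as f:
            f.seek(max(0, (size - SAMPLE_SIZE) // 2))
            chunk = f.read(SAMPLE_SIZE)
        if not chunk:
            return 1.0
        return len(zlib.compress(chunk, level)) / float(len(chunk))

    # Ratios well below 1.0 suggest the whole file will compress; pick the exact
    # cutoff by experimenting with your own files.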

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • 1
    With my test data 1K doesn't do it, but it seems like 10K is sufficient to give an estimate of what compression ratio can be reached with the whole file. But I'm still crunching numbers, so I'll get back to you :) – elpollodiablo Oct 07 '12 at 21:02
6

I think what you are looking for is How to calculate the entropy of a file?

This question contains all kinds of methods for calculating the entropy of a file (and from that you can get the 'compressibility' of a file). Here's a quote from the abstract of this article (Relationship Between Entropy and Test Data Compression, Kedarnath J. Balakrishnan, Member, IEEE, and Nur A. Touba, Senior Member, IEEE):

The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory.

And to be more constructive, check out this site for a Python implementation of entropy calculations on chunks of data.
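
For illustration, a zeroth-order (byte-frequency) Shannon entropy for a chunk of data can be sketched like this (my own sketch, not the implementation from the linked site):

    import math
    from collections import Counter

    def shannon_entropy(data):
        """Zeroth-order (byte-frequency) entropy of a byte string, in bits per byte."""
        if not data:
            return 0.0
        counts = Counter(data)
        total = float(len(data))
        return -sum((n / total) * math.log(n / total, 2) for n in counts.values())

    # Values close to 8.0 look incompressible; values well below 8.0 suggest the
    # chunk (and perhaps the file) should compress.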

zenpoy
  • 19,490
  • 9
  • 60
  • 87
  • Thanks for the literature! I didn't want to go the academic route, but it might be really interesting to do a test run with one or two entropy algorithms, compression of a small chunk of sample data, and compression of the whole file. I think I'll do that and come back with the results :) – elpollodiablo Oct 07 '12 at 16:01
  • Ok, so I gotta take away the checkmark again, because entropy (at least as computed by the function linked, but I'm no mathematician, so what do I know about alternatives ;) is not the way to go. I'll put further test data online, but for now it looks like actually using the compression algorithm on a small sample is more representative than a potential entropy correlation, which is way more fuzzy. – elpollodiablo Oct 07 '12 at 21:01
  • So you have to change the question :) you specifically asked for a way of knowing whether data will compress well without actually compressing it... – zenpoy Oct 07 '12 at 21:42
  • BTW - I just found this SO question: http://stackoverflow.com/questions/7027022/how-to-efficiently-predict-if-data-is-compressible It looks like you could try the approaches that produced good results there: 'ASCII checking', 'entropy calculation', and 'simplified compression'. – zenpoy Oct 07 '12 at 22:22
5

Compressed files usually don't compress well. This means that just about any media file is not going to compress very well, since most media formats already include compression. Clearly there are exceptions to this, such as BMP and TIFF images, but you can probably build a whitelist of well-compressed filetypes (PNGs, MPEGs, and venturing away from visual media - gzip, bzip2, etc) to skip and then assume the rest of the files you encounter will compress well.

If you feel like getting fancy, you could build feedback into the system (observe the results of any compression you do and associate the resulting ratio with the filetype). If you come across a filetype that has consistently poor compression, you could add it to the whitelist.

These ideas depend on being able to identify a file's type, but there are standard utilities which do a pretty good job of this (generally much better than 80%) - file(1), /etc/mime.types, etc.
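
A rough sketch of that idea, assuming the third-party python-magic package (libmagic bindings, the same engine behind file(1)) is available; the skip list of MIME types here is only illustrative:

    import magic  # third-party python-magic package (libmagic bindings)

    # Illustrative skip list: types that are almost always compressed already.
    ALREADY_COMPRESSED = {
        'image/png', 'image/jpeg', 'video/mp4', 'application/pdf',
        'application/zip', 'application/gzip', 'application/x-bzip2',
    }

    def probably_worth_compressing(path):
        """Sniff the MIME type from the file's first bytes and consult the skip list."""
        with open(path, 'rb') as f:
            head = f.read(4096)
        return magic.from_buffer(head, mime=True) not in ALREADY_COMPRESSED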

Jean-Paul Calderone
  • 47,755
  • 6
  • 94
  • 122
  • This would be the best solution if the beginning of a file (and thus the mime-type) were a given, which it is not. It's more like chunks of arbitrary data that might often be compressible. – elpollodiablo Oct 07 '12 at 15:59
  • Surely you must have some way to discover the beginning of a file from any given chunk of the file - otherwise how do you reconstruct the entire file? But if you really can't do this, then I guess this approach is out of the question (it certainly makes more sense as a solution for a *file server* than for an *arbitrary chunks of files* server; your question made it sound like you were dealing with the former :). – Jean-Paul Calderone Oct 07 '12 at 16:33
  • I'm sorry, they really are deconstructed files, as you guessed correctly; I should've included that to eliminate the mime-type possibility. The workflow doesn't allow for reconstruction on the fly, as it's a different part of the system that knows how to do that. – elpollodiablo Oct 07 '12 at 17:20