
I'm generating ~1 million text files containing arrays of doubles, tab delimited (these are simulation outputs for research). Example output below. I expect each million text files to total ~5 TB, which is unacceptable. So I need to compress.

However, all my data analysis will be done in matlab, and every matlab script will need to access all million of these text files. I can't decompress the whole million using C++ and then run the matlab scripts, because I lack the HD space. So my question is: are there some very simple, easy-to-implement algorithms or other ways of reducing my text file sizes, so that I can write the compression in C++ and read it in matlab?

example text file

0.0220874   0.00297818  0.000285954 1.70E-05    1.52E-07
0.0542912   0.00880725  0.000892849 6.94E-05    4.51E-06
0.0848582   0.0159799   0.00185915  0.000136578 7.16E-06
0.100415    0.0220033   0.00288016  0.000250445 1.38E-05
0.101889    0.0250725   0.00353148  0.000297856 2.34E-05
0.0942061   0.0256  0.00393893  0.000387219 3.01E-05
0.0812377   0.0238492   0.00392418  0.000418365 4.09E-05
0.0645259   0.0206528   0.00372185  0.000419891 3.23E-05
0.0487525   0.017065    0.00313825  0.00037539  3.68E-05

If it matters: the complete text files represent joint probability mass functions, so they sum to 1. And I need lossless compression.

UPDATE Here is an IDIOT'S guide to writing binary in C++ and reading it in Matlab, with some very basic explanation along the way.

C++ code to write a small array to a binary file.

#include <cstdio>   // fopen, fwrite, fclose live here
using namespace std;
int main()
{
    float writefloat;
    const int rows=2;
    const int cols=3;
    float JPDF[rows][cols];
    JPDF[0][0]=.19493;
    JPDF[0][1]=.111593;
    JPDF[0][2]=.78135;
    JPDF[1][0]=.33333;
    JPDF[1][1]=.151535;
    JPDF[1][2]=.591355;

JPDF is an array of type float that I save 6 values to. It's a 2x3 array.

    FILE * out_file;
    out_file = fopen ( "test.bin" , "wb" );

The first line declares a pointer to a FILE, named out_file. The second line, fopen, says make a new file for writing (the 'w' of the second parameter), and make it a binary file (the 'b' of "wb").

    fwrite(&rows,sizeof(int),1,out_file);
    fwrite(&cols,sizeof(int),1,out_file);

Here I encode the size of my array (# rows, # cols). Note that we pass fwrite the addresses of rows and cols, not the variables themselves (that's what the & does), because fwrite wants a pointer to the data. The second parameter is the size in bytes of each element; since rows and cols are both ints, I use sizeof(int). The third parameter, '1', says write one such element. And out_file is our pointer to the file we're writing to.

    for (int i=0; i<cols; i++)
    {
        for (int j=0; j<rows; j++)
        {
            writefloat=JPDF[j][i];
            fwrite (&writefloat, sizeof(float), 1, out_file);
        }
    }
    fclose (out_file);

    return 0;
}

Now I'll iterate through my array and write each value as raw bytes to my file. The indexing looks a little backwards in that I'm iterating down each column rather than across each row in the inner loop. We'll see why in a sec. Again, I'm writing the address of writefloat, which takes on the value of the current array element in each iteration. Since each array element is a float, I'm using sizeof(float) here instead of sizeof(int).

Just to be incredibly, stupidly clear, here's a diagram of how I think of the file we've just created.

[4 bytes: rows][4 bytes: cols][4 bytes: JPDF[0][0]][4 bytes: JPDF[1][0]] ...
[4 bytes: JPDF[1][2]]

..where each chunk of bytes is written in binary (0s and 1s).

To interpret in MATLAB:

FID=fopen('test.bin');
sizes=fread(FID,2,'int')

FID works like a handle to the open file (MATLAB calls it a file identifier). Then we use fread, which operates very similarly to C's fread. FID is our handle to our file. The 'int' tells the function how to interpret each chunk of bytes. So sizes=fread(FID,2,'int') says 'read 2 values of size int from FID, and return the 2 elements in vector form'. Now, sizes(1)=rows, and sizes(2)=cols.

s=fread(FID,[sizes(1) sizes(2)],'float')

The next part wasn't completely clear to me originally; I thought I'd have to tell fread to skip the 'header' of my binary file that contains the row/col info. However, MATLAB keeps track of the current position in the file, so this second fread picks up right where the first one left off. So now I read the rest of the binary file, using the fact that I know the dimensions of the array. Note that while the second parameter [M,N] is [rows,cols], fread fills the result in column order, which is why we wrote the array data in column order.

The one caveat is that I think I can only use the Matlab precisions 'int' and 'float' if the architecture of the C++ program is concordant with Matlab's (e.g., both are 64-bit, or both are 32-bit). But I'm not sure about this; using the explicitly sized precisions 'int32' and 'float32' should remove the ambiguity.

The output is:

sizes =

     2
     3

s =

         0.194930002093315         0.111593000590801         0.781350016593933
         0.333330005407333         0.151535004377365          0.59135502576828
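
One way to sidestep that architecture caveat is to write explicitly sized types on the C++ side and read them back with Matlab's sized precisions ('int32', 'float32'). A minimal sketch, where write_header is just an illustrative helper and not part of the program above:

// A sketch of the header write with explicitly sized types, so the byte
// widths don't depend on the architecture. int32_t is always 4 bytes, and the
// static_assert records the assumption that float is a 4-byte IEEE value.
// (write_header is just an illustrative helper, not part of the program above.)
#include <cstdint>
#include <cstdio>

static_assert(sizeof(float) == 4, "expecting 4-byte floats");

void write_header(std::FILE* out, std::int32_t rows, std::int32_t cols)
{
    // Read these back in Matlab with fread(FID,2,'int32').
    std::fwrite(&rows, sizeof(std::int32_t), 1, out);
    std::fwrite(&cols, sizeof(std::int32_t), 1, out);
}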
vector07
  • With respect to "I need lossless", by virtue of the fact that you are printing floating point numbers to some precision, in fact varying levels of precision, you are already in a state of sin. To what precision do you actually need the numbers? That will determine how much you can compress them beyond simply encoding them as binary floating point values. – Mark Adler Sep 11 '12 at 04:19
  • Gosh, I don't know. I think it's variable. I'm constructing very heterogeneous joint probability mass functions. If it's 5x5, I hardly need much in the way of precision, probably only a few decimal places. But if it's 100x50,000, I'd need ~10-15 significant digits. I'm not sure if that satisfactorily answers your question. – vector07 Sep 11 '12 at 17:58
  • I think maybe I see what you're getting at, after reading how a float is actually stored (http://en.wikipedia.org/wiki/Single_precision). If I wanted, say, 10 significant digits plus the order (decimal offset), I could probably make a new binary number that encodes my values and requires much less than 4 bytes. Is that what you're getting at? – vector07 Sep 11 '12 at 19:25
  • Yes. Since you know the number is between 0 and 1, you can dispense with the ability to store numbers from 10^-38 to 10^38 and the associated waste of space. Pick a power of two (call it 2^n), multiply each probability by that, round, and then store n bits. – Mark Adler Sep 11 '12 at 22:50
  • A four-byte float will only store about six decimal digits of mantissa, which by the way is what you printed in your example. If you really need 10 digits, then you would need to use a double instead of a float (8 bytes instead of 4 bytes), which will store about 15 decimal digits. If you need probabilities with 10 decimal digits, then you can multiply by 2^34 and store 34 bits per probability. That's less than the eight bytes needed for a double. – Mark Adler Sep 11 '12 at 22:52

3 Answers


To do better than four bytes per number, you need to determine to what precision you need these numbers. Since they are probabilities, they are all in [0,1]. You should be able to specify a precision as a power of two, e.g. that you need to know each probability to within 2^-n of the actual value. Then you can simply multiply each probability by 2^n, round to the nearest integer, and store just the n bits in that integer.

In the worst case, I can see that you are never showing more than six digits for each probability. You can therefore code them in 20 bits, assuming a constant fixed precision past the decimal point. Multiply each probability by 2^20 (1048576), round, and write out 20 bits to the file. Each probability will take 2.5 bytes. That is smaller than the four bytes for a float value.

And either way is way smaller than the average of 11.3 bytes per value in your example file.

You can get better compression even than that if you can exploit known patterns in your data. Assuming that there are any. I see that in your example, on each line the values go down by some factor at each step. If that is real and not just an artifact of the generation of the example, then you can successively use fewer bits for each sample. Also if the first sample is really always less than 1/8, then you can drop the top three bits off that one, since those bits would always be zero. If the second column is always less than 1/32, you can drop the first five bits off all of those. And so on. Assuming that the magnitudes in the example are maximums across all of the data sets (obviously not true, but just using that as an illustrative case), and assuming you need six decimal digits after the decimal point, I could code each row of six values in 50 bits, for an average of a little over one byte per probability.

And for one last smidgen of compression, since the values add to one, you don't have to store the last value.
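
For illustration, here is a minimal sketch of the 20-bit packing described above. The BitWriter helper and the sample values are not from the answer itself, just a demonstration of the idea; the real code would write the packed bytes to the .bin file rather than just counting them.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Accumulates values bit by bit and emits whole bytes.
class BitWriter
{
public:
    void write(uint32_t value, int nbits)
    {
        for (int i = nbits - 1; i >= 0; --i)            // most-significant bit first
        {
            buffer_ = (buffer_ << 1) | ((value >> i) & 1u);
            if (++count_ == 8)                          // a full byte is ready
            {
                bytes_.push_back(static_cast<uint8_t>(buffer_));
                buffer_ = 0;
                count_ = 0;
            }
        }
    }
    std::vector<uint8_t> finish()                       // pad the last partial byte with zeros
    {
        if (count_ > 0)
        {
            bytes_.push_back(static_cast<uint8_t>(buffer_ << (8 - count_)));
            buffer_ = 0;
            count_ = 0;
        }
        return bytes_;
    }
private:
    std::vector<uint8_t> bytes_;
    uint32_t buffer_ = 0;
    int count_ = 0;
};

int main()
{
    const int n = 20;                                   // bits per probability -> 2.5 bytes each
    const double scale = static_cast<double>(1u << n);  // 2^20
    const double probs[] = {0.0220874, 0.00297818, 0.000285954};

    BitWriter bw;
    for (double p : probs)
    {
        uint32_t q = static_cast<uint32_t>(std::lround(p * scale));  // quantize to 20 bits
        bw.write(q, n);
    }
    std::vector<uint8_t> packed = bw.finish();          // these bytes would go into the .bin file
    std::printf("packed %zu values into %zu bytes\n",
                sizeof(probs)/sizeof(probs[0]), packed.size());
    return 0;
}

On the Matlab side, fread's bit-level precisions (e.g. 'ubit20') can in principle read fields like this back, provided the bit ordering used by the writer matches what fread expects.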

Mark Adler
  • Sorry, I responded above without seeing your longer message below. This is really neat. So, there's no pattern that will hold across runs in the way you describe. However, it seems I should be able to define a custom float for my problem. The number of leading zeros will probably never exceed 15, so I could use the first 4 bits to encode the number of zeros between the decimal point and the start of the number (0-15), and the remaining 20 bits (2^20) to store 6 significant digits after those zeroes. This would require 24 bits, or 3 bytes, a 25% reduction. – vector07 Sep 11 '12 at 23:36
  • This gives me newfound appreciation for why there are little bits (excuse the pun) of numbers that seem to accumulate when doing numerical simulations. There must always be some arbitrarily set precision in order to do arithmetic due to the very nature of how numbers are represented in binary! – vector07 Sep 11 '12 at 23:42
  • Why do you want to hold on to so much precision for tiny numbers, when that precision is swamped by the uncertainty in larger (closer to one) numbers? For example, consider adding two probabilities. If you have 0.100415, you can consider that to be 0.100415xxxxxx where the x's are unknown digits and represent uncertainty in the value. If you add to that 0.0000000152667, then you are adding the precision in the second number to noise in the first number, making all that precision useless. A fixed precision after the decimal point makes more sense than a custom floating point format. – Mark Adler Sep 12 '12 at 05:26
  • Good point. Maybe I need more precision than 6 digits. I looked at a larger simulation, and found that values ranged from 10^-6 to 10^-13. In a 6 digit fixed point scheme the lowest numbers would all be rounded to 1 digit. But there are 12000 points in the distribution less than 10^-12. If they're rounded to 1-2 digits, I can't imagine the distribution will ever sum to 1. So I guess I need more precision than I thought. – vector07 Sep 12 '12 at 18:43
  • With floating point arithmetic it is unlikely for the probabilities to add up to exactly one, regardless of the precision. Consider the simple case of adding 0.333... three times. Wherever you stop writing the 3's, it never adds up to one. You have to accept round-off error in any case. – Mark Adler Sep 12 '12 at 19:15

Matlab can read binary files. Why not save your files as binary instead of text?

Saving each number as a float would only require 4 bytes; you could use doubles, but it appears that you aren't using the full double resolution. Under your current scheme, each character of every number consumes a byte of space. All of your numbers are easily 4+ characters long, some as long as 10 characters. Implementing this change should cut your file sizes down by more than 50%.

Additionally, you might consider using a more structured data format like HDF5, which both supports compression and is supported by Matlab.
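
For what it's worth, a minimal sketch of the HDF5 route using the high-level "HDF5 Lite" C API might look like the following. The file and dataset names are arbitrary, and this simple call does not by itself turn on HDF5's built-in deflate compression (that requires creating a chunked dataset through the lower-level API).

// Sketch: write the 2x3 example array to an HDF5 file with the high-level
// "HDF5 Lite" C API (compile against the HDF5 C and high-level libraries,
// e.g. -lhdf5_hl -lhdf5). Matlab can then read the dataset with something
// like h5read('test.h5','/JPDF').
#include <hdf5.h>
#include <hdf5_hl.h>

int main()
{
    const hsize_t dims[2] = {2, 3};                     // rows, cols
    const double JPDF[2][3] = {{0.19493, 0.111593, 0.78135},
                               {0.33333, 0.151535, 0.591355}};

    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    H5LTmake_dataset_double(file, "/JPDF", 2, dims, &JPDF[0][0]);
    H5Fclose(file);
    return 0;
}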

Update:

There are lots of examples of how to write binary files in C++; just google it. Additionally, to read a binary file in Matlab, simply use fread.

The difference between representing a number as ASCII vs. binary is really simple. All files are written using binary; the difference is in how that information gets interpreted. Text files are generally read using ASCII, which provides a nice mapping between an 8-bit value and a character. When you see a string like "255", what you have is an array of bytes where each byte encodes one character in the string. However, when you are storing numbers, it's really wasteful to store each digit using a different byte. A single byte can store values from 0 to 255. So why use three bytes to store the string "255" when I can use a single byte to store the value 255?
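
To make the byte counts concrete, here is a tiny sketch (not from the original answer) comparing the text and binary sizes of one value from the example data:

// Tiny illustration: the same number costs 11 bytes as text (one byte per
// character) but only 4 bytes as a raw float, or 8 as a double.
#include <cstdio>

int main()
{
    double value = 0.000285954;

    char text[32];
    int text_bytes = std::snprintf(text, sizeof(text), "%g", value);  // ASCII form

    std::printf("as text:   %d bytes (\"%s\")\n", text_bytes, text);
    std::printf("as float:  %zu bytes\n", sizeof(float));
    std::printf("as double: %zu bytes\n", sizeof(double));
    return 0;
}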

slayton
  • These suggestions sound like what I'm looking for. I have a dumb follow-up: how would I get started writing and reading binary files? And for bonus, can you explain why binary will end up being smaller than the way I'm doing it now? – vector07 Sep 10 '12 at 20:15
  • @vector07 Basically, writing decimal numbers in 8-bit ASCII uses less than 4 bits of information of an 8-bit byte (less than 50% of what's available). Writing dots and empty spaces wastes still more space. – nrz Sep 10 '12 at 21:53

You can always go ahead and zip everything using a standard library like zlib. Afterwards you could use a custom DLL written in C++ that unzips your data in chunks you can manage. So basically: Data --> Zip --> DLL (loaded by Matlab via loadlibrary) --> Matlab
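
A minimal sketch of the zlib side using the one-shot compress() API (error handling abbreviated; in practice you would also store the uncompressed length so the reader knows how large a buffer to hand to uncompress()):

// Sketch: compress one array of doubles in memory with zlib's one-shot API
// (link with -lz). The compressed bytes, plus the original length, would then
// be written to disk for the Matlab-side DLL to uncompress().
#include <cstdio>
#include <vector>
#include <zlib.h>

int main()
{
    std::vector<double> jpdf = {0.0220874, 0.00297818, 0.000285954, 1.70e-05, 1.52e-07};

    const Bytef* src = reinterpret_cast<const Bytef*>(jpdf.data());
    uLong src_len = static_cast<uLong>(jpdf.size() * sizeof(double));

    std::vector<Bytef> dest(compressBound(src_len));    // worst-case compressed size
    uLongf dest_len = static_cast<uLongf>(dest.size());

    if (compress(dest.data(), &dest_len, src, src_len) != Z_OK)
    {
        std::fprintf(stderr, "compression failed\n");
        return 1;
    }

    std::printf("raw: %lu bytes, compressed: %lu bytes\n",
                static_cast<unsigned long>(src_len),
                static_cast<unsigned long>(dest_len));
    return 0;
}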

rincewound