File format optimized for sparse matrix exchange

Question

I want to save a sparse matrix of numbers (integers, but they could be floats) to a file for data exchange. By sparse matrix I mean a matrix in which a high percentage of the values (typically 90%) are equal to 0. "Sparse" here refers to the actual content of the matrix, not to the file format.

The matrix is formatted in the following way:

        col1   col2   ....
row1  int1_1 int1_2   ....
row2  int2_1   ....   ....
....  ....     ....   ....

Stored as a tab-delimited text file, the matrix takes 4.2 GB. Which file format, preferably one as ubiquitous as a .txt file, can I use to easily load and save this sparse data matrix? We usually work with Python/R/Matlab, so formats supported by these are preferred.

How are you storing that sparse data? Note: if you use .txt or any other non-compressed data, then, obviously, the size won't change. — Ander Biguri, Feb 13 '18 at 16:10
Yep, txt is obviously the upper-bound for size, but it can be exchanged easily between different frameworks. I am looking for a compressed-like format that is supported by scientific libraries (e.g. pandas in python). — gc5, Feb 13 '18 at 16:15
I believe that all formats that can be saved/opened by MATLAB can be opened and saved by python — Ander Biguri, Feb 13 '18 at 16:15
`.txt` means plain text, but what text is stored exactly? A line for each entry of the matrix with row, column, value? — Luis Mendo, Feb 13 '18 at 16:25
If I recall correctly, a sparse array is stored as rows of 3 values (row_in_array,col_in_array,val) of nonzero elements. So basically you can just express your array in this format, using the smallest datatype that can contain every bit of info (some sort of `uint` for row/col and probably `double` for the values), then export this in whichever binary format you want, then import in the target software and use it to create a sparse array locally. — Dev-iL, Feb 13 '18 at 16:47
@Dev-iL I updated the definition of sparse matrix in the question. For sparse matrix I mean a matrix full of zeros, not a sparse matrix file format - the identification of which is actually the objective for this question. — gc5, Feb 13 '18 at 16:50
Why not dump the information into a file as json {[col,row]:element} including only non-zero entries and then read it in using the information in this answer https://stackoverflow.com/questions/2617600/importing-data-from-a-json-file-into-r — alanf, Feb 13 '18 at 16:58
@gc5 You might've misunderstood what I meant. The whole point of sparse matrices is that you only store (and perform operations) on nonzero elements. The way elements are stored is using 3 values per element. Your task is to encode this information in a file in a space-conserving way. If you use `uint` instead of `double` you might be able to save some space. If you choose to export it as a MATLAB `.mat` file (v7 or later) it will be compressed, saving more space. You don't have to export whatever representation the source software has for `sparse` arrays and hope it can be read later on. — Dev-iL, Feb 13 '18 at 17:00
@Dev-iL , yep sorry I misunderstood your first comment. Does the `.mat` format provide this sparse representation? — gc5, Feb 13 '18 at 17:03
@gc5 Yes, you can save a sparse Matlab matrix in a mat file, as you can any variable. However, a quick test shows the file size might not always be smaller. Using standard `save`, `eye(100)` and `sparse(eye(100))` are 257 and 364 bytes respectively. Whilst a 1000*1000 zeros matrix with one 1 is 3.63kB standard, and 215B when sparse. The advantage is dependent on *how sparse* your matrix is. Note there are [different compression options](https://uk.mathworks.com/help/matlab/import_export/mat-file-versions.html#br_4ten) — Wolfie, Feb 13 '18 at 17:42

gc5 · Answer 1 · 2018-02-14T01:53:21.620


I found the Feather format (which currently does not support Matlab, afaik).

Some comparison on reading and writing, and memory performance in Pandas is provided in this section.

It also supports the Julia language.

Edit:

In my case this format uses more disk space than the .txt file, probably as a trade-off in favor of I/O performance. Compressing with zip alleviates the problem, but compression during writing does not seem to be supported yet.
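For the plain-text route, compression can also be applied while writing. Below is a minimal Python sketch using the standard-library gzip module (gzip rather than zip, so this is a swapped-in technique and not what I did above). The function names are made up for illustration, and the values are assumed to be integers.

```python
import gzip

def write_gzipped_tsv(path, rows):
    """Write rows of integers as tab-delimited text, gzip-compressed on the fly."""
    with gzip.open(path, "wt", newline="") as fh:
        for row in rows:
            fh.write("\t".join(str(v) for v in row) + "\n")

def read_gzipped_tsv(path):
    """Read the file back into a list of lists of integers."""
    with gzip.open(path, "rt", newline="") as fh:
        return [[int(v) for v in line.rstrip("\n").split("\t")] for line in fh]
```

A mostly-zero matrix compresses very well this way, because long runs of "0" and tab characters are highly redundant. As far as I know, pandas can read such a file directly when the name ends in .gz, but check that against your version.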

edited Feb 14 '18 at 01:53

answered Feb 13 '18 at 17:00

gc5


How would a larger file size improve IO performance? – naught101 Mar 01 '21 at 05:21

Wybird666 · Answer 2 · 2018-02-14T01:27:00.327

You have several solutions, but generally what you need to do is output the indices of the non-zero elements as well as their values. Let's assume that you want to export to a single text file.

Generate array

Let's first generate a 10000 x 5000 sparse array with ~10% filled (it will be a bit less due to replicated indices):

N = 10000; 
M = 5000; 
rho = .1; 
rN = ceil(sqrt(rho)*N);
rM = ceil(sqrt(rho)*M);
S = sparse(N, M); 
S(randi(N, [rN 1]), randi(M, [rM 1])) = randi(255, rN, rM);

If your array is not stored as a sparse array, you can create it simply using (where A is the full array):

S = sparse(A);

Save as text file

Now we will save the matrix in the following format, with one line per nonzero element:

row_indx col_indx value
row_indx col_indx value
row_indx col_indx value

This is done by extracting the row and column indices as well as data values and then saving it to a text file in a loop:

[n, m, s] = find(S);
fid = fopen('Sparse.txt', 'wt');
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%d\n', n, m, s), n, m, s);
fclose(fid);

If the underlying data is not an integer, then you can use the %f conversion for the last field, e.g. (saved with 15 decimal places):

arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%.15f\n', n, m, s), n, m, s);

Compare this to the full array:

fid = fopen('Full.txt', 'wt'); 
arrayfun(@(n) fprintf(fid, '%s\n', num2str(S(n, :))), (1:N).'); 
fclose(fid);

In this case, the sparse file is ~50MB and the full file ~170MB, a saving of roughly a factor of 3. This is expected: I need to save 3 numbers for every nonzero element of the array, and ~10% of the array is filled, so ~30% as many numbers are saved compared to the full array.

For floating-point data the saving is larger, since the indices take up much less text than the floating-point values.
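The arithmetic behind that estimate can be written down as a small Python sketch. The function names are hypothetical. It counts stored numbers only, not bytes or characters, so it is a rough guide and not a file-size predictor.

```python
def triplet_fraction(density, numbers_per_entry=3):
    """Numbers stored in triplet form, as a fraction of the full array's count."""
    return numbers_per_entry * density

def break_even_density(numbers_per_entry=3):
    """Fill fraction above which triplets store more numbers than the full array."""
    return 1 / numbers_per_entry
```

For a 10% fill, triplet_fraction(0.10) gives about 0.3, which matches the ~30% figure above. Once roughly a third of the entries are nonzero, the triplet form stops paying off.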

In Matlab, a quick way to extract the data would be to save the string given by:

mat2str(S)

This is essentially the same, but it wraps the data in the sparse command for easy loading in Matlab; other languages would need to parse it to read it in. The command tells you how to recreate the array, which implies you may need to store the size of the matrix in the file as well. I recommend putting it in the first line, since you can then read it in and create the sparse matrix before parsing the rest of the file.
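As a rough illustration of that layout read from another language, here is a Python sketch using only the standard library. It assumes a first line holding "rows cols", followed by one "row col value" line per nonzero element, with 1-based indices as returned by find and integer values. Note that Sparse.txt as written above has no size line, so the header is an assumption. The function names are made up.

```python
def write_triplets(path, shape, triplets):
    """Write a 'rows<TAB>cols' header, then one tab-separated triplet per line."""
    with open(path, "w") as fh:
        fh.write(f"{shape[0]}\t{shape[1]}\n")
        for row, col, value in triplets:
            if value != 0:  # zeros are implied, so skip them
                fh.write(f"{row}\t{col}\t{value}\n")

def read_triplets(path):
    """Return (shape, {(row, col): value}) from a file in the layout above."""
    with open(path) as fh:
        n_rows, n_cols = (int(tok) for tok in fh.readline().split())
        entries = {}
        for line in fh:
            if line.strip():  # tolerate a trailing blank line
                row, col, value = line.split()
                entries[(int(row), int(col))] = int(value)
    return (n_rows, n_cols), entries
```

The dictionary keyed by (row, col) is only a stand-in. In practice you would pass the three columns to whatever sparse constructor your library provides, subtracting 1 from the indices if it is 0-based.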

Save as binary file

A much more efficient method is to save as a binary file. Assuming the data and indices can be stored as unsigned 16 bit integers you can do the following:

[n, m, s] = find(S);
fid = fopen('Sparse.dat', 'w');
fwrite(fid, size(S), 'uint16');
fwrite(fid, [n m s], 'uint16');
fclose(fid);

Then to read the data:

fid = fopen('Sparse.dat', 'r');
sz = fread(fid, 2, 'uint16');
s = reshape(fread(fid, 'uint16'), [], 3);
s = sparse(s(:, 1), s(:, 2), s(:, 3), sz(1), sz(2));
fclose(fid);

Now we can check they are equal:

isequal(S, s)
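To read the same Sparse.dat outside Matlab, the byte layout has to be reproduced by hand. Here is a Python sketch using the standard-library struct module. It assumes the file is little-endian (fwrite uses the machine's native byte order by default, which is little-endian on typical x86 hardware). It also assumes the layout produced above: two uint16 values for the size, then all row indices, all column indices and all values, because [n m s] is written column by column. The function names are hypothetical.

```python
import struct

def write_sparse_dat(path, shape, triplets):
    """Mirror the fwrite layout: size, then rows, then cols, then values (uint16)."""
    rows = [r for r, _, _ in triplets]
    cols = [c for _, c, _ in triplets]
    vals = [v for _, _, v in triplets]
    words = list(shape) + rows + cols + vals
    with open(path, "wb") as fh:
        # struct.pack raises struct.error if any value is outside 0..65535
        fh.write(struct.pack("<%dH" % len(words), *words))

def read_sparse_dat(path):
    """Return (shape, [(row, col, value), ...]) from a file in that layout."""
    with open(path, "rb") as fh:
        raw = fh.read()
    if len(raw) % 2 or (len(raw) // 2 - 2) % 3:
        raise ValueError("file length does not match the expected layout")
    words = struct.unpack("<%dH" % (len(raw) // 2), raw)
    n_rows, n_cols = words[:2]
    body = words[2:]
    nnz = len(body) // 3
    rows, cols, vals = body[:nnz], body[nnz:2 * nnz], body[2 * nnz:]
    return (n_rows, n_cols), list(zip(rows, cols, vals))
```

The indices come back 1-based, exactly as Matlab wrote them.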

Saving the full array:

fid = fopen('Full.dat', 'w');
fwrite(fid, full(S), 'uint16');
fclose(fid);

Comparing the sparse and full file sizes, I get 21MB and 95MB respectively.

A couple of notes:

Using a single write/read command is much (much much) quicker than looping, so the last method is by far the fastest, and also most space efficient.
The maximum index/data value that can be saved as an unsigned binary integer is 2^n - 1, where n is the bit depth. In my example of 16 bits (uint16), that corresponds to a range of 0..65,535. By the sounds of it, you may need to use 32 bits or even 64 bits just to store the indices.
Higher efficiency can be obtained by saving the indices as one data type (e.g. uint32) and the actual values as another (e.g. uint8). However, this adds additional complexity in the saving and reading.
You will still want to store the matrix size first, as I showed in the binary example.
You can store the values as doubles if required, but indices should always be integers. Again, extra complexity, but doable.
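The mixed-width idea from the notes above could look like the Python sketch below. It assumes uint32 indices, uint8 values and a little-endian header of three uint32 fields (rows, cols, number of nonzeros). The header and the function names are my own choices for illustration, not an established format.

```python
import struct

HEADER = "<3I"  # n_rows, n_cols, nnz as little-endian uint32

def pack_mixed(shape, triplets):
    """Pack the size header, then uint32 rows, uint32 cols and uint8 values."""
    nnz = len(triplets)
    rows = [r for r, _, _ in triplets]
    cols = [c for _, c, _ in triplets]
    vals = [v for _, _, v in triplets]
    # struct.pack raises struct.error if a value does not fit its type
    return (struct.pack(HEADER, shape[0], shape[1], nnz)
            + struct.pack("<%dI" % nnz, *rows)
            + struct.pack("<%dI" % nnz, *cols)
            + struct.pack("<%dB" % nnz, *vals))

def unpack_mixed(blob):
    """Inverse of pack_mixed: return (shape, [(row, col, value), ...])."""
    n_rows, n_cols, nnz = struct.unpack_from(HEADER, blob, 0)
    offset = struct.calcsize(HEADER)
    rows = struct.unpack_from("<%dI" % nnz, blob, offset)
    offset += 4 * nnz
    cols = struct.unpack_from("<%dI" % nnz, blob, offset)
    offset += 4 * nnz
    vals = struct.unpack_from("<%dB" % nnz, blob, offset)
    return (n_rows, n_cols), list(zip(rows, cols, vals))
```

Each nonzero element then costs 4 + 4 + 1 = 9 bytes instead of 12 with uint32 throughout. Storing the count in the header is what lets the reader find where each block starts.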

Note that you can also zip up (and unzip) the file in Matlab, which will compress the text file very nicely. You may also find that the full text file compresses to a similar amount. — Wybird666, Feb 14 '18 at 00:55
That's a good answer, thanks. I need still to understand if it can be used easily with Python and R code. — gc5, Feb 15 '18 at 20:50
All these methods are compatible with any language, especially R and python. Text files are just that, plain ASCII text and therefore can be read by any language. All languages handle binary files - you simply have to tell it how the data is packed. Note different languages encode data differently (e.g. big-endian or little-endian), but you should be able to specify / convert. — Wybird666, Apr 18 '18 at 08:12
