4

I produce a very large data file with Python, mostly consisting of 0s (false) and only a few 1s (true). It has about 700,000 columns and 15,000 rows, and thus a size of about 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.

I'm looking for the right data format to export my file from Python.

As stated here:

> HDF5 is row based. You get MUCH efficiency by having tables that are not too wide but are fairly long.

Since I have a very wide table, I assume HDF5 is inappropriate in my case?

So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?

Example of my file:

id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Evgenij Reznik
    The best choice is probably some form of sparse matrix representation (i.e., it indicates the row and column positions of 1s). The `Matrix` package in R has quite a few of these formats, but I don't know the best way to handle interchange with Python. I don't know if HDF5 handles sparsity in a sensible way. – Ben Bolker Jan 19 '16 at 21:52
  • You could transpose your matrix, and then it would be tall rather than wide. Sounds like SNPs, so you should give some thought to the software you're going to use to analyze it; maybe there are custom or other formats required. You could run-length encode it, e.g., 0 0 0 0 0 1 1 0 0 --> 5 2 2; you'd probably do better with the encoding 5 7 9 for better random access (a small Python sketch of this idea follows these comments). This will have excellent compression but will require your own algos; see, though, the Rle class in Bioconductor [S4Vectors](http://bioconductor.org/packages/S4Vectors). R stores logicals as ints, so 42 GB. – Martin Morgan Jan 19 '16 at 22:01
  • @MartinMorgan: What exactly would be the benefit of transposing the matrix? – Evgenij Reznik Jan 19 '16 at 22:09
  • @user1170330 you said HDF5 works well on long matrices, so make them long... I personally wouldn't have thought this to be a pain point in a typical R-based analysis. – Martin Morgan Jan 19 '16 at 22:15
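
A minimal Python sketch of the run-length idea from the comment above (not code from the thread; the function names are made up here). The plain variant stores run lengths; the cumulative variant stores where each run ends, which makes random access a binary search over the end positions:

```python
# Run-length encoding of a 0/1 sequence, assuming (as in the comment's
# example) that the first run consists of zeros.

def run_lengths(row):
    """[0,0,0,0,0,1,1,0,0] -> [5, 2, 2] (lengths of consecutive runs)."""
    if len(row) == 0:
        return []
    runs = []
    count = 1
    for prev, cur in zip(row, row[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def run_ends(row):
    """[0,0,0,0,0,1,1,0,0] -> [5, 7, 9] (cumulative run ends)."""
    ends, total = [], 0
    for length in run_lengths(row):
        total += length
        ends.append(total)
    return ends

print(run_lengths([0, 0, 0, 0, 0, 1, 1, 0, 0]))  # [5, 2, 2]
print(run_ends([0, 0, 0, 0, 0, 1, 1, 0, 0]))     # [5, 7, 9]
```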

2 Answers

4

Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot. Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?

Perhaps rethinking your approach to how you're storing the data (e.g., store the coordinates of the 1s if there are very few) might be a better angle here.

For instance, instead of storing a 700K by 15K table of all zeros except for a 1 at row 600492, column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
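
This isn't code from the answer, just a minimal sketch of the coordinate idea in Python, assuming the data is available as a NumPy array; the small array and the file name `ones.csv` are hypothetical stand-ins:

```python
import numpy as np

# Small hypothetical stand-in for the real 15,000 x 700,000 array of 0/1 values.
data = np.zeros((5, 8), dtype=np.uint8)
data[1, 3] = 1
data[4, 6] = 1

# np.nonzero returns the (row, column) coordinates of every nonzero entry.
rows, cols = np.nonzero(data)

# Write only those coordinates instead of the full matrix.
with open("ones.csv", "w") as fh:
    fh.write("row,col\n")
    for r, c in zip(rows, cols):
        fh.write(f"{r},{c}\n")
```

On the R side such a file can be read with read.csv and turned into a sparse matrix via the Matrix package (e.g. sparseMatrix with i, j, and dims arguments, minding that R indices are 1-based), roughly along the lines the comments below suggest.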

Owen Hempel
  • Seconding the suggestion: if there are only a few 1s, exporting only the coordinates would be much easier. – Hansi Jan 19 '16 at 22:23
  • Is there an example of how to do that exactly? And is there also an appropriate way for importing such files in R? – Evgenij Reznik Jan 19 '16 at 22:26
  • Haven't looked into it in detail, but exporting the coordinates to CSV or JSON seems logical; then import that into R and use the Matrix package to build a sparse matrix to conserve memory there. – Hansi Jan 20 '16 at 08:34
  • @user1170330 if you post your Python code that makes the file I can help you with it. – Owen Hempel Jan 20 '16 at 13:23
1

SciPy has `scipy.io.mmwrite`, which writes files that can be read with R's `readMM` function (from the `Matrix` package). SciPy also supports several different sparse matrix representations.
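
Not part of the answer itself, just a minimal sketch of that route, with a small hypothetical array standing in for the real matrix (the file name `data.mtx` is also made up):

```python
import numpy as np
from scipy import io, sparse

# Small hypothetical stand-in for the real 15,000 x 700,000 matrix of 0/1 values.
data = np.zeros((5, 8), dtype=np.int8)
data[1, 3] = 1
data[4, 6] = 1

# Keep only the nonzero entries in a coordinate-format (COO) sparse matrix,
# then write it out in Matrix Market format.
coo = sparse.coo_matrix(data)
io.mmwrite("data.mtx", coo)
```

In R, `library(Matrix)` followed by `readMM("data.mtx")` reads the file back in as a sparse matrix.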

Hugh Bothwell