I produce a very large data file with Python, mostly consisting of 0
(false) and only a few 1
(true). It has about 700.000 columns and 15.000 rows and thus a size of 10.5GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are not too wide but are fairly long.
As I have a very wide table, I assume, HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...