
I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values.

I previously saved the values to a tab-separated CSV file, huge.csv, in which the floats are stored as strings; the file is 125GB. I read it in on a 50-core Xeon machine with 250GB RAM and am now trying to write it out as a .parq directory like so:

import dask.dataframe as dd
filename = 'huge.csv'
# sample is the number of bytes read_csv uses to infer column dtypes
df = dd.read_csv(filename, delimiter='\t', sample=500000000)
df.to_parquet('huge.parq')

It has been writing to huge.parq for close to a week, the directory has only grown to 14GB, and it looks like the .to_parquet call is not going to finish any time soon.

free -mh shows that there is still memory available, but writing out the .parq directory is tremendously slow:

$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           251G         98G         52G         10M        101G        152G
Swap:          238G          0B        238G

The questions are:

  • Given the size of the dataframe and machine, is it feasible to save the dask dataframe to a parquet file at all?

  • Is it normal for dask and fastparquet to take so long to save huge dataframes?

  • Is there some way to estimate the time it will take to save a parquet file?

alvas
  • 10e9 float values doesn't seem huge to me. 1e5 columns does though. Have you considered using dask.array and HDF5? These might be better suited for blocking in both dimensions. – MRocklin May 26 '17 at 06:07
  • Is there a reason why dask.array and HDF5 are better for dataframes with a very large number of columns? What is "blocking"? – alvas May 26 '17 at 09:01
  • How many rows per partition? read_csv splits on number of bytes, so I expect a small number. For each column of each partition, there is a separate piece of metadata that must exist, making your metadata bigger than any I've seen before - but I would expect it to work. For storing array-like 100kx100k floats, I actually recommend [zarr](http://zarr.readthedocs.io/en/latest/). – mdurant May 26 '17 at 12:57
  • Parquet creates a new segment of data for every column. So every column has a non-trivial cost. HDF5 or ZArr can "block" or group data by row and by column. This tends to be nicer if you have many rows and many columns – MRocklin May 26 '17 at 22:52

1 Answer


As discussed in the comments above, there is no theoretical reason that .to_parquet() should not cope with your data. However, the number of columns is extremely large, and because there is an overhead associated with each, it is not surprising that the process is taking a long time - this is not the typical use case.
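
To put a rough number on that overhead, here is a back-of-the-envelope sketch (not from the original answer; it assumes CSV partitions of roughly 64MB, which is in the range read_csv picks by default, and roughly one parquet row group per partition):

# rough, assumed numbers -- not measured from the actual run
csv_bytes = 125 * 1024**3           # ~125GB of CSV text
blocksize = 64 * 1024**2            # assumed ~64MB read_csv partition size
n_columns = 100000

n_partitions = csv_bytes // blocksize        # ~2000 partitions
n_column_chunks = n_partitions * n_columns   # one metadata entry per column per row group
print(n_partitions, n_column_chunks)         # ~2000 and ~200 million

Even though 10e9 floats is not a remarkable amount of data, writing (and later reading back) on the order of hundreds of millions of column-chunk metadata entries is where the time goes.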

It sounds like your data is best thought of as an array rather than a table. There are array storage mechanisms that allow you to chunk in every dimension, for instance zarr, which also allows for various compression and pre-filtering operations that can make efficient use of disk space. (Other formats, like HDF5, are also popular for a task like this.)

An example of how to store a 10k X 10k array:

import dask.array as da
import zarr

# a 10k x 10k dask array split into 1000 x 1000 chunks
arr = da.random.random(size=(10000, 10000), chunks=(1000, 1000))
# an on-disk zarr array with matching shape and chunking
z = zarr.open_array('z.zarr', shape=(10000, 10000), chunks=(1000, 1000), mode='w', dtype='float64')
# write the dask array chunk-by-chunk into the zarr store
arr.store(z)

and now z.zarr/ contains 100 data-file chunks.
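
To read the store back later (a minimal sketch, not part of the original answer), the zarr array can be wrapped in a dask array again, so any block can be pulled out without touching the rest:

import dask.array as da
import zarr

z = zarr.open_array('z.zarr', mode='r')    # open the store read-only
arr = da.from_array(z, chunks=z.chunks)    # lazy dask array with the same 1000 x 1000 chunks
print(arr[:1000, :1000].mean().compute())  # operate on one block without loading the whole array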

In your case, the tricky part is reading the data in, since you don't know a priori the number of rows. You could use

import dask.dataframe as dd
import zarr

df = dd.read_csv(..)        # as above
nrows = len(df)             # get the number of rows, needed for the zarr shape (requires a full pass)
z = zarr.open_array(...)    # provide dtype, shape and chunks appropriately
df.values.store(z)          # write the underlying dask array into the zarr store

or it may be more efficient to wrap np.loadtxt with dask.delayed to forgo the dataframe stage.
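
A rough sketch of that delayed route (not from the original answer; the file names, the pre-splitting of huge.csv into equal-sized row blocks, and the block size are all assumptions for illustration):

import dask
import dask.array as da
import numpy as np
import zarr

# hypothetical pieces of huge.csv, pre-split (e.g. with the Unix split command)
# so that each piece holds a known number of rows
block_files = ['huge_part_%03d.tsv' % i for i in range(100)]
rows_per_block = 1000
n_columns = 100000

# one delayed np.loadtxt call per piece, each producing one (rows_per_block, n_columns) block
blocks = [
    da.from_delayed(
        dask.delayed(np.loadtxt)(fn, delimiter='\t'),
        shape=(rows_per_block, n_columns),
        dtype='float64',
    )
    for fn in block_files
]
arr = da.concatenate(blocks, axis=0)

# stream the stacked array into a zarr store chunked in both dimensions
z = zarr.open_array('huge.zarr', shape=arr.shape, chunks=(rows_per_block, 1000),
                    mode='w', dtype='float64')
arr.store(z)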

mdurant
  • There are datasets like KDD-2009 (http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data), which has 15k columns and 50k records. It is not 100k by 100k, but it is a columnar dataset, so it doesn't make any sense to handle it as a matrix. Do you happen to know the limits of Dask DataFrame? – Vlad Frolov Jun 01 '17 at 05:56
  • I would say that there are no particular limits, but that the price you pay in overheads for various computations will depend on what it is you are trying to do. I would be interested to see the performance of all that data stored as parquet (with sensible choices of column data types). – mdurant Jun 01 '17 at 13:54