
I have a large image dataset to store: 300,000 images, each a vector of 28,800 pixels, which means I have a matrix of shape (300000, 28800).

I stored it as follows:

import numpy as np

img_arr = np.stack(images, axis=0)
np.savetxt('pixels_dataset_large.csv', img_arr, delimiter=",")

However, it takes a long time to load and I sometimes get a memory error:

data_pixels = np.genfromtxt("pixels_dataset_large.csv", delimiter=',')

Is there an alternative way to store and load it efficiently?

  • What's wrong with `np.save`/`np.load`? It saves the data as it is in memory, so no parsing is involved and the process is going to be as fast as the disk allows. – ivan_pozdeev Apr 18 '17 at 14:00
  • @ivan_pozdeev, I only have a problem with loading. When I try np.load() I get TypeError: load() got an unexpected keyword argument 'dtype' – vincent Apr 18 '17 at 14:12
  • 2
  • This means you're using `load` incorrectly. [It doesn't have a `dtype` argument](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html). – ivan_pozdeev Apr 18 '17 at 14:15
  • I did the following: np.load("pixels_dataset_large.csv", delimiter=','). There is no dtype parameter to put in load()!! – vincent Apr 18 '17 at 14:17
  • Then a [mcve] is in order. – ivan_pozdeev Apr 18 '17 at 14:20
  • 1
  • Ivan recommended storing a binary representation, which is much more clever (save, load). You now seem to be using np.load (again: binary!) to read the text-based (non-binary!) file you saved with savetxt; reading a binary representation will never take delimiter info. Of course that won't work! (So basically: savetxt <-> genfromtxt; save <-> load.) [numpy's docs on load even give a complete example](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html). – sascha Apr 18 '17 at 15:20
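To make the pairing described in the comments concrete, here is a minimal sketch of both round-trips. The small stand-in array and the filenames are placeholders, not the original data:

    import numpy as np

    # small stand-in for the real (300000, 28800) array
    img_arr = np.random.rand(100, 288).astype(np.float32)

    # Text round-trip: savetxt pairs with genfromtxt (or loadtxt)
    np.savetxt('pixels_small.csv', img_arr, delimiter=',')
    from_text = np.genfromtxt('pixels_small.csv', delimiter=',')

    # Binary round-trip: save pairs with load -- no delimiter or dtype arguments,
    # because the .npy file stores the dtype and shape itself
    np.save('pixels_small.npy', img_arr)
    from_binary = np.load('pixels_small.npy')  # optionally mmap_mode='r' to avoid reading it all into RAM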

1 Answer


If you're saving 300,000 x 28,800 values to CSV, then assuming a float representation you're looking at an output file on the order of a couple hundred gigabytes, depending on the precision of the output. Even if you have that much disk space lying around, CSV is incredibly inefficient at this scale.
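A rough back-of-the-envelope comparison, assuming float64 data and savetxt's default '%.18e' text format at roughly 25 bytes per value (both assumptions about the data, not facts from the question):

    n_values = 300_000 * 28_800    # 8.64 billion pixel values

    csv_bytes = n_values * 25      # '%.18e' writes ~24 chars per value plus a delimiter
    float64_bytes = n_values * 8   # raw float64, e.g. np.save or HDF5 without compression
    uint8_bytes = n_values * 1     # if the pixels are 8-bit integers in [0, 255]

    print(f"CSV     ~{csv_bytes / 1e9:.0f} GB")      # ~216 GB
    print(f"float64 ~{float64_bytes / 1e9:.0f} GB")  # ~69 GB
    print(f"uint8   ~{uint8_bytes / 1e9:.0f} GB")    # ~9 GB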

I'd suggest some binary storage scheme in this case (e.g. hdf5). You might check out the xarray package: it's pretty well-suited to working with dense array data of this size, it has an API that's very similar to NumPy, and it even leverages Dask for transparent support of parallel and/or memory-mapped computation.
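As a minimal sketch of the HDF5 route using h5py (the dataset name 'pixels', the chunk shape, and the compression setting are illustrative choices, not part of the answer above):

    import h5py
    import numpy as np

    # img_arr is the (300000, 28800) array from the question
    with h5py.File('pixels_dataset_large.h5', 'w') as f:
        f.create_dataset('pixels', data=img_arr,
                         chunks=(100, 28800), compression='gzip')

    # Later: open the file and read only what you need, instead of the whole matrix
    with h5py.File('pixels_dataset_large.h5', 'r') as f:
        dset = f['pixels']         # nothing is read into memory yet
        first_batch = dset[:1000]  # reads just the first 1000 images from disk

If you go through xarray instead, its to_netcdf / open_dataset with a chunks argument gives similar lazy, chunked access backed by Dask.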

jakevdp