
I have a folder of thousands of pickled one-dimensional NumPy arrays; each array holds 921603 integer values (up to 3 digits each).

Like so:

folder/
  |0.pkl
  |1.pkl
  |2.pkl
   ...
  |5000.pkl

The goal is to convert them into a single merged.csv file, so that each datapoint (one pickled NumPy array) becomes one row of the output file.

My super inefficient approaches so far:

  • Loading the pickles and iterating through them to build a string, which is then appended to a CSV file. :(

  • Using numpy.savetxt() also did not work out as smoothly as I had hoped...

The final goal is a merged file that acts as training data for TensorFlow, so I also welcome ideas for different, possibly better-optimized ways of packaging the datapoints.

I would be grateful for any comments and ideas!

Cryptic Pug
  • It sounds like you could write up some sort of input stage that feeds data loaded from these pickles directly to TensorFlow, rather than making a CSV. A CSV seems superfluous. – user2357112 Apr 20 '18 at 21:03
  • Did you know that you can save multiple pickles to a single file? `with open(path, 'wb') as f: pickle.dump(obj1, f); pickle.dump(obj2, f); ...` – scnerd Apr 20 '18 at 21:03
  • I'd recommend using something like Keras' `flow_from_directory(directory)` method to help you load these files as needed. I believe it works just fine with binary data too. More documentation on this page: https://keras.io/preprocessing/image/ – Grant Williams Apr 20 '18 at 21:05
  • Don't write it to a CSV; that will be very slow to read and write. Use HDF5 instead, and avoid this very usual failure when doing so: https://stackoverflow.com/a/48405220/4045774 – max9111 Apr 20 '18 at 22:32
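scnerd's multiple-dumps-per-file idea from the comments can be sketched like so (this uses an in-memory buffer in place of a real file, and small made-up arrays as stand-ins):

```python
import io
import pickle
import numpy as np

# Write several arrays to one stream with repeated pickle.dump calls.
buf = io.BytesIO()
for i in range(3):
    pickle.dump(np.arange(5) + i, buf)

# Read them back: each pickle.load consumes exactly one pickled object,
# and EOFError signals that the stream is exhausted.
buf.seek(0)
arrays = []
while True:
    try:
        arrays.append(pickle.load(buf))
    except EOFError:
        break
```

With a real file you would pass an open file handle (`'wb'` / `'rb'`) instead of the BytesIO buffer.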

1 Answer


A straightforward NumPy approach is to collect the arrays in a list, stack them into one big array, and then save that.

import os
import pickle
import numpy as np

alist = []
for fname in sorted(os.listdir('folder')):      # the folder of .pkl files
    with open(os.path.join('folder', fname), 'rb') as f:
        alist.append(pickle.load(f))
arr = np.array(alist)
# or arr = np.stack(alist)

arr should then be a 2d array.

np.save(bigfile, arr) will save the whole thing in one file, in NumPy's binary .npy format.
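A quick round-trip sketch of np.save/np.load (the filename merged.npy and the small array are placeholders for the real data). For arrays too large for RAM, np.load's mmap_mode reads rows lazily from disk, which can be handy when feeding a training loop:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)           # stand-in for the real 5000 x 921603 array
np.save('merged.npy', arr)                  # writes NumPy's binary .npy format
loaded = np.load('merged.npy')              # reads the whole array back into memory

# Memory-map instead of loading: data is fetched from disk on access.
view = np.load('merged.npy', mmap_mode='r')
row = view[1]                               # only this row is actually read
```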

(by the way, pickling a NumPy array uses the same np.save format internally)

np.savetxt(bigfile, arr, fmt='%3d', delimiter=',') should also work to save the array in csv format.

Experiment with a subset of the pickles first.
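For example, the whole pipeline can be tried end-to-end on a tiny synthetic subset (three pickles of length 10 instead of thousands of length 921603; the temp directory stands in for your folder). Note the numeric sort key: a plain lexicographic sort would put 10.pkl before 2.pkl:

```python
import os
import pickle
import tempfile
import numpy as np

folder = tempfile.mkdtemp()

# Create a few small stand-in pickles named like the real ones (0.pkl, 1.pkl, ...).
for i in range(3):
    with open(os.path.join(folder, f'{i}.pkl'), 'wb') as f:
        pickle.dump(np.full(10, i), f)

# Sort numerically by the filename stem so row order matches file numbering.
files = sorted(os.listdir(folder), key=lambda n: int(os.path.splitext(n)[0]))

alist = []
for fname in files:
    with open(os.path.join(folder, fname), 'rb') as f:
        alist.append(pickle.load(f))

arr = np.stack(alist)                       # shape (3, 10)
np.savetxt(os.path.join(folder, 'merged.csv'), arr, fmt='%3d', delimiter=',')
```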

hpaulj