
I have a folder of thousands of pickled one-dimensional NumPy arrays; each array holds 921603 integer values (up to 3 digits each).

Like so:

folder/
  |0.pkl
  |1.pkl
  |2.pkl
   ...
  |5000.pkl

The goal is to convert them into a single merged.csv file, so that each datapoint (one pickled NumPy array) becomes one row of the output file.

My super inefficient approaches so far:

  • Loading the pickles and iterating through them to build a string, which is then appended to a CSV file. :(

  • Using numpy.savetxt() also did not work out as smoothly as I had hoped...

The final goal is a merged file that acts as training data for TensorFlow, so I also welcome ideas for different, possibly better-optimized ways of packaging the datapoints.

I would be grateful for any comments and ideas!

Cryptic Pug
  • It sounds like you could write up some sort of input stage that feeds data loaded from these pickles directly to TensorFlow, rather than making a CSV. A CSV seems superfluous. – user2357112 Apr 20 '18 at 21:03
  • Did you know that you can save multiple pickles to a single file? `with open(path, 'wb') as f: pickle.dump(obj1, f); pickle.dump(obj2, f); ...` – scnerd Apr 20 '18 at 21:03
  • I'd recommend using something like Keras' `flow_from_directory(directory)` method to help you load these files as needed. I believe it works just fine with binary data too. More documentation on this page: https://keras.io/preprocessing/image/ – Grant Williams Apr 20 '18 at 21:05
  • Don't write it to a CSV; that will be very slow to read and write. Use HDF5 instead, and avoid this very usual failure when doing so: https://stackoverflow.com/a/48405220/4045774 – max9111 Apr 20 '18 at 22:32
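scnerd's multiple-dumps-per-file idea from the comments can be sketched like so (this uses an in-memory buffer in place of a real file, and small made-up arrays as stand-ins):

```python
import io
import pickle
import numpy as np

# Write several arrays to one stream with repeated pickle.dump calls.
buf = io.BytesIO()
for i in range(3):
    pickle.dump(np.arange(5) + i, buf)

# Read them back: each pickle.load consumes exactly one pickled object,
# and EOFError signals that the stream is exhausted.
buf.seek(0)
arrays = []
while True:
    try:
        arrays.append(pickle.load(buf))
    except EOFError:
        break
```

With a real file you would pass an open file handle (`'wb'` / `'rb'`) instead of the BytesIO buffer.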

1 Answer


A straightforward NumPy approach is to collect the arrays in a list, stack them into one big array, and then save that.

import os
import pickle
import numpy as np

alist = []
for fname in sorted(os.listdir('folder')):      # the folder of .pkl files
    with open(os.path.join('folder', fname), 'rb') as f:
        alist.append(pickle.load(f))
arr = np.array(alist)
# or arr = np.stack(alist)

arr should then be a 2d array.

np.save(bigfile, arr) will save the whole thing in one file, in NumPy's binary .npy format.
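A quick round-trip sketch of np.save/np.load (the filename merged.npy and the small array are placeholders for the real data). For arrays too large for RAM, np.load's mmap_mode reads rows lazily from disk, which can be handy when feeding a training loop:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)           # stand-in for the real 5000 x 921603 array
np.save('merged.npy', arr)                  # writes NumPy's binary .npy format
loaded = np.load('merged.npy')              # reads the whole array back into memory

# Memory-map instead of loading: data is fetched from disk on access.
view = np.load('merged.npy', mmap_mode='r')
row = view[1]                               # only this row is actually read
```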

(by the way, pickling a NumPy array uses the same np.save format internally)

np.savetxt(bigfile, arr, fmt='%3d', delimiter=',') should also work to save the array in csv format.

Experiment with a subset of the pickles first.
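For example, the whole pipeline can be tried end-to-end on a tiny synthetic subset (three pickles of length 10 instead of thousands of length 921603; the temp directory stands in for your folder). Note the numeric sort key: a plain lexicographic sort would put 10.pkl before 2.pkl:

```python
import os
import pickle
import tempfile
import numpy as np

folder = tempfile.mkdtemp()

# Create a few small stand-in pickles named like the real ones (0.pkl, 1.pkl, ...).
for i in range(3):
    with open(os.path.join(folder, f'{i}.pkl'), 'wb') as f:
        pickle.dump(np.full(10, i), f)

# Sort numerically by the filename stem so row order matches file numbering.
files = sorted(os.listdir(folder), key=lambda n: int(os.path.splitext(n)[0]))

alist = []
for fname in files:
    with open(os.path.join(folder, fname), 'rb') as f:
        alist.append(pickle.load(f))

arr = np.stack(alist)                       # shape (3, 10)
np.savetxt(os.path.join(folder, 'merged.csv'), arr, fmt='%3d', delimiter=',')
```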

hpaulj