
I am using cPickle to pickle a huge output: a numpy array of shape (4000, 600, 600).

Unfortunately, there is a well-documented problem with pickling large objects via cPickle on Python 2.7, which produces the error:

SystemError: error return without exception set
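For context, the failing pattern looks roughly like this (a sketch; the filename is an assumption, and the array here is shrunk so the snippet is runnable):

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

import numpy as np

# Stand-in for the real array; the actual shape is (4000, 600, 600),
# roughly 11.5 GB as float64, which is what triggers the SystemError
# on Python 2.7's cPickle.
arr = np.zeros((4, 600, 600))

with open("output.pkl", "wb") as f:
    pickle.dump(arr, f, protocol=2)
```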

This error has been fixed in Python 3.4, however.

Please see here for details: https://github.com/numpy/numpy/issues/2396

I could just install the required modules and dependencies under Python 3.4 and run the program, right? Unfortunately, this project has various problems under Python 3.4: several Python modules lack OpenMP support and run into other issues.

Question 1: I have tried breaking up the array and pickling it in smaller parts, but I still run into issues. How small must each piece be for cPickle to work reliably?
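One possible workaround sketch for Question 1: pickle the array in blocks of slices along the first axis, so that no single pickled object is huge. The chunk size and filenames here are assumptions, and on Python 2.7 the chunk size may need tuning downward:

```python
import numpy as np
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

def dump_chunked(arr, fname, chunk=100):
    """Pickle `arr` in blocks of `chunk` slices along axis 0."""
    with open(fname, "wb") as f:
        # Header: shape, dtype, and chunk size, so the reader can rebuild.
        pickle.dump((arr.shape, arr.dtype, chunk), f, protocol=2)
        for i in range(0, arr.shape[0], chunk):
            pickle.dump(arr[i:i + chunk], f, protocol=2)

def load_chunked(fname):
    """Reassemble an array written by dump_chunked."""
    with open(fname, "rb") as f:
        shape, dtype, chunk = pickle.load(f)
        out = np.empty(shape, dtype=dtype)
        for i in range(0, shape[0], chunk):
            out[i:i + chunk] = pickle.load(f)
    return out
```

Note this only avoids pickling one enormous object in a single call; it does not change pickle's per-element overhead or storage format.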

Question 2: How can I quickly write this array out so that another Python script or IPython notebook can read it?
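For Question 2, numpy's native format is usually the simplest way to hand an array to another script or notebook, since `np.save`/`np.load` preserve shape and dtype. A minimal sketch (the filename is an assumption, and the array is shrunk for illustration):

```python
import numpy as np

arr = np.random.rand(4, 600, 600)  # stand-in for the real (4000, 600, 600) array

np.save("output.npy", arr)  # writes shape + dtype + raw data

# In the other script / IPython notebook:
loaded = np.load("output.npy")
```

For an array that barely fits in memory, `np.load("output.npy", mmap_mode="r")` memory-maps the file instead of reading it all at once.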

ShanZhengYang
  • *Don't* pickle numpy arrays! It's extremely inefficient, both in terms of speed and storage requirements, and it will also lead to Python2/3 cross-compatibility issues. What are your actual requirements? If, for some reason, you really can't use numpy's native format (`np.save()` or `np.savez()`), you could use `array.tofile()` to output a plain binary output. HDF5 (e.g. PyTables or h5py) is another cross-platform format for storing large datasets that supports a much wider range of features. – ali_m Dec 04 '15 at 22:00
  • @ali_m You have a point, however `array.tofile()` has serious issues. From the website: "This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness." – ShanZhengYang Dec 05 '15 at 08:11
  • @ali_m "It's extremely inefficient, both in terms of speed and storage requirements," Do you have any statistics for this? That would be helpful for me. – ShanZhengYang Dec 05 '15 at 08:11
  • Yes, as I said, it depends on your requirements, which you haven't really explained in the question. Perhaps you don't need to store information about endianness or precision - I have no way of knowing. You said that "numpy's save and load methods will have problems" but you haven't said why. – ali_m Dec 05 '15 at 12:48
  • The speed/storage requirements of pickling vs other storage methods will depend on the dataset. [This question](http://stackoverflow.com/q/16833124/1461210) has some data on speed, although it is probably a bit out of date. My answer [here](http://stackoverflow.com/a/33754559/1461210) compares the storage requirements of `cPickle.dump` against `np.save` and `joblib.dump`, which uses `np.save` internally for storing numpy arrays. The question is slightly different since it's about storing a class containing arrays rather than just arrays, which explains why `np.save` still does poorly. – ali_m Dec 05 '15 at 13:08
  • @ali_m I was assuming `np.savez()` wouldn't work with numpy arrays of billions of elements. I was incorrect however. – ShanZhengYang Dec 05 '15 at 21:41
