
I have a numpy `ndarray` holding `numpy.float64` data, stored to a file in binary format using cPickle's `dump()` method:

from cPickle import dump, HIGHEST_PROTOCOL
with open(filePath, 'wb') as f:
    dump(numpyArray, f, protocol=HIGHEST_PROTOCOL)

At the time of this writing, HIGHEST_PROTOCOL corresponds to cPickle's protocol version 2, but there doesn't seem to be much documentation on how exactly this protocol works.
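(For context: in Python 3 the cPickle module was merged into the standard `pickle` module, which uses the C implementation automatically, and `HIGHEST_PROTOCOL` there is higher than 2. A minimal equivalent sketch, with an illustrative array in place of your data:)

```python
import pickle
import numpy as np

numpyArray = np.arange(6, dtype=np.float64).reshape(2, 3)

# In Python 3, pickle transparently uses the fast C implementation.
data = pickle.dumps(numpyArray, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(data)
assert np.array_equal(numpyArray, restored)
```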

What I'm trying to do is read this file and create a cv::Mat object (see here) with the data, which is proving quite difficult to do.

At this point, I'm looking to get things working as quickly as possible and I am not too worried about performance, storage space and efficiency. However, these factors might become important later.

Thus, my question would be, what is the easiest way I can go about converting the data in this file into a cv::Mat object? If you think that the easiest way isn't necessarily the most efficient way then I would love to hear your thoughts on that as well. Note that I'm open to using a different storage format, possibly just a text file, if it will make interoperability between Python and C++ easier.

I have to store the numpy array to disk because I need to be able to open and read this file on a mobile device (iOS and Android) and using a network call to get the data is not really on the table at the moment.

Sid
    At work, we avoided pickling and went for the raw byte arrays in numpy. We get the byte array, compress it using lz4, and then it is pretty easy reconstructing the array in other languages. We use mongodb to store the lz4 compressed byte array. – Brian Pendleton Sep 02 '15 at 17:48
  • thanks for the suggestion but in the end I just decided to output the `ndarray` as CSV and then load it into C++. I know it's not the most efficient solution but for now it works. – Sid Sep 04 '15 at 18:17

1 Answer


Pickle is probably not a convenient way to transport data to languages other than Python.

In fact I'd say Pickle is not really suitable for data storage at all, since:

  • It needs Python
  • It might not work if the data was saved using a later version of Python than the one you're reading it with
  • It's unsafe if you don't trust the data source

Which is not to say it doesn't have its uses: it's convenient for things like caching, personal scripts or communicating data between processes.

Others might disagree with that opinion though.

So what might you use? Here are some ideas:

  • Binary format, using `tofile`. This is probably the way to go for speed and size, and not terribly hard to load.
  • CSV file, possibly compressed (for 1D/2D arrays). You can use `savetxt`.
  • JSON, possibly compressed, with `tolist()` and `dumps`. This will be slow and yield large files, but it'll be portable and it'll work for any dimension and even for unequal row/column lengths.
  • If you can use Pandas, it supports many formats.
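A sketch of the `tofile` route on the Python side (note that `tofile` writes only the raw buffer with no header, so the dtype and shape have to be recorded out of band — here they're baked into a hypothetical file name):

```python
import os
import tempfile
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)

# tofile writes the raw bytes with no metadata: dtype and shape are lost,
# so store them yourself (filename, sidecar file, or a small header).
path = os.path.join(tempfile.gettempdir(), "data_3x4_float64.bin")
arr.tofile(path)

# Reading back: fromfile needs the dtype, reshape needs the shape.
loaded = np.fromfile(path, dtype=np.float64).reshape(3, 4)
assert np.array_equal(arr, loaded)
```

On the C++ side, the same bytes can be read into a buffer and wrapped with the `cv::Mat(rows, cols, CV_64F, dataPtr)` constructor, provided you know the shape and that the byte order matches.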

Some more just for fun:

  • Save a 2D array of small integers as a lossless grayscale image. Or with more effort, use 3 colors and alpha channel to store a single-precision float array.
  • Use (Fortran) unformatted data (python, C), which is actually a fairly efficient use of space, but plagued by many portability issues.
  • As a b64 (b85 for extra points) encoded string. Quite portable (b64, anyway) if you know the matrix layout, and probably smaller than plain text (like CSV).
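A minimal sketch of the b64 route (as with raw binary, the reader still needs to know the dtype and shape out of band):

```python
import base64
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

# Encode the raw buffer as ASCII text; decodable anywhere base64 is available.
encoded = base64.b64encode(arr.tobytes()).decode("ascii")

# Decode: the receiver must supply the dtype and shape.
decoded = np.frombuffer(base64.b64decode(encoded), dtype=np.float64).reshape(2, 3)
assert np.array_equal(arr, decoded)
```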

EDIT: here is a benchmark for various methods:

array storage benchmark

Mark