
My dictionary will consist of several thousand keys, with each key having a 1000x1000 numpy array as its value. I don't need the file to be human-readable. Small size and fast loading times are more important.

First I tried savemat, but I ran into problems. Pickle resulted in a huge file, and I assume the same would be true for csv. I've read posts recommending json (human-readable text, so probably huge) or a database (presumably complicated). What would you recommend for my case?

Framester
  • "in a huge file"? Define huge. 1000x1000 is a million values. Each each value is an int, then you have 4Mb of data. – S.Lott Feb 10 '12 at 18:40
  • @S.Lott It resulted in a 1.6 GB file – Framester Feb 10 '12 at 18:44
  • Following S.Lott's calculation, that is only about 410 keys with a 1000x1000 int matrix each. – Nobody moving away from SE Feb 10 '12 at 18:47
  • @Framester: Why did you expect it to be any smaller? – S.Lott Feb 10 '12 at 18:53
  • @S.Lott: I expect a file where all the matrices are saved as compressed binary to be way smaller than one where they are saved as plain text. – Framester Feb 10 '12 at 19:04
  • Why don't you just compress the files with the built-in `gzip` module? (A sketch of this follows after these comments.) – HardlyKnowEm Feb 10 '12 at 19:07
  • "compressed byte code"? How can a 4-byte int be compressed any smaller than 4 bytes? How can 1,000,000 4-byte ints be compressed any smaller? I'm unclear on where this compression can happen. Can you **update** the question to explain why this file is unacceptably huge? – S.Lott Feb 10 '12 at 19:55
  • @Framester: If you have a lot of repetition you could try a fractal compression algorithm. Anyhow, you can't have your cake and eat it too. You want smaller file size, it's going to cost you compression/decompression time. – Joel Cornett Feb 10 '12 at 21:08
  • In the end, compression is probably not necessary if it is an easy-to-use format. Storing a three-digit number like `158` as plain text (human-readable) needs 3×8 bits, but only 8 bits as a byte. – Framester Feb 11 '12 at 00:00
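
A minimal sketch of the gzip-compressed binary pickle suggested in the comments above (the file name is illustrative, and how much the compression helps depends on how repetitive the arrays are):

import gzip
import pickle
import numpy as np

d = {'key1': np.random.randint(0, 100, size=(1000, 1000))}

# dump with a binary pickle protocol straight into a gzip stream
with gzip.open('data.pkl.gz', 'wb') as f:
    pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)

# load it back the same way
with gzip.open('data.pkl.gz', 'rb') as f:
    d_back = pickle.load(f)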

5 Answers


If you have a dictionary where the keys are strings and the values are arrays, like this:

>>> import numpy
>>> arrs = {'a': numpy.array([1,2]),
            'b': numpy.array([3,4]),
            'c': numpy.array([5,6])}

You can use numpy.savez to save them, by key, to a single `.npz` archive file:

>>> numpy.savez('file.npz', **arrs)

To load it back:

>>> npzfile = numpy.load('file.npz')
>>> npzfile
<numpy.lib.npyio.NpzFile object at 0x1fa7610>
>>> npzfile['a']
array([1, 2])
>>> npzfile['b']
array([3, 4])
>>> npzfile['c']
array([5, 6])
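
Since small size matters here, note that savez itself writes an uncompressed archive; numpy.savez_compressed takes the same arguments and writes a zlib-compressed `.npz` (the file name is just illustrative):

>>> numpy.savez_compressed('file_compressed.npz', **arrs)
>>> numpy.load('file_compressed.npz')['b']
array([3, 4])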
jterrace

The filesystem itself is often an underappreciated data structure. You could have a dictionary that is a map from your keys to filenames, and then each file has the 1000x1000 array in it. Pickling the dictionary would be quick and easy, and then the data files can just contain raw data (which numpy can easily load).
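
A minimal sketch of that layout, assuming the dictionary keys are filesystem-safe strings (the directory and index file names are just illustrative):

import os
import pickle
import numpy as np

def save_dict(d, data_dir):
    """Write each array as its own raw .npy file and pickle a key -> filename index."""
    os.makedirs(data_dir, exist_ok=True)
    index = {}
    for key, arr in d.items():
        fname = os.path.join(data_dir, key + '.npy')
        np.save(fname, arr)  # plain binary, loads quickly
        index[key] = fname
    with open(os.path.join(data_dir, 'index.pkl'), 'wb') as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_index(data_dir):
    """The pickled index is tiny; arrays are then loaded on demand via np.load."""
    with open(os.path.join(data_dir, 'index.pkl'), 'rb') as f:
        return pickle.load(f)

# index = load_index('arrays')
# arr = np.load(index['some_key'])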

Greg Hewgill
  • Thanks for the fast answer. I actually have each key in a single file at the moment, but I want to change this, as loading all these files takes ~15 min. – Framester Feb 10 '12 at 18:42
  • @Framester: What tells you that it is slow because of the number of files instead of their size? – Nobody moving away from SE Feb 10 '12 at 18:43
  • @Framester: Investigate using memory-mapped files ([`mmap` module](http://docs.python.org/library/mmap.html)). Then there is almost no cost to "load" the data; it's all accessed on demand. You may need a 64-bit OS to mmap all your data, though (a sketch follows after these comments). – Greg Hewgill Feb 10 '12 at 18:46
  • @Nobody: Isn't accessing a lot of small files slower than accessing one large file with the same contents? – Framester Feb 10 '12 at 18:47
  • @Framester: Yes, it is slower, but you have to take into account how much. If you need 15 min to read the whole data and only 1 s for going through the files, then it does not pay to think about moving to one file. If it is the other way round, then of course one should ^^ – Nobody moving away from SE Feb 10 '12 at 18:50
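
One way to get the on-demand behaviour mentioned above without raw mmap calls is numpy's own memory-map support in np.load; a minimal sketch (the file name and dtype are assumptions):

import numpy as np

# write one 1000x1000 array as a plain .npy file
np.save('big_key.npy', np.zeros((1000, 1000), dtype=np.int32))

# map it instead of reading it: data is paged in only when it is accessed
arr = np.load('big_key.npy', mmap_mode='r')
print(arr[500, 500])  # touches only the needed page(s), not the whole file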

How about numpy.savez? It can save multiple numpy arrays, and since the format is binary it should be faster than pickle.

tkf
  • Pickled data is binary too, as long as you use something other than protocol 0 (which is ASCII). And for speed, use [`cPickle`](http://docs.python.org/library/pickle.html#module-cPickle) (a sketch follows after these comments). – Greg Hewgill Feb 10 '12 at 19:51
  • @GregHewgill I knew cPickle but didn't know you could have a binary pickle. Thanks! // Not meaning to ruin a good follow-up, but I think using savez is faster in this case because it is specialized for saving numpy arrays. Well, I guess it will also depend on size, so a benchmark is needed to decide, of course. – tkf Feb 10 '12 at 21:59
  • Yes, `savez` is appropriate for this case. Just wanted to make you aware of the different pickle protocols. – Greg Hewgill Feb 10 '12 at 22:05
  • Thanks, but I ran into trouble using filenames as keys: http://stackoverflow.com/q/9258069/380038 – Framester Feb 13 '12 at 09:24
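
A minimal sketch of the binary-pickle route described in the comments above (on Python 3 the C implementation is used automatically; on Python 2 you would import cPickle for the same API):

import pickle
import numpy as np

arrs = {'a': np.arange(1000 * 1000, dtype=np.int32).reshape(1000, 1000)}

# protocol 0 is ASCII; any higher protocol is binary and much more compact
with open('arrs.pkl', 'wb') as f:
    pickle.dump(arrs, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('arrs.pkl', 'rb') as f:
    arrs_back = pickle.load(f)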

Google's Protobuf specification is designed to keep serialization overhead extremely low. I'm not sure how fast it is at (de)serializing, but being Google, I imagine it's not shabby.

Ivo

You can use PyTables (http://www.pytables.org/moin) and save your data in HDF5 format.
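
A minimal sketch with the current PyTables API, assuming the dictionary keys are identifier-like strings that are valid HDF5 node names (file name and compression settings are just illustrative):

import numpy as np
import tables  # PyTables

arrs = {'a': np.arange(1000 * 1000, dtype=np.int32).reshape(1000, 1000)}

# zlib-compressed, chunked arrays keep the file small
filters = tables.Filters(complevel=5, complib='zlib')

with tables.open_file('arrs.h5', mode='w') as f:
    for key, arr in arrs.items():
        # one compressed array node per dictionary key
        f.create_carray(f.root, key, obj=arr, filters=filters)

with tables.open_file('arrs.h5', mode='r') as f:
    a = f.root.a[:]  # read one array back by key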

HYRY