
I have thousands of tuples of long (8640-element) lists of integers. For example:

type(l1)
tuple

len(l1)
2

l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]

l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0] 

I am "pickling" the tuples and it seems that when the tuples are of lists the pickle file is lighter than when are of numpy arrays. I am not that new to python, but by no means I am an expert and I don't really know how the memory is administrated for different types of objects. I would have expected numpy arrays to be lighter, but this is what I obtain when I pickle different types of objects:

import pickle
import numpy as np

#the elements of the tuple converted to numpy arrays (default integer dtype)
l2 = [np.asarray(l1[i]) for i in range(len(l1))]
l2
[array([ 0, 31, 23, ...,  2,  0,  0]), array([ 0,  0, 11, ...,  1,  0,  0])]

#integers in the array are small enough to be saved in two bytes
l3 = [np.asarray(l1[i], dtype='u2') for i in range(len(l1))]
l3
[array([ 0, 31, 23, ...,  2,  0,  0], dtype=uint16),
 array([ 0,  0, 11, ...,  1,  0,  0], dtype=uint16)]
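
Out of curiosity, this is roughly how I compare the in-memory sizes of the same data (only a rough check: sys.getsizeof on a list counts the list object and its pointer slots but not the integer objects it refers to, while an array's nbytes is just its raw data buffer):

#rough in-memory sizes of the first element of the tuple in each representation
import sys

sys.getsizeof(l1[0])   #the list object and its pointer slots only
l2[0].nbytes           #raw buffer of the default-dtype array: 8640 * 8 bytes on most 64-bit builds
l3[0].nbytes           #raw buffer of the uint16 array: 8640 * 2 bytes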

#the original tuple of lists
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)

#the list of numpy arrays (default dtype)
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)

#the list of numpy arrays with unsigned 2-byte integers
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)

and when I check the size of the files:

$ du -h file1.pkl
72K     file1.pkl

$ du -h file2.pkl
540K    file2.pkl

$ du -h file3.pkl
136K    file3.pkl

So even when the integers are stored in two bytes, file1 is lighter than file3. I would prefer to use arrays because unpickling them (and processing them) is much faster than with lists. However, I am going to be storing lots of these tuples (in a pandas DataFrame), so I would also like to optimise memory as much as possible.
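
A quicker way to compare sizes without writing files is to look at len(pickle.dumps(...)) directly; a small sketch (note that with no protocol argument, pickle.dumps on Python 2 uses protocol 0, the old ASCII format, which is also what the files above were written with):

#serialised size in bytes, without touching the disk
len(pickle.dumps(l1))   #the tuple of plain lists
len(pickle.dumps(l2))   #the list of default-dtype arrays
len(pickle.dumps(l3))   #the list of uint16 arrays

#the uint16 arrays again with the newest binary protocol, for comparison
len(pickle.dumps(l3, protocol=pickle.HIGHEST_PROTOCOL))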

The way I need this to work is: given a list of tuples, I do:

#list of pickled strings from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]

#existing pandas data frame: insert the pickled tuples as a new column
df['tuples'] = tpl_pkl
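
To read one of them back I just unpickle the cell again (just a sketch):

#pickle.loads reverses pickle.dumps and returns the original tuple of lists
first_tuple = pickle.loads(df['tuples'].iloc[0])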

Overall, my question is: is there a reason why numpy arrays take more space than lists after pickling them to a file?

Maybe if I understand the reason I can find an optimal way of storing arrays.

Thanks in advance for your time.

Javier
  • I don't know exactly how these are stored; however, I did have a similar issue and solved it using the hickle package (https://github.com/telegraphic/hickle). It saves numpy arrays to file in HDF5 format, has an option to gzip them, and ended up much smaller. Another great thing is that it uses the same syntax as pickle. – ijmarshall Sep 09 '15 at 17:10
  • why are you storing serialized objects in a dataframe? in other words, why are you storing the pickle dump in a dataframe? – gabe Sep 09 '15 at 17:24
  • @ijmarshall: thanks! I will have a look at that package – Javier Sep 09 '15 at 20:31
  • @gabe: these arrays have to be "linked" to a series of ids and some other features (the other columns of the data frame), if possible in an indexed structure. Eventually, the dataset is going to be queried, joined, etc. I thought about using pySQLite and following the same format, i.e. storing the pickle.dumps output in a table in an SQLite database. However, after some reading, I thought pandas could already be enough. Anyway, at this stage, any other idea is more than welcome. – Javier Sep 09 '15 at 20:35

2 Answers


If you want to store numpy arrays on disk you shouldn't be using pickle at all. Investigate numpy.save() and its kin.

If you are using pandas then it too has its own methods. You might want to consult this article or the answer to this question for better techniques.
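
For the numpy route, a rough sketch (untested against your data; np.save/np.load use numpy's binary .npy format, and np.savez_compressed bundles several arrays into one compressed .npz archive):

import numpy as np

arr0 = np.asarray(l1[0], dtype='u2')   #the uint16 arrays from your question
arr1 = np.asarray(l1[1], dtype='u2')

#one array per .npy file
np.save('row_0.npy', arr0)
arr0_back = np.load('row_0.npy')

#several arrays in a single compressed .npz archive
np.savez_compressed('rows.npz', first=arr0, second=arr1)
archive = np.load('rows.npz')
first_back = archive['first']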

holdenweb
  • thanks holdenweb. The problem is that each of these will be a row in a pandas data frame. I am doing `row_n = pickle.dumps(tuple[n])` and then inserting this into a pandas data frame, which I don't think is possible with numpy.save(). – Javier Sep 09 '15 at 17:15
  • Updated with a couple of useful Pandas links. Good luck! – holdenweb Sep 09 '15 at 17:19
  • I have edited my question in case it is of any use. I will have a look at those links. Thanks! – Javier Sep 09 '15 at 17:21
  • Someone else may provide better answers, but given the size of your data I suspect space may be more important than speed – holdenweb Sep 09 '15 at 17:24

If the data you provided is close to accurate, this seems like premature optimization to me, as that is really not a lot of data, and supposedly only integers. I am currently pickling a file with millions of entries of strings and integers; at that scale you can start worrying about optimization. In your case the difference likely does not matter much, especially if this is run manually and does not feed into a web app or similar.

step21