I have thousands of tuples, each holding long lists (8640 elements) of integers. For example:
type(l1)
tuple
len(l1)
2
l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]
l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0]
I am pickling the tuples, and it seems that when the tuples contain lists the pickle file is smaller than when they contain numpy arrays. I am not that new to Python, but I am by no means an expert, and I don't really know how memory is managed for different types of objects. I would have expected numpy arrays to be smaller, but this is what I get when I pickle the different types of objects:
import pickle
import numpy as np

# elements of the tuple converted to numpy arrays
l2 = [np.asarray(x) for x in l1]
l2
[array([ 0, 31, 23, ..., 2, 0, 0]), array([ 0, 0, 11, ..., 1, 0, 0])]
# the integers are small enough to be stored in two bytes each
l3 = [np.asarray(x, dtype='u2') for x in l1]
l3
[array([ 0, 31, 23, ..., 2, 0, 0], dtype=uint16),
array([ 0, 0, 11, ..., 1, 0, 0], dtype=uint16)]
# the original tuple of lists
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)
# list of numpy arrays
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)
# list of numpy arrays with the integers as unsigned 2-byte values
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)
When I check the sizes of the files:
$ du -h file1.pkl
72K file1.pkl
$ du -h file2.pkl
540K file2.pkl
$ du -h file3.pkl
136K file3.pkl
So even when the integers are stored in two bytes each, file1 is smaller than file3. I would prefer to use arrays because unpickling arrays (and processing them) is much faster than with lists. However, I am going to be storing lots of these tuples (in a pandas data frame), so I would also like to keep the storage size as small as possible.
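To make the comparison easier to reproduce, here is a minimal sketch that measures the pickled sizes in memory with `len(pickle.dumps(...))` instead of writing files. The data is synthetic (the `make_list` helper and its mostly-zero values are assumptions standing in for my real lists), so the numbers will not match the `du` output above, which also reports disk usage rather than exact byte counts.

```python
import pickle
import random

import numpy as np

random.seed(0)  # fixed seed so the synthetic data is repeatable


def make_list(n=8640):
    """Hypothetical stand-in for one of my lists: small, mostly-zero integers."""
    return [random.randint(1, 40) if random.random() < 0.2 else 0
            for _ in range(n)]


# Same three representations as in the question.
l1 = (make_list(), make_list())               # tuple of lists
l2 = [np.asarray(x) for x in l1]              # arrays with the default integer dtype
l3 = [np.asarray(x, dtype='u2') for x in l1]  # arrays of unsigned 2-byte integers

# Size in bytes of each pickled object, using the interpreter's default protocol.
sizes = {
    'tuple of lists': len(pickle.dumps(l1)),
    'default-dtype arrays': len(pickle.dumps(l2)),
    'uint16 arrays': len(pickle.dumps(l3)),
}

# Raw size of the uint16 data itself, for comparison with its pickled size.
raw_bytes = sum(a.nbytes for a in l3)

for name, size in sizes.items():
    print("{}: {} bytes".format(name, size))
print("raw uint16 data: {} bytes".format(raw_bytes))
```

Comparing each pickled size against `raw_bytes` shows how much overhead the pickling adds on top of the array data itself.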
What I need to do, given a list of tuples, is:
# list of pickled strings, one per tuple, from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]
#existing pandas data frame. Inserting new column
df['tuples'] = tpl_pkl
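For completeness, the per-tuple step can be sketched without pandas as follows. The tiny `listoftuples` here is a made-up stand-in (my real lists have 8640 integers each), and the pickled strings are kept in a plain list rather than a data frame column.

```python
import pickle

# Hypothetical small stand-in for listoftuples; the real lists are much longer.
listoftuples = [
    ([0, 31, 23, 0, 0], [0, 0, 11, 16, 24]),
    ([1, 0, 0, 2, 0], [0, 5, 0, 0, 0]),
]

# One pickled string per tuple, as would be stored in the data frame column.
tpl_pkl = [pickle.dumps(t) for t in listoftuples]

# Size in bytes of each pickled entry; this is what I would like to minimise.
pkl_sizes = [len(p) for p in tpl_pkl]

# Unpickling each entry gives back the original tuple of lists.
restored = [pickle.loads(p) for p in tpl_pkl]

print(pkl_sizes)
print(restored == listoftuples)
```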
Overall, my question is: why do numpy arrays take more space than lists after being pickled to a file?
If I understand the reason, maybe I can find a better way of storing the arrays.
Thanks in advance for your time.