
I have a 3D array that only contains the values 0, 1 and 2, and I want to translate those values to 0, 128 and 255 respectively. I have looked around, and this thread (Translate every element in numpy array according to key) seems like the way to go.

So I tried implementing it and it worked. The relevant part of the code is below (I read and write data from and to h5 files, but I doubt that's important; I mention it just in case):

import h5py
import numpy as np

#fetch dataset from disk
f = h5py.File('input/A.h5','r') #size = 572kB

#read and transform array
array = f['data'][()]  #type = numpy.ndarray
my_dict = {1:128, 2:255, 0:0}
array = np.vectorize(my_dict.get)(array)

#write translated dataset to disk
h5 = h5py.File('output/B.h5', 'w') #final size = 4.5MB
h5.create_dataset('data', data=array)  
h5.close()

The problem is that the input file (A.h5) is 572kB, while the output file (B.h5) is 8 times as large (4.5MB).

What is going on here? I have another array with the same dimensions, full of values from 0 to 255, and it is also 572kB, so the numbers being larger shouldn't matter. My first guess was that maybe Python was creating objects instead of ints; I tried casting to int, but the size stayed the same.
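
One way to check what the translation step actually produces is to inspect the array before writing it out (a minimal sketch, using the variables from the code above):

#inspect the translated array before writing it to disk
print(array.dtype)   #element type after the translation
print(array.nbytes)  #total in-memory size in bytes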

Side note: if I transform the data with 3 nested for loops (roughly like the sketch below), the size stays 572kB, but the code is much slower.
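
For reference, the loop version would look something like this (a sketch; it assumes the 3D array and my_dict from above and writes into a fresh uint8 array):

#slow triple-loop translation; the output stays at 1 byte per element
translated = np.empty(array.shape, dtype=np.uint8)
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        for k in range(array.shape[2]):
            translated[i, j, k] = my_dict[array[i, j, k]]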

– Skum
  • I suspect another answer in the linked question is faster: http://stackoverflow.com/a/29055933/901925. Even with that you'll want to watch the result dtype. You can also specify dtype in the create_dataset statement. – hpaulj Apr 25 '17 at 12:59

2 Answers


You're likely getting a factor of 8 by writing your array back as int64 (8 bytes per element) where the original array is stored as uint8 (1 byte per element). You could try:

array = np.vectorize(my_dict.get)(array).astype(np.uint8)

and then saving to h5...

As @Jaime points out, you save an array copy by telling vectorize what datatype you want straight off:

array = np.vectorize(my_dict.get, otypes=[np.uint8])(array)
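
To confirm the diagnosis, you can write the same data with both dtypes and compare the files on disk (a sketch with toy data standing in for the real array; the file names are placeholders):

import os
import numpy as np
import h5py

#toy stand-in: 100^3 values drawn from {0, 1, 2}, translated to uint8
data = np.random.randint(0, 3, (100, 100, 100))
translated = np.vectorize({0:0, 1:128, 2:255}.get, otypes=[np.uint8])(data)

#same values, two element sizes on disk
with h5py.File('as_int64.h5', 'w') as f:
    f.create_dataset('data', data=translated.astype(np.int64))
with h5py.File('as_uint8.h5', 'w') as f:
    f.create_dataset('data', data=translated)

print(os.path.getsize('as_int64.h5'))  #roughly 8x the uint8 file
print(os.path.getsize('as_uint8.h5'))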
– xnx
  • Thanks, I actually checked the data type of the elements after writing the array to disk by reading it back later on, and it was uint8, which is why I didn't consider doing this. But it was probably stored as int64 and translated back to uint8 after loading the array. – Skum Apr 25 '17 at 10:53
  • `np.vectorize(my_dict.get, otypes=[np.uint8])` will achieve the same and spare you an array copy, which is what `.astype()` ends up doing. – Jaime Apr 25 '17 at 10:53
  • Thanks @Jaime – I wrote my answer in a hurry on the train. I'll edit in your improvement. – xnx Apr 25 '17 at 11:19

While the accepted answer in the linked SO question uses np.vectorize, it isn't the fastest choice, especially in a case like this where you are simply replacing 3 small values: 0, 1 and 2.

A new answer in that SO question gives a simple and fast indexing alternative:

https://stackoverflow.com/a/29055933/901925

In [508]: x=np.random.randint(0,3,(100,100,100))
In [509]: x.size
Out[509]: 1000000
In [510]: x1=np.vectorize(my_dict.get, otypes=['uint8'])(x)
In [511]: arr=np.array([0,128,255],np.uint8)
In [512]: x2=arr[x]
In [513]: np.allclose(x1,x2)
Out[513]: True

Compare their times:

In [514]: timeit x1=np.vectorize(my_dict.get, otypes=['uint8'])(x)
10 loops, best of 3: 161 ms per loop
In [515]: timeit x2=arr[x]
100 loops, best of 3: 3.48 ms per loop

The indexing approach is much faster: arr[x] performs the whole lookup as one vectorized operation, while np.vectorize still calls my_dict.get once per element from Python. It works here because the values 0, 1 and 2 can serve directly as indices into the lookup array arr.

There are a couple of things about np.vectorize that users often miss.

  • the speed disclaimer: it does not promise significant speed gains over explicit iteration. It does, though, make iterating over multidimensional arrays easier.

  • without otypes, it determines the dtype of the return array from a trial calculation. Sometimes that default causes problems. Here specifying otypes is just a convenience, giving you the correct dtype right away (see the sketch after this list).
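
For example (a sketch; the default dtype inferred without otypes is platform-dependent):

import numpy as np

my_dict = {0:0, 1:128, 2:255}
x = np.random.randint(0, 3, (3, 3))

#default: dtype inferred from a trial call; my_dict.get returns a
#Python int, so the result is typically int64
print(np.vectorize(my_dict.get)(x).dtype)                     #e.g. int64
#with otypes the output is created as uint8 directly, no extra copy
print(np.vectorize(my_dict.get, otypes=[np.uint8])(x).dtype)  #uint8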

As a matter of curiosity, here's the time for a list-comprehension approach:

In [518]: timeit x3=np.array([my_dict[i] for i in x.ravel()]).reshape(x.shape)
1 loop, best of 3: 556 ms per loop

h5py lets you specify the dtype when saving a dataset. Notice the type when I save arrays in different ways.

In [529]: h5.create_dataset('data1',data=x1, dtype=np.uint8)
Out[529]: <HDF5 dataset "data1": shape (100, 100, 100), type "|u1">
In [530]: h5.create_dataset('data2',data=x1, dtype=np.uint16)
Out[530]: <HDF5 dataset "data2": shape (100, 100, 100), type "<u2">
In [531]: h5.create_dataset('data3',data=x1)
Out[531]: <HDF5 dataset "data3": shape (100, 100, 100), type "|u1">
In [532]: x.dtype
Out[532]: dtype('int32')
In [533]: h5.create_dataset('data4',data=x)
Out[533]: <HDF5 dataset "data4": shape (100, 100, 100), type "<i4">
In [534]: h5.create_dataset('data5',data=x, dtype=np.uint8)
Out[534]: <HDF5 dataset "data5": shape (100, 100, 100), type "|u1">

So even if you hadn't specified uint8 in vectorize, you could still have saved the array with that type.
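
Applied to the code in the question, that would look something like this (a sketch; it keeps the original names and paths):

#write the translated array as 1-byte unsigned ints, whatever
#dtype the translation step happened to produce
with h5py.File('output/B.h5', 'w') as h5:
    h5.create_dataset('data', data=array, dtype=np.uint8)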

– hpaulj