Populating values in numpy compound dataset is slow; why?

Question

I have the following numpy compound datatype:

mytype = numpy.dtype([('x', 'f8'),
                      ('y', 'f8'),
                      ('z', 'f8')])

However, when I try to fill a vector of this type, it is 60x slower than filling three separate arrays:

#!/usr/bin/env python3

import time
import random
import numpy

mytype = numpy.dtype([('x', 'f8'),
                      ('y', 'f8'),
                      ('z', 'f8')])

size = 1000000
v = numpy.empty(shape=(size,), dtype=mytype)

print("Start inserting into compound type:")
start = time.time()
for i in range(size):
    v[i]['x'] = random.random()
    v[i]['y'] = random.random()
    v[i]['z'] = random.random()

end = time.time()
print("Done inserting into compound type: Time elapsed: {}.\n".format(end - start))


x = numpy.empty(shape=(size,), dtype='f8')
y = numpy.empty(shape=(size,), dtype='f8')
z = numpy.empty(shape=(size,), dtype='f8')

print("Inserting into three arrays:")
start = time.time()
for i in range(size):
    x[i] = random.random()
    y[i] = random.random()
    z[i] = random.random()
end = time.time()
print("Done inserting into three arrays. Time elapsed: {}".format(end - start))

print("Reading from compound type:")

start = time.time()
for i in range(size):
    x1 = v[i]['x']
    y1 = v[i]['y']
    z1 = v[i]['z']

end = time.time()
print("Done reading compound type: Time elapsed: {}.\n".format(end - start))

print("Reading from three arrays:")
start = time.time()
for i in range(size):
    x1 = x[i]
    y1 = y[i]
    z1 = z[i]
end = time.time()
print("Done reading three arrays. Time elapsed: {}.\n".format(end - start))

In addition, I find that reading from numpy compound datatypes is 70x slower than reading from the corresponding separate arrays. How can I increase the performance of numpy compound datatypes?

Edit: After cloning numpy from master, this performance bug went away.

hpaulj · Accepted Answer · 2015-11-04T21:57:43.330

3

Yes, working element by element with structured arrays will be slower, so you should perform whole-array operations where possible:

import numpy as np

# mytype is the structured dtype defined in the question
v = np.empty(10, dtype=mytype)
v['x'] = np.random.random(10)
v['y'] = np.random.random(10)
v['z'] = np.random.random(10)

This will be faster than your element-by-element iteration, but it will still be slower than the 2d array equivalent:

v = np.random.random((10,3))

You can also assign or access values record by record:

for i in range(10):
    v[i] = np.random.random(3)

But if the number of rows is much larger than the number of fields (the typical case), it is better to assign values by field.
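As a rough illustration of the difference, here is a minimal timing sketch that fills the same structured dtype once record by record and once field by field. The array size and the variable names (`v_loop`, `v_field`, `loop_time`, `field_time`) are my own choices for the example, and the timings will vary by machine and numpy version.

```python
import random
import time

import numpy as np

mytype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
size = 100_000  # assumption: smaller than the question's 1,000,000 to keep the run short

# Record by record: every v_loop[i] creates a temporary record object in Python.
v_loop = np.empty(size, dtype=mytype)
start = time.perf_counter()
for i in range(size):
    v_loop[i]['x'] = random.random()
    v_loop[i]['y'] = random.random()
    v_loop[i]['z'] = random.random()
loop_time = time.perf_counter() - start

# Field by field: three whole-array assignments, no Python-level loop.
v_field = np.empty(size, dtype=mytype)
start = time.perf_counter()
v_field['x'] = np.random.random(size)
v_field['y'] = np.random.random(size)
v_field['z'] = np.random.random(size)
field_time = time.perf_counter() - start

print("record by record: {:.4f} s".format(loop_time))
print("field by field:   {:.4f} s".format(field_time))
```

If a loop is unavoidable, indexing the field first (`v['x'][i]`) avoids building the intermediate record that `v[i]['x']` creates.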

If you want fast operations on arrays, and all values are of the same type, stick with the nd arrays. Structured arrays are more useful when the field types differ, such as a mix of strings, ints and floats.

If all the elements of the structured array are of the same dtype (as in your case, all floats) it is possible to map back and forth between structured dtype and the 2d array, giving the best of both worlds. I've discussed that in other SO questions.
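A minimal sketch of that mapping, assuming a packed dtype whose fields are all `f8` (as in the question). The `view` trick only works under that assumption. The `numpy.lib.recfunctions` helpers are more general, but they exist only in reasonably recent numpy releases, so treat their availability as an assumption about your installation.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

mytype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])

# Structured array with 5 records of 3 floats each.
v = np.zeros(5, dtype=mytype)

# Reinterpret the same memory as a plain (5, 3) float array.
# Valid only because every field is 'f8' and the dtype has no padding.
as2d = v.view('f8').reshape(-1, 3)

# Fill through the 2d view; the structured array sees the same values.
as2d[:] = np.random.random((5, 3))
print(v['y'])
print(as2d[:, 1])

# More general conversions via recfunctions.
plain = rfn.structured_to_unstructured(v)                    # shape (5, 3)
again = rfn.unstructured_to_structured(plain, dtype=mytype)  # shape (5,)
```

Fast numeric work can then go through the 2d form, while field access by name stays available on the structured form.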

edited Nov 04 '15 at 21:57

answered Nov 04 '15 at 21:50

hpaulj


I suppose I picked too symmetric an example, as IRL my compound dataset does have different types. However, I would expect vectors of C-style structs to be just as fast as multiple arrays, so why is numpy different? – user14717 Nov 05 '15 at 02:47
It doesn't use `c` structs, at least not directly. The data storage may be as compact (a simple array of bytes) but it still has to move the data in and out of Python objects (including tuples). My guess is that with a more general dtype, more of the processing has to be at the Python level, and less in compiled code. – hpaulj Nov 05 '15 at 03:10
Hmm, well it looks like `numpy.dtype` has some support for specifying byte offsets . . . I'll give that a try . . . – user14717 Nov 05 '15 at 14:06
I explored offsets a bit in my answer to: http://stackoverflow.com/questions/26349116/no-binary-operators-for-structured-arrays-in-numpy – hpaulj Nov 05 '15 at 17:19
I have attempted to specify byte-offsets in a hail-mary attempt to improve speed. No dice. I guess I'll just have to use separated vectors . . . – user14717 Nov 12 '15 at 21:03
