
The following code constructs a NumPy array with a structured dtype:

import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32)
])

arr = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9]
]).astype(dt)

The expected result of arr would be:

>>> arr
array([[ 0, 20,  3.],
       [ 1, 21,  2.],
       [ 2, 23,  8.],
       [ 3, 26,  5.],
       [ 4, 31,  9.]])

>>> arr[0]
array([ 0, 20,  3.])

But what the code above actually creates is this:

>>> arr
array([[( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
       [( 1,  1,  1.), (21, 21, 21.), ( 2,  2,  2.)],
       [( 2,  2,  2.), (23, 23, 23.), ( 8,  8,  8.)],
       [( 3,  3,  3.), (26, 26, 26.), ( 5,  5,  5.)],
       [( 4,  4,  4.), (31, 31, 31.), ( 9,  9,  9.)]],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

>>> arr[0]
array([( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

Why is NumPy creating a version of every value for every data type instead of mapping each column to its own data type (and only this one)? I'm guessing that I did something wrong there. Is there a way to get to the result I was expecting?

yatu
Jivan

1 Answer


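The issue is that a structured array has to be created from a list of tuples, one tuple per record (row). Calling astype(dt) on a plain 2-D array instead casts every scalar to the whole structured dtype, so each value is repeated once per field. The list-of-tuples form is described in the Structured Datatype Creation section of the NumPy documentation, where the dtype itself is given as a list of tuples, one tuple per field.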
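A minimal sketch of the difference, assuming the same dt as in the question (the names rows, broadcast and records are only illustrative): casting a plain 2-D array keeps its shape and copies each scalar into every field, while one tuple per row gives one record per row.

```python
import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])
rows = [[0, 20, 3], [1, 21, 2], [2, 23, 8]]

# Casting a plain 2-D array: every scalar is copied into all three fields,
# so the result is still 3 x 3, with one record per original scalar.
broadcast = np.array(rows).astype(dt)

# One tuple per row: each tuple becomes a single record.
records = np.array([tuple(r) for r in rows], dtype=dt)

print(broadcast.shape)
print(records.shape)
print(records["volume"])
```

Indexing records["volume"] then returns only that column, with its own data type.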

So what you can do is turn the plain 2-D array (arr before the astype(dt) call) into an iterable of tuples (zip is convenient here) and build the structured array from it with np.fromiter, specifying dt as the dtype:

>>> np.fromiter(zip(*arr.T), dtype=dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

Another (lesser-known) approach, mentioned by @hpaulj in the comments, is np.lib.recfunctions.unstructured_to_structured, which constructs the structured array directly from arr and the dtype object. The submodule may need an explicit import:

>>> from numpy.lib import recfunctions as rfn
>>> rfn.unstructured_to_structured(arr, dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
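As a self-contained sketch of that approach (the variable names plain, structured and back are only illustrative, and the round trip through structured_to_unstructured is an addition not in the original answer):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])
plain = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9],
])

# Map the last axis of the plain array onto the fields of dt.
structured = rfn.unstructured_to_structured(plain, dt)
print(structured["timestamp"])

# The reverse helper turns the records back into a plain 2-D array,
# which restores positional column access such as back[:, 0].
back = rfn.structured_to_unstructured(structured)
print(back[:, 0])
```

The round trip is one way to keep typed fields for storage while still slicing columns by position when needed.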

Or, based on this other post, you can create a record array, an ndarray subclass that is used much like a structured array and comes with several helper functions. One of them, np.core.records.fromarrays (also exposed as np.rec.fromarrays), builds the array from a sequence of columns:

>>> np.core.records.fromarrays(arr.T,
...                            names='index, timestamp, volume',
...                            formats='<i4, <i4, <f4')
rec.array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.),
           (4, 31, 9.)],
          dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

Or, to create it from the np.dtype object:

names, dtypes = zip(*dt.descr)
np.core.records.fromarrays(arr.T,
                           names=', '.join(names),
                           formats=', '.join(dtypes))
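A shorter variant, sketched here on the assumption that passing the dtype object through the dtype parameter of fromarrays is acceptable for your use case, skips rebuilding the names and formats strings. It uses the np.rec.fromarrays spelling, and plain and rec are illustrative names:

```python
import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])
plain = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9],
])

# plain.T yields one 1-D array per column; dtype=dt names and types them.
rec = np.rec.fromarrays(plain.T, dtype=dt)

# Record arrays allow both key access and attribute access to fields.
print(rec["index"])
print(rec.volume)
```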

Timings comparing the mentioned methods, and some other possible approaches:

a = np.concatenate([arr]*1000, axis=0)

%%timeit 
np.core.records.fromarrays(a.T, 
                           names='index, timestamp, volume', 
                           formats = '<i4, <i4, <f4')
# 57.9 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.lib.recfunctions.unstructured_to_structured(a, dt)
# 79.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.fromiter(zip(*a.T), dtype=dt)
# 2.1 ms ± 69.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.fromiter(map(tuple, a), dtype=dt)
# 6.34 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.array(list(zip(*a.T)), dtype=dt)
# 2.17 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
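
Outside IPython, a comparison along these lines can be run with timeit from the standard library. This is only a sketch: the candidates dictionary and the repetition count are arbitrary choices, np.rec.fromarrays with dtype=dt stands in for the names/formats call above, and absolute numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])
arr = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9],
])
a = np.concatenate([arr] * 1000, axis=0)  # 5000 rows, as in the timings above

candidates = {
    "rec.fromarrays": lambda: np.rec.fromarrays(a.T, dtype=dt),
    "unstructured_to_structured": lambda: rfn.unstructured_to_structured(a, dt),
    "fromiter(zip)": lambda: np.fromiter(zip(*a.T), dtype=dt),
    "array(list(zip))": lambda: np.array(list(zip(*a.T)), dtype=dt),
}

repeats = 20
for name, fn in candidates.items():
    # Average wall-clock time per call over a small number of repetitions.
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{name:30s} {seconds * 1e6:10.1f} us per call")
```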
yatu
  • ah! but then you'd lose the ability to do things like isolate columns quickly with `arr[:, 0]` for instance – Jivan Jun 17 '20 at 11:32
  • I mean, you can use `arr["index"]` but I'm wondering if, when using tuples like this, performance would be equivalent as the "pure array" form – Jivan Jun 17 '20 at 11:34
  • You can index on the names, such as `a['index']` @jivan – yatu Jun 17 '20 at 11:34
  • Well I'm unsure tbh until what point working with structured arrays is optimized in numpy as opposed to regular ndarrays, but I'd guess that performance does worsen @jivan – yatu Jun 17 '20 at 11:36
  • yes, that's what I'm thinking as well. Gonna stick to regular ndarrays for now, even if that means having columns which should be `np.int8` being cast into `np.float64`... – Jivan Jun 17 '20 at 11:37
  • You can also use `numpy.lib.recfunctions.unstructured_to_structured` – hpaulj Jun 17 '20 at 14:14
  • Thanks @hpaulj , first time I see this one, but seems the most straight forward way of doing this. Added the timings fyi – yatu Jun 17 '20 at 15:13
  • `TypeError: Cannot change data-type for object array.` when using `unstructured_to_structured`. – gargoylebident Apr 12 '21 at 22:07
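
Regarding the TypeError in the last comment: one plausible cause, and this is an assumption about that commenter's data, is an input array with dtype=object, which NumPy cannot reinterpret as a different dtype. A minimal sketch of a workaround is to convert to a numeric dtype first (obj and numeric are hypothetical names):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])

# Hypothetical input: a 2-D array that ended up with dtype=object,
# for example after mixing Python ints and floats in a generic container.
obj = np.array([[0, 20, 3.5], [1, 21, 2.0]], dtype=object)

# Convert to a plain numeric dtype before building the structured array.
numeric = obj.astype(np.float64)
structured = rfn.unstructured_to_structured(numeric, dt)
print(structured)
```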