I have a numpy structured array with a mixed dtype (i.e., floats, ints, and strings). I want to select some of the columns of the array (all of which contain only floats) and then get the sum, by column, of the rows, as a standard numpy array. The initial array takes a form comparable to:

import numpy as np

some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], 
                     dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

For this example, I'd like to take the sum of columns A and B, yielding np.array([7.5, 11.15]). With numpy ≤1.13, I could do that as follows:

get_cols = ['A', 'B']
desired_sum = np.sum(some_data[get_cols].view(('<f8', len(get_cols))), axis=0)

With the release of numpy 1.14, this method fails with `ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged`, a result of the changes numpy 1.14 made to the handling of structured arrays. (User bbengfort commented on the FutureWarning about this change in this answer.)

In light of these changes to structured arrays, how can I obtain the desired sum from the structured array subset?

trynthink
  • a 'regression' check https://github.com/numpy/numpy/issues/10387 and other threads for interim solutions until a 'fix' is in place – NaN Jan 15 '18 at 16:54
  • @NaN: This particular problem is different from that one - It's not going to be fixed unless a separate issue is reported. In this case, it should be possible to sum the array without incurring a copy – Eric Jan 15 '18 at 23:31
  • Issue opened [here](https://github.com/numpy/numpy/issues/10409) – Eric Jan 16 '18 at 00:21
  • What version of numpy were you using? What is x in 1.14.x? – Eric Jun 10 '18 at 19:29
  • @Eric, the error arises with numpy 1.14.0. In numpy 1.14.x, where x ≥ 1, a FutureWarning is triggered instead. – trynthink Jun 13 '18 at 17:58

1 Answer

In [165]: some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
     ...:                      
In [166]: get_cols = ['A','B']
In [167]: some_data[get_cols]
Out[167]: 
array([( 3.5,  2.15), ( 2.8,  5.3 ), ( 1.2,  3.7 )],
      dtype=[('A', '<f8'), ('B', '<f8')])

Simply reading the field values is fine. Viewing the multi-field selection with a different dtype, however, produces a warning in 1.13:

In [168]: some_data[get_cols].view(('<f8', len(get_cols)))
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array. 

This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
  #!/usr/bin/python3
Out[168]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

With the recommended copy, no warning:

In [169]: some_data[get_cols].copy().view(('<f8', len(get_cols)))
Out[169]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
In [171]: np.sum(_, axis=0)
Out[171]: array([  7.5 ,  11.15])

In your original array,

dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

An A, B slice that is a view would have the two f8 items interspersed with the U20 items in memory. Changing the view dtype of such a mix is problematic. That's why working with a copy is more reliable.
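One way to sidestep the dtype reinterpretation entirely is to pull each field out on its own and stack the results. The following is a minimal sketch, reusing the question's `some_data` and `get_cols`; it always copies the selected columns, so it is a workaround rather than a zero-copy solution.

```python
import numpy as np

some_data = np.array(
    [('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
    dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
get_cols = ['A', 'B']

# Indexing with a single field name gives a plain float64 array.
# Stacking those arrays copies the values into an (n_rows, n_cols) array,
# so no view across mixed fields is ever taken.
stacked = np.stack([some_data[name] for name in get_cols], axis=1)
desired_sum = stacked.sum(axis=0)
print(desired_sum)
```

Because only single-field indexing is used, this should not depend on how a given numpy version treats multi-field selections.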

Since U20 takes up 4*20 = 80 bytes, the total itemsize is 96, a multiple of 8. We can view the whole array as f8, reshape, and throw away the columns that belong to the U20 field:

In [183]: some_data.view('f8').reshape(3,-1)[:,-2:]
Out[183]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

It's not very pretty and I don't recommend it, but it may give some insight into how structured data is arranged.
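To see where those numbers come from, the record layout can be read directly off the dtype. This is a small sketch using only the dtype from the question; the `offsets` dictionary is just an illustrative name.

```python
import numpy as np

dt = np.dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

# Total bytes per record: 20 characters * 4 bytes + 2 * 8 bytes.
print(dt.itemsize)  # 96

# dt.fields maps each field name to (dtype, byte offset[, title]).
offsets = {name: dt.fields[name][1] for name in dt.names}
print(offsets)  # {'col1': 0, 'A': 80, 'B': 88}

# Viewed as f8, each record spans 96 // 8 = 12 slots; the last two are A and B.
slots_per_record = dt.itemsize // np.dtype('f8').itemsize
print(slots_per_record)  # 12
```

The first ten f8 slots of each record are just the string bytes reinterpreted as floats, which is why the trick only works by slicing them away.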

`view` on a structured array is useful at times, but it is often tricky to use correctly.
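As an alternative to `view`, and assuming numpy 1.16 or later is available (this helper does not exist in 1.13 or 1.14), `numpy.lib.recfunctions.structured_to_unstructured` converts a selection of same-typed fields into a regular 2-D array. A minimal sketch with the question's data:

```python
import numpy as np
from numpy.lib import recfunctions as rfn  # structured_to_unstructured needs numpy >= 1.16

some_data = np.array(
    [('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
    dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
get_cols = ['A', 'B']

# Turn the two float fields into an ordinary (3, 2) float64 array,
# then sum down the rows to get one total per column.
unstructured = rfn.structured_to_unstructured(some_data[get_cols])
desired_sum = unstructured.sum(axis=0)
print(desired_sum)
```

Whether the result shares memory with `some_data` depends on the numpy version and the field layout, so it is safest to treat it as read-only unless a copy is made explicitly.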

If the 2 numeric fields are usually used together, I'd recommend a compound dtype like:

In [184]: some_data = np.array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]), ('baz', [1.2, 3.7])],
     ...:                      dtype=[('col1', '<U20'), ('AB', '<f8',(2,))])
In [185]: some_data
Out[185]: 
array([('foo', [ 3.5 ,  2.15]), ('bar', [ 2.8 ,  5.3 ]),
       ('baz', [ 1.2 ,  3.7 ])],
      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [186]: some_data['AB']
Out[186]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

genfromtxt accepts this style of dtype.
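A rough sketch of loading such a dtype with `genfromtxt` follows. The CSV text is made up to mirror the example data, and the explicit `encoding` argument assumes numpy 1.14 or later, where that parameter exists.

```python
import io
import numpy as np

# Hypothetical CSV input: one string column followed by two float columns.
text = io.StringIO("foo,3.5,2.15\nbar,2.8,5.3\nbaz,1.2,3.7\n")

# 'AB' is a single field holding a length-2 float subarray.
dt = np.dtype([('col1', '<U20'), ('AB', '<f8', (2,))])
loaded = np.genfromtxt(text, delimiter=',', dtype=dt, encoding='utf-8')

# The subarray field comes back as an ordinary (n_rows, 2) float array.
col_sums = loaded['AB'].sum(axis=0)
print(col_sums)
```

With the floats grouped in one field, the column sum needs no `view` at all.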

hpaulj
  • While adding `.copy()` in front of view eliminates the FutureWarning from numpy 1.13, that line will still trigger the same ValueError from numpy 1.14. – trynthink Jan 16 '18 at 14:25