numpy: How to add a column to an existing structured array?

Question

I have a starting array such as:

[(1, [-112.01268501699997, 40.64249414272372])
 (2, [-111.86145708699996, 40.4945008710162])]

The first column is an int and the second is a list of floats. I need to add a str column called 'USNG'.

I then create a structured numpy array, as such:

dtype = numpy.dtype([('USNG', '|S100')])
x = numpy.empty(array.shape, dtype=dtype)

I want to append the x numpy array to the existing array as a new column, so I can output some information to that column for each row.

When I do the following:

numpy.append(array, x, axis=1)

I get the following error:

'TypeError: invalid type promotion'

I've also tried vstack and hstack.

Warren Weckesser · Accepted Answer · 2014-08-21T15:27:31.877

You have to create a new dtype that contains the new field.

For example, here's a:

In [86]: a
Out[86]: 
array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])

a.dtype.descr is [('i', '<i8'), ('loc', '<f8', (2,))], i.e. a list of field descriptions (name, type and, optionally, shape). We'll create a new dtype by adding ('USNG', 'S100') to the end of that list:

In [87]: new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])

Now create a new structured array, b. I used zeros here, so the string fields will start out with the value ''. You could also use empty. The strings will then contain garbage, but that won't matter if you immediately assign values to them.

In [88]: b = np.zeros(a.shape, dtype=new_dt)

Copy over the existing data from a to b:

In [89]: b['i'] = a['i']

In [90]: b['loc'] = a['loc']

Here's b now:

In [91]: b
Out[91]: 
array([(1, [-112.01268501699997, 40.64249414272372], ''),
       (2, [-111.86145708699996, 40.4945008710162], '')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])

Fill in the new field with some data:

In [93]: b['USNG'] = ['FOO', 'BAR']

In [94]: b
Out[94]: 
array([(1, [-112.01268501699997, 40.64249414272372], 'FOO'),
       (2, [-111.86145708699996, 40.4945008710162], 'BAR')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
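
The steps above can be collected into one script. This is only a minimal sketch, assuming Python 3 and the same array a; the loop over a.dtype.names is a shorthand for the per-field copies shown above, not part of the original session.

```python
import numpy as np

# The array from the example above.
a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
              (2, [-111.86145708699996, 40.4945008710162])],
             dtype=[('i', '<i8'), ('loc', '<f8', (2,))])

# A structured array is 1-D: fields are part of the dtype, not a second axis.
assert a.shape == (2,)

# Extend the dtype, allocate, then copy every existing field by name.
new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])
b = np.zeros(a.shape, dtype=new_dt)
for name in a.dtype.names:
    b[name] = a[name]

b['USNG'] = ['FOO', 'BAR']
print(b['USNG'])  # byte strings on Python 3, since 'S' is a bytes dtype
```

If text rather than bytes is wanted on Python 3, 'U100' can be used in place of 'S100'.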

A performance question: if you were updating the array's data in a for loop, would you vectorize the function? — code base 5000, Aug 21 '14 at 16:51
That depends on what type of update you are doing, and how you are going to vectorize it. It would be more effective to start a new question instead of discussing it in these comments. — Warren Weckesser, Aug 21 '14 at 18:28

score 4 · Answer 2 · edited Jun 15 '21 at 14:15

Have you tried using numpy's recfunctions?

import numpy.lib.recfunctions as rfn

It has some very useful functions for structured arrays.

For your case, I think it could be accomplished with:

a = rfn.append_fields(a, 'USNG', np.empty(a.shape[0], dtype='|S100'), dtypes='|S100')

Tested here and it worked.
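
One detail worth knowing: append_fields returns a masked array by default, and passing usemask=False gives a plain structured array instead. The sketch below illustrates this on a toy array with scalar fields only; the array, the 'lon' field and the values are assumptions made for the illustration, not the question's data.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Toy array with scalar fields only (hypothetical data for illustration).
a = np.array([(1, -112.0), (2, -111.9)], dtype=[('i', '<i8'), ('lon', '<f8')])
usng = np.array(['FOO', 'BAR'], dtype='S100')

masked = rfn.append_fields(a, 'USNG', usng)                # default: usemask=True
plain = rfn.append_fields(a, 'USNG', usng, usemask=False)  # plain structured ndarray

print(type(masked).__name__, type(plain).__name__)
print(plain.dtype.names)
```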

merge_arrays

As GMSL mentioned in the comments, it is also possible to do this with rfn.merge_arrays, as shown below:

a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])
a2 = np.full(a.shape[0], '', dtype=[('USNG', '|S100')])
a3 = rfn.merge_arrays((a, a2), flatten=True)

a3 will have the value:

array([(1, [-112.01268502,   40.64249414], b''),
       (2, [-111.86145709,   40.49450087], b'')],
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])

the recfunction function `merge_arrays()` would be simpler in this case. — GMSL, Oct 22 '20 at 12:02
Cool, didn't know about that. It required a bit of tinkering to find that the flatten argument was needed to give the appropriate behavior :). Edited the answer accordingly. — Tonsic, Oct 24 '20 at 18:36

Trenton McKinney · Answer 3 · 2020-11-23T21:30:56.033

If pandas is an option, it makes adding a column to a recarray much easier.
- Additionally, the data will be in a form that's easily analyzed
- numpy is a pandas dependency, and makes many operations easier.
- Also see How to add a column to numpy recarry as another example.

1. Read the current recarray with pandas.DataFrame or pandas.DataFrame.from_records.
2. Add the new column of data to the dataframe.
3. Export the dataframe to a recarray with pandas.DataFrame.to_records.

import pandas as pd
import numpy as np

# current recarray
data = np.rec.array([(1, list([-112.01268501699997, 40.64249414272372])), (2, list([-111.86145708699996, 40.4945008710162]))], dtype=[('i', '<i8'), ('loc', 'O')])

# create dataframe
df = pd.DataFrame(data)

# display(df)
   i                                       loc
0  1  [-112.01268501699997, 40.64249414272372]
1  2   [-111.86145708699996, 40.4945008710162]

# add new column
df['USNG'] = ['Note 1', 'Note 2']

# display(df)
   i                                       loc    USNG
0  1  [-112.01268501699997, 40.64249414272372]  Note 1
1  2   [-111.86145708699996, 40.4945008710162]  Note 2

# write the dataframe to recarray
data = df.to_records(index=False)

print(data)
[out]:
rec.array([(1, list([-112.01268501699997, 40.64249414272372]), 'Note 1'),
           (2, list([-111.86145708699996, 40.4945008710162]), 'Note 2')],
          dtype=[('i', '<i8'), ('loc', 'O'), ('USNG', 'O')])

score 1 · Answer 4 · answered Mar 03 '15 at 11:13

The question is precisely: "Any suggestions on why this is happening?"

Fundamentally, this is a bug; it has been an open ticket at numpy since 2012.

Mike O'Connor


lexc · Answer 5 · 2021-11-07T17:54:15.510

With arrays of 2 million+ rows to work with, I immediately noticed a big difference between Warren Weckesser's solution and Tonsic's (thank you very much, both).

with

first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
       (1633046400309000, 1.347  , 1.34748),
       (1633046400923000, 1.347  , 1.34749), ...,
       (1635551693846000, 1.36931, 1.36958),
       (1635551693954000, 1.36925, 1.36952),
       (1635551697902000, 1.3692 , 1.36947)],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

and

second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
       ('2021-10-01T00:00:00.923000',), ...,
       ('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
       ('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])

I get

%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

and

%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Much better (note the .data at the end, which returns the underlying array without the mask and fill_value),

whereas using something like

def building_new(first_array, other_array):
    new_array = np.zeros(
        first_array.size, 
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array

(note that a one-dimensional structured array holds one record per row, so size gives the number of rows)

I get

%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

the output of all three is the same

[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
       (1633046400309000, 1.347  , 1.34748, '2021-10-01T00:00:00.309000'),
       (1633046400923000, 1.347  , 1.34749, '2021-10-01T00:00:00.923000'),
       ...,
       (1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
       (1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
       (1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])

a final thought:

When you build the new array by hand instead of using the recfunctions, the second array doesn't even need to be a structured one:

third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
       '2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
       '2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
      dtype='datetime64[us]')

%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
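
To check that the three approaches really agree, a small sanity test can be run on synthetic data. This is a sketch under assumptions: the five rows below are made up, and the same() helper is a hypothetical name introduced only for this comparison.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Small synthetic stand-ins for first_array and third_array (values are made up).
n = 5
first = np.zeros(n, dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
first['timestamp'] = 1633046400299000 + 1000 * np.arange(n)
first['bid'] = 1.347
first['ask'] = 1.34748
dates = first['timestamp'].astype('datetime64[us]')      # plain, unstructured
dates_struct = dates.view([('date_time', dates.dtype)])  # single-field structured view

# 1) build the new array by hand
manual = np.zeros(n, dtype=first.dtype.descr + [('date_time', '<M8[us]')])
for name in first.dtype.names:
    manual[name] = first[name]
manual['date_time'] = dates

# 2) and 3) the two recfunctions routes
via_append = rfn.append_fields(first, 'date_time', dates).data
via_merge = rfn.merge_arrays((first, dates_struct), flatten=True)

def same(x, y):
    """Compare two structured arrays field by field (hypothetical helper)."""
    return x.dtype.names == y.dtype.names and all(
        np.array_equal(x[f], y[f]) for f in x.dtype.names)

print(same(manual, via_append), same(manual, via_merge))
```

The timings quoted above will of course not be reproduced at this size; the sketch only checks that the results match.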

score 1 · Answer 6 · edited Aug 04 '23 at 19:19

Here's a function that implements Warren's solution:

import numpy as np

def happend(x, col_data, col_name: str):
    if not x.dtype.fields:
        return None  # not a structured array
    # 0) create a new structured array whose dtype has the extra field
    y = np.empty(x.shape, dtype=x.dtype.descr + [(col_name, col_data.dtype)])
    # 1) copy the old fields
    for name in x.dtype.names:
        y[name] = x[name]
    # 2) copy the new column
    y[col_name] = col_data
    return y

y = happend(x, np.arange(x.shape[0]), 'idx')  # assuming `x` is a structured array

edited by halfer · answered Sep 21 '22 at 09:09 by Diego Alonso


score 0 · Answer 7 · answered Oct 22 '20 at 12:04

Tonsic mentioned the recfunctions module, imported with import numpy.lib.recfunctions as rfn. In this case, a simpler function from that module that would work for you is rfn.merge_arrays().

GMSL
