14

I have a starting array such as:

[(1, [-112.01268501699997, 40.64249414272372])
 (2, [-111.86145708699996, 40.4945008710162])]

The first column is an int and the second is a list of floats. I need to add a str column called 'USNG'.

I then create a structured numpy array, as such:

dtype = numpy.dtype([('USNG', '|S100')])
x = numpy.empty(array.shape, dtype=dtype)

I want to append the x numpy array to the existing array as a new column, so I can output some information to that column for each row.

When I do the following:

numpy.append(array, x, axis=1)

I get the following error:

'TypeError: invalid type promotion'

I've also tried vstack and hstack

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
code base 5000
  • 3,812
  • 13
  • 44
  • 73

7 Answers7

16

You have to create a new dtype that contains the new field.

For example, here's a:

In [86]: a
Out[86]: 
array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])

a.dtype.descr is [('i', '<i8'), ('loc', '<f8', (2,))]; i.e. a list of field types. We'll create a new dtype by adding ('USNG', 'S100') to the end of that list:

In [87]: new_dt = np.dtype(a.dtype.descr + [('USNG', 'S100')])

Now create a new structured array, b. I used zeros here, so the string fields will start out with the value ''. You could also use empty. The strings will then contain garbage, but that won't matter if you immediately assign values to them.

In [88]: b = np.zeros(a.shape, dtype=new_dt)

Copy over the existing data from a to b:

In [89]: b['i'] = a['i']

In [90]: b['loc'] = a['loc']

Here's b now:

In [91]: b
Out[91]: 
array([(1, [-112.01268501699997, 40.64249414272372], ''),
       (2, [-111.86145708699996, 40.4945008710162], '')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])

Fill in the new field with some data:

In [93]: b['USNG'] = ['FOO', 'BAR']

In [94]: b
Out[94]: 
array([(1, [-112.01268501699997, 40.64249414272372], 'FOO'),
       (2, [-111.86145708699996, 40.4945008710162], 'BAR')], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • A performance question, if you were doing a for loop on the data update of the array, would you vectorize the function? – code base 5000 Aug 21 '14 at 16:51
  • That depends on what type of update you are doing, and how you are going to vectorize it. It would be more effective to start a new question instead of discussing it in these comments. – Warren Weckesser Aug 21 '14 at 18:28
4

Have you tried using numpy's recfunctions?

import numpy.lib.recfunctions as rfn

It has some very useful functions for structured arrays.

For your case, I think it could be accomplished with:

a = rfn.append_fields(a, 'USNG', np.empty(a.shape[0], dtype='|S100'), dtypes='|S100')

Tested here and it worked.


merge_arrays

As GMSL mentioned in the comments. It is possible to do that with rfn.merge_arrays like below:

a = np.array([(1, [-112.01268501699997, 40.64249414272372]),
       (2, [-111.86145708699996, 40.4945008710162])], 
      dtype=[('i', '<i8'), ('loc', '<f8', (2,))])
a2 = np.full(a.shape[0], '', dtype=[('USNG', '|S100')])
a3 = rfn.merge_arrays((a, a2), flatten=True)

a3 will have the value:

array([(1, [-112.01268502,   40.64249414], b''),
       (2, [-111.86145709,   40.49450087], b'')],
      dtype=[('i', '<i8'), ('loc', '<f8', (2,)), ('USNG', 'S100')])
djvg
  • 11,722
  • 5
  • 72
  • 103
Tonsic
  • 890
  • 11
  • 15
  • 1
    the recfunction function `merge_arrays()` would be simpler in this case. – GMSL Oct 22 '20 at 12:02
  • 1
    Cool, didn't know about that. Required a bit of tinkering to find that the flatten argument was required to give the appropriate behavior :). Editted the answer accordingly. – Tonsic Oct 24 '20 at 18:36
2
  • If pandas is an option, it makes adding a column to a recarray, much easier.
  1. Read the current recarray with pandas.DataFrame or pandas.DataFrame.from_records.
  2. Add the new column of data to the dataframe
  3. Export the dataframe to a recarray with pandas.DataFrame.to_records
import pandas as pd
import numpy as np

# current recarray
data = np.rec.array([(1, list([-112.01268501699997, 40.64249414272372])), (2, list([-111.86145708699996, 40.4945008710162]))], dtype=[('i', '<i8'), ('loc', 'O')])

# create dataframe
df = pd.DataFrame(data)

# display(df)
   i                                       loc
0  1  [-112.01268501699997, 40.64249414272372]
1  2   [-111.86145708699996, 40.4945008710162]

# add new column
df['USNG'] = ['Note 1', 'Note 2']

# display(df)
   i                                       loc    USNG
0  1  [-112.01268501699997, 40.64249414272372]  Note 1
1  2   [-111.86145708699996, 40.4945008710162]  Note 2

# write the dataframe to recarray
data = df.to_records(index=False)

print(data)
[out]:
rec.array([(1, list([-112.01268501699997, 40.64249414272372]), 'Note 1'),
           (2, list([-111.86145708699996, 40.4945008710162]), 'Note 2')],
          dtype=[('i', '<i8'), ('loc', 'O'), ('USNG', 'O')])
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
1

The question is precisely: "Any suggestions on why this is happening?"

Fundamentally, this is a bug--- it's been an open ticket at numpy since 2012.

Mike O'Connor
  • 2,494
  • 1
  • 15
  • 17
1

with 2mil+ arrays to work with, I immediately noticed a big difference between Warren Weckesser's solution and Tonsic's ones (thank you very much both)

with

first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
       (1633046400309000, 1.347  , 1.34748),
       (1633046400923000, 1.347  , 1.34749), ...,
       (1635551693846000, 1.36931, 1.36958),
       (1635551693954000, 1.36925, 1.36952),
       (1635551697902000, 1.3692 , 1.36947)],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

and

second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
       ('2021-10-01T00:00:00.923000',), ...,
       ('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
       ('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])

I get

%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

and

%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

much better (and notice .data at the end to avoid getting mask and fill_value)

whereas using something like

def building_new(first_array, other_array):
    new_array = np.zeros(
        first_array.size, 
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array

(notice that in a structured array every row is a tuple, so size works nicely)

I get

%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

the output of all three is the same

[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
       (1633046400309000, 1.347  , 1.34748, '2021-10-01T00:00:00.309000'),
       (1633046400923000, 1.347  , 1.34749, '2021-10-01T00:00:00.923000'),
       ...,
       (1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
       (1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
       (1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
      dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])

a final thought:

creating the new array instead of the recfunctions, the second array doesn't even need to be a structured one

third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
       '2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
       '2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
      dtype='datetime64[us]')

%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
lexc
  • 61
  • 3
1

Here's a function that implements Warren's solution:

def happend(x, col_data,col_name:str):
    if not x.dtype.fields:  return None                                     # Not a structured array
    y = np.empty(x.shape, dtype=x.dtype.descr+[(col_name,col_data.dtype)])  # 0) create new structured array
    for name in x.dtype.fields.keys():  y[name] = x[name]                   # 1) copy old array
    y[col_name] = col_data                                                  # 2) copy new column
    return y

y = happend(x, np.arange(x.shape[0]),'idx')  # assuming `x` is a structured array
halfer
  • 19,824
  • 17
  • 99
  • 186
  • Hi Diego. I can't recall if we have had this conversation before, apologies. It is a strong editing guideline here that anything not germane to a post can be removed. We have a loose network of curators and editors who like to improve posts towards technical writing, so the Q&A material here sort-of resembles documentation or wiki material. – halfer Aug 04 '23 at 19:21
  • Thus political/religious/theistic material is often removed on sight - 'tis house style (un)fortunately! There is wide leeway for you to add what you like in your profile though. – halfer Aug 04 '23 at 19:22
0

Tonsic mentioned the recfunctions by import numpy.lib.recfunctions as rfn. In this case, a simpler recfunction function that would work for you is rfn.merge_arrays() (docs).

GMSL
  • 355
  • 2
  • 11