With arrays of over 2 million rows to work with, I immediately noticed a big difference between Warren Weckesser's solution and Tonsic's (thank you both very much).
with
first_array
[out]
array([(1633046400299000, 1.34707, 1.34748),
(1633046400309000, 1.347 , 1.34748),
(1633046400923000, 1.347 , 1.34749), ...,
(1635551693846000, 1.36931, 1.36958),
(1635551693954000, 1.36925, 1.36952),
(1635551697902000, 1.3692 , 1.36947)],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])
and
second_array
[out]
array([('2021-10-01T00:00:00.299000',), ('2021-10-01T00:00:00.309000',),
('2021-10-01T00:00:00.923000',), ...,
('2021-10-29T23:54:53.846000',), ('2021-10-29T23:54:53.954000',),
('2021-10-29T23:54:57.902000',)], dtype=[('date_time', '<M8[us]')])
I get
%timeit rfn.merge_arrays((first_array, second_array), flatten=True)
[out]
13.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
%timeit rfn.append_fields(first_array, 'date_time', second_array, dtypes='M8[us]').data
[out]
2.12 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
much better (note the .data at the end, which avoids getting the mask and fill_value)
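As a side note on the .data trick: append_fields returns a masked array by default, and it also accepts a usemask argument. The sketch below uses two rows copied from first_array above and derives the dates from the timestamp column (an assumption for illustration, so the example is self-contained); the names base, dates, masked and plain are mine. I have not timed usemask=False against .data on the full data set.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Two rows taken from first_array above, enough to show the return types.
base = np.array(
    [(1633046400299000, 1.34707, 1.34748),
     (1633046400309000, 1.347, 1.34748)],
    dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

# Assumption: the timestamps are microseconds since the epoch,
# so they can be reinterpreted directly as datetime64[us].
dates = base['timestamp'].astype('datetime64[us]')

# Default behaviour: a masked array (hence the .data in the timing above).
masked = rfn.append_fields(base, 'date_time', dates, dtypes='M8[us]')

# usemask=False asks for a plain ndarray instead.
plain = rfn.append_fields(base, 'date_time', dates, dtypes='M8[us]',
                          usemask=False)

print(type(masked).__name__)
print(type(plain).__name__)
```

Either way the resulting fields hold the same values; the difference is only in the container type that comes back.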
whereas using something like
def building_new(first_array, other_array):
    new_array = np.zeros(
        first_array.size,
        dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
    new_array[['timestamp', 'bid', 'ask']] = first_array[['timestamp', 'bid', 'ask']]
    new_array['date_time'] = other_array
    return new_array
(note that a structured array is one-dimensional here, with each row stored as a single record, so first_array.size is the number of rows)
I get
%timeit building_new(first_array, second_array)
[out]
67.2 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
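For anyone who wants to try the "build a new array" idea without hard-coding the field list, here is a minimal sketch that derives the new dtype from the existing one. The helper name add_column is hypothetical, the sample rows are the first three from first_array, and it assumes a packed (unpadded) dtype so that dtype.descr contains only the real fields. It copies field by field rather than with a multi-field assignment; I have not benchmarked this variant.

```python
import numpy as np

# First three rows of first_array from above.
first = np.array(
    [(1633046400299000, 1.34707, 1.34748),
     (1633046400309000, 1.347, 1.34748),
     (1633046400923000, 1.347, 1.34749)],
    dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8')])

# Assumption: timestamps are microseconds since the epoch.
dates = first['timestamp'].astype('datetime64[us]')


def add_column(base, name, values):
    """Return a new structured array: base's fields plus one extra column."""
    # Extend the existing field list with the new (name, dtype) pair.
    new_dtype = base.dtype.descr + [(name, values.dtype.str)]
    out = np.empty(base.size, dtype=new_dtype)
    # Copy each existing column, then fill in the new one.
    for field in base.dtype.names:
        out[field] = base[field]
    out[name] = values
    return out


combined = add_column(first, 'date_time', dates)
print(combined.dtype.names)
print(combined['date_time'])
```

The second argument here is a plain datetime64 array, not a structured one.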
the output of all three is the same
[out]
array([(1633046400299000, 1.34707, 1.34748, '2021-10-01T00:00:00.299000'),
(1633046400309000, 1.347 , 1.34748, '2021-10-01T00:00:00.309000'),
(1633046400923000, 1.347 , 1.34749, '2021-10-01T00:00:00.923000'),
...,
(1635551693846000, 1.36931, 1.36958, '2021-10-29T23:54:53.846000'),
(1635551693954000, 1.36925, 1.36952, '2021-10-29T23:54:53.954000'),
(1635551697902000, 1.3692 , 1.36947, '2021-10-29T23:54:57.902000')],
dtype=[('timestamp', '<i8'), ('bid', '<f8'), ('ask', '<f8'), ('date_time', '<M8[us]')])
A final thought:
when building the new array by hand instead of using the recfunctions, the second array doesn't even need to be a structured one
third_array
[out]
array(['2021-10-01T00:00:00.299000', '2021-10-01T00:00:00.309000',
'2021-10-01T00:00:00.923000', ..., '2021-10-29T23:54:53.846000',
'2021-10-29T23:54:53.954000', '2021-10-29T23:54:57.902000'],
dtype='datetime64[us]')
%timeit building_new(first_array, third_array)
[out]
67 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)