
I have been trying for hours to come up with the most efficient way to structure and append streaming tick data to a shared-memory numpy array, and then get a pandas DataFrame out of it in a timely fashion.

import datetime
import numpy as np
import pandas as pd

#source tick data comes in as dict
now = datetime.datetime.now()
tick_data = {"bid": float(1.2), "ask": float(1.3), "time": now}

#construct structured np array from it
dtype_conf = [('bid', '<f4'), ('ask', '<f4'), ('time', 'datetime64[us]')]
new_tick = np.array([(tick_data["bid"], tick_data["ask"], tick_data["time"])],
                    dtype=dtype_conf)

#append / vstack / .. it to existing shared numpy array
shared_np_array = np.vstack((shared_np_array, new_tick))

#fast construction of pd.DataFrame if needed
pd.DataFrame(shared_np_array.reshape((1, -1))[0])

Questions:

1) What is the right way to structure the array, and what is a faster way to append new tick data to it?

2) What would be the most efficient approach to creating either a pd.DataFrame of the complete array or a pd.Series for one column?

3) Is there a better way to work with a shared-memory time series in Python (besides multiprocessing.BaseManager)?

Many thanks!

trbck
  • I recommend creating a list of tuples, and making a structured array once (see the sketch after these comments): https://stackoverflow.com/q/48751127/901925 – hpaulj Feb 19 '18 at 16:58
  • Are you aware that with each `vstack` you create a new array? – hpaulj Feb 19 '18 at 17:49
  • I see. Do you have a better idea how to append more efficiently? – trbck Feb 19 '18 at 20:25
  • See also [python - Fastest way to grow a numpy numeric array - Stack Overflow](https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array) – user202729 Oct 25 '21 at 00:56
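
A minimal sketch of hpaulj's suggestion, reusing the dtype from the question: collect incoming ticks as plain tuples in a Python list (amortized O(1) appends, no copying of existing data) and build the structured array once, only when the numpy view is actually needed.

import datetime
import numpy as np

dtype_conf = [('bid', '<f4'), ('ask', '<f4'), ('time', 'datetime64[us]')]
ticks = []  # plain list; each append is cheap and copies nothing

# per incoming tick: append a tuple, do not touch numpy yet
tick_data = {"bid": 1.2, "ask": 1.3, "time": datetime.datetime.now()}
ticks.append((tick_data["bid"], tick_data["ask"], tick_data["time"]))

# build the structured array once, when the numpy view is needed
arr = np.array(ticks, dtype=dtype_conf)  # shape (n,), one record per tick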

1 Answer


numpy is not a good choice of data structure for appending data: every append (vstack, concatenate, np.append, ...) allocates a new array and copies all of the existing data, so growing an array tick by tick costs quadratic time overall.

The most versatile choice in pure Python is collections.deque, which is optimized for appending and popping items at either end of the sequence.

This is how your code might look:

import pandas as pd, numpy as np
import datetime
from collections import deque

now = datetime.datetime.now()
lst_d = deque()

#source tick data comes in as dict
tick_data = {"bid": float(1.2), "ask": float(1.3), "time": now}

#construct a structured np array for a single tick
dtype_conf = [('bid', '<f4'), ('ask', '<f4'), ('time', 'datetime64[us]')]
new_tick = np.array([(11.11, 22.22, now)], dtype=dtype_conf)

# append the single record to the existing deque as a [bid, ask, time] row
lst_d.append(list(new_tick[0]))

# example of how your deque may look after a few ticks
lst_d = deque([[1, 2, 'time1'], [3, 4, 'time3'], [4, 5, 'time4']])

#fast dataframe construction
print(pd.DataFrame(list(lst_d), columns=['bid', 'ask', 'time']))

#    bid  ask   time
# 0    1    2  time1
# 1    3    4  time3
# 2    4    5  time4

Not sure why reshape is required with a numpy array:

# the same example with a plain numpy array instead of a deque
lst_d = np.array([[1, 2, 'time1'], [3, 4, 'time3'], [4, 5, 'time4']])

#fast dataframe construction
print(pd.DataFrame(lst_d, columns=['bid', 'ask', 'time']))

#    bid  ask   time
# 0    1    2  time1
# 1    3    4  time3
# 2    4    5  time4
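
A plausible explanation for the reshape: np.vstack treats each (1,)-shaped structured array as a (1, 1) row, so the question's shared array grows into shape (n, 1), and pd.DataFrame accepts a structured array only in 1-D form. Keeping the array one-dimensional avoids the reshape altogether, and a single field of a structured array converts straight into a pd.Series (question 2). A minimal sketch, assuming the dtype from the question:

import datetime
import numpy as np
import pandas as pd

dtype_conf = [('bid', '<f4'), ('ask', '<f4'), ('time', 'datetime64[us]')]
now = datetime.datetime.now()

a = np.array([(1.2, 1.3, now)], dtype=dtype_conf)   # shape (1,)
b = np.array([(1.4, 1.5, now)], dtype=dtype_conf)   # shape (1,)

# np.concatenate keeps the result 1-D (though it still copies, like vstack)
arr = np.concatenate((a, b))                        # shape (2,)

df = pd.DataFrame(arr)         # works directly, no reshape needed
s = pd.Series(arr['bid'])      # one field -> pd.Series, no extra processing
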
jpp
  • Thanks! Problem is that I am stuck with a numpy array, as I was planning to use it via POSIX shared memory (https://gitlab.com/tenzing/shared-array) for fast writes and reads between several processes. I have been looking for a fast Python shared-memory solution for quite some time, since multiprocessing.BaseManager with a dict/list setup was too slow. – trbck Feb 19 '18 at 15:18
  • @trbck, I see. Unfortunately, I don't know how you can improve numpy append performance. Will keep my answer up so that others don't fall into the trap. – jpp Feb 19 '18 at 15:20
  • I mean "appending" via vstack is quite fast. But the need to reshape the np.array to create a dataframe with pd.DataFrame(a.reshape((1,-1))[0]) slows it down quite a lot. I was also wondering if I could change the structure of the np.array to create a dataframe faster here, since the dtypes are already properly set up within the np.array anyway (so no further processing/calculation should be needed for the df). – trbck Feb 19 '18 at 16:21
  • @trbck, I added an example. I'm not sure why `reshape` is necessary – jpp Feb 19 '18 at 16:30
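
For question 3, a hedged sketch of an alternative to SharedArray and multiprocessing.BaseManager: on Python 3.8+, the standard library's multiprocessing.shared_memory can back a preallocated structured array, so "appending" becomes a constant-time in-place write instead of a copy. The capacity bound, the block name 'ticks', and the single-writer assumption are illustrative choices, not part of the original thread:

import datetime
import numpy as np
import pandas as pd
from multiprocessing import shared_memory

dtype_conf = np.dtype([('bid', '<f4'), ('ask', '<f4'), ('time', 'datetime64[us]')])
CAPACITY = 1_000_000  # assumed fixed upper bound on the number of ticks

# writer process: allocate the block once, then fill rows in place
shm = shared_memory.SharedMemory(create=True, name='ticks',
                                 size=CAPACITY * dtype_conf.itemsize)
buf = np.ndarray((CAPACITY,), dtype=dtype_conf, buffer=shm.buf)

n = 0  # next free slot; a real setup must share/synchronize this counter
buf[n] = (1.2, 1.3, datetime.datetime.now())
n += 1

# reader process: attach to the same block and view only the filled rows
shm_r = shared_memory.SharedMemory(name='ticks')
view = np.ndarray((CAPACITY,), dtype=dtype_conf, buffer=shm_r.buf)
df = pd.DataFrame(view[:n])  # 1-D structured slice -> DataFrame, no reshape

Preallocation trades a one-time allocation for constant-time per-tick writes, which is exactly the cost profile the vstack approach lacks.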