
I am trying to do bulk calculations on multiple stock symbols over many time periods using numpy vectorization, but I am not sure whether my approach to the problem is the most efficient one. I am using the "tulipy" technical indicator library to perform calculations on the "Close" price of different stocks.

Here is what the source data would look like as a dataframe:

  Ticker        Date      Close
0      A  1999-11-18  31.473534
1      A  1999-11-19  28.880543
2      A  1999-11-22  31.473534
3      A  1999-11-23  28.612303
4      A  1999-11-24  29.372318

Here is reproducible code:

d = {'Ticker': ['A'] * 15,
     'Date': ['1999-11-18', '1999-11-19', '1999-11-22', '1999-11-23', '1999-11-24',
              '1999-11-26', '1999-11-29', '1999-11-30', '1999-12-01', '1999-12-02',
              '1999-12-03', '1999-12-06', '1999-12-07', '1999-12-08', '1999-12-09'],
     'Close': ['31.473534', '28.880543', '31.473534', '28.612303', '29.372318',
               '29.461731', '30.132332', '30.177038', '30.713520', '31.562946',
               '31.831188', '32.725323', '32.367668', '32.322960', '32.770027']}
df = pd.DataFrame(data=d)

Calculations are done on the floating-point numbers, but note that the sample dict above stores "Close" as strings, so the column needs converting first.
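
A minimal conversion step before any indicator math (parsing "Date" as well, on the assumption that real timestamps are wanted downstream):

df['Close'] = df['Close'].astype(float)  # string prices -> float64
df['Date'] = pd.to_datetime(df['Date'])  # ISO date strings -> Timestamps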

Then, I am creating a numpy array from the queried results and finding the unique values (tickers) to iterate over for indexing purposes:

import pandas as pd
import numpy as np
import tulipy as ti

# rows: a[0] = tickers, a[1] = dates, a[2] = close prices (dtype=object)
a = np.array([df['Ticker'], df['Date'], df['Close']])
# unique tickers, plus each ticker's first-row index and row count
x_unique = np.unique(a[0], return_counts=True, return_index=True)

Here is what the source data would look like as a numpy array:

array([['A', Timestamp('1999-11-18 00:00:00'), 31.473533630371094],
       ['A', Timestamp('1999-11-19 00:00:00'), 28.880542755126953],
       ['A', Timestamp('1999-11-22 00:00:00'), 31.473533630371094],
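
For reference, np.unique with those flags returns a 3-tuple in a fixed order (values, first indices, counts) regardless of keyword order; a quick sketch of how the pieces line up (the names are just for illustration):

tickers, first_idx, counts = x_unique
# tickers   -> unique symbols, sorted       (e.g. array(['A'], dtype=object))
# first_idx -> each ticker's first row      (e.g. array([0]))
# counts    -> number of rows per ticker    (e.g. array([15]))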

Then, I am iterating over each unique ticker, calculating the Relative Strength Index (RSI) from the "tulipy" technical indicators library while maintaining the respective date for each price data point, and storing those values in a dictionary:

dictt = {}
start = 0
d = 0
for i in x_unique[0]:
    # slice out this ticker's rows, compute RSI on the closes, keep the dates
    dictt[i] = [ti.rsi(a[2, start:start + x_unique[2][d]].astype(float), 14),
                a[1, start:start + x_unique[2][d]]]
    start = start + x_unique[2][d]
    d += 1

All of this works fine and is quite quick, but I feel as if there is a numpy way of iterating over each unique ticker rather than using `for i in x_unique[0]`, as this could become resource intensive once I implement 6,000+ tickers (see the np.split sketch after the sample output below). Here is the dictionary (dictt) sample output:

{'A': [array([54.98280996, 51.72842265, 53.80685853, ..., 71.42460267,
         68.75692746, 65.20371964]),
  array([Timestamp('1999-11-18 00:00:00'), Timestamp('1999-11-19 00:00:00'),
         Timestamp('1999-11-22 00:00:00'), ...,
         Timestamp('2019-11-27 00:00:00'), Timestamp('2019-11-29 00:00:00'),
         Timestamp('2019-12-02 00:00:00')], dtype=object),
  array([31.47353363, 30.89731344, 31.02536237, ..., 79.59926089,
         79.85942439, 79.96844085]),
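
One numpy-native way to drop the running start/d counters is np.split on the first-occurrence indices; a sketch that assumes the rows are sorted by ticker (it still loops in Python, it just hides the bookkeeping):

# split closes and dates into one sub-array per ticker;
# x_unique[1] holds each ticker's first row index, so drop the leading 0
closes = np.split(a[2].astype(float), x_unique[1][1:])
dates = np.split(a[1], x_unique[1][1:])
dictt = {t: [ti.rsi(c, 14), dt]
         for t, c, dt in zip(x_unique[0], closes, dates)}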

The next issue is how to get this into a pandas dataframe and/or a SQL database in the most efficient manner, because getting the data into a readable format is where most of the script's inefficiency lies. Here is the output of converting it to a pandas dataframe, which is not in the desired format:

pd.DataFrame({k: pd.Series(v) for k, v in dictt.items()})

A   AAPL    AEP AMAT    ATVI
0   [54.98280995952094, 51.72842265345178, 53.8068...   [64.49276119489626, 58.970440366273245, 54.374...   [48.78048780487805, 38.63298662704309, 37.7864...   [41.830064699495054, 45.295506922269944, 44.34...   [45.23809231868367, 45.23809231868366, 58.8677...
1   [1999-11-18 00:00:00, 1999-11-19 00:00:00, 199...   [1980-12-12 00:00:00, 1980-12-15 00:00:00, 198...   [1970-01-02 00:00:00, 1970-01-05 00:00:00, 197...   [1980-03-17 00:00:00, 1980-03-18 00:00:00, 198...   [1993-10-25 00:00:00, 1993-10-26 00:00:00, 199...
2   [31.473533630371094, 30.897313435872395, 31.02...   [0.5133928656578064, 0.5074404809210036, 0.494...   [30.625, 30.708333333333332, 30.71759259259259...   [0.0954861119389534, 0.09510030928585264, 0.09...   [0.9375, 0.9212962918811374, 0.908693407788688...

I would prefer it to output like this:

       rsi        date     ema8    ema20    ema50   ema100   ema200
A  54.9828  1999-11-18  31.4735  31.4735  31.4735  31.4735  31.4735
A  51.7284  1999-11-19  30.8973  31.2266  31.3718  31.4222  31.4477
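
For what it's worth, here is a minimal sketch that reshapes dictt into that long format, assuming each value holds [rsi_array, date_array] and front-padding the RSI warm-up period with NaN:

frames = []
for ticker, (rsi_vals, dates) in dictt.items():
    # ti.rsi drops the first `period` bars, so pad the front with NaN to realign
    pad = np.full(len(dates) - len(rsi_vals), np.nan)
    frames.append(pd.DataFrame({'date': dates,
                                'rsi': np.concatenate([pad, rsi_vals])},
                               index=[ticker] * len(dates)))
tidy = pd.concat(frames)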

So my main questions are:

  1. Is there a better way to perform vectorized calculations on this dataset while maintaining the order/matching of stock symbols, dates, and their calculated values?
  2. Could I skip storing all the data in a dictionary and keep everything in numpy arrays?
  3. Can I store the values from the numpy array directly into a SQL database or a tabular pandas dataframe?

UPDATE for joining dates and rsi in a dataframe:

df_dictt = pd.DataFrame(dictt).T   # one row per ticker: column 0 = rsi, column 1 = dates

output = pd.DataFrame()
for i in range(len(df_dictt)):
    # build a dates frame and left-join the rsi values onto it
    # (a frame built from a 1-D array has a single column named 0, hence {0: 'date'})
    df_join = (pd.DataFrame(df_dictt[1][i]).rename(columns={0: 'date'})
                 .join(pd.DataFrame(df_dictt[0][i]).rename(columns={0: 'rsi'}),
                       how='left'))
    # rsi output is 14 bars shorter than the dates, so shift it into position
    df_join['rsi'] = df_join['rsi'].shift(14)
    df_join['ticker'] = df_dictt.index[i]
    output = output.append(df_join, ignore_index=False)
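
As an aside, appending to a DataFrame inside a loop copies the whole frame on every pass; a lighter variant of the same join collects the per-ticker frames in a list and concatenates once:

pieces = []
for i in range(len(df_dictt)):
    df_join = (pd.DataFrame(df_dictt[1][i]).rename(columns={0: 'date'})
                 .join(pd.DataFrame(df_dictt[0][i]).rename(columns={0: 'rsi'}),
                       how='left'))
    df_join['rsi'] = df_join['rsi'].shift(14)
    df_join['ticker'] = df_dictt.index[i]
    pieces.append(df_join)
output = pd.concat(pieces)
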
  • It's a bit difficult to follow, could you share more of the data? As it stands, we don't even have the NumPy arrays. – AMC Feb 05 '20 at 17:51
  • @AMC I have updated the question with sample data. – Bird_Muffin Feb 05 '20 at 17:55
  • This might be helpful: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples. If you could replace your SQL code (which seems irrelevant) with a sample dataframe that illustrates the problem, that would be helpful, and ideally ask one question at a time. – Stuart Feb 05 '20 at 17:55
  • @Bird_Muffin We need the data to understand how it is being transformed. It still isn't clear to me what exactly is happening. – AMC Feb 05 '20 at 18:39
  • @AMC OK, I will try to clarify as much as possible. I have added a sample of the original dataframe. The "Close" price in the dataframe is the series of data being calculated on by the tulipy python library (ti). Those values are then stored in a dictionary with the "Ticker" and "Date". I am most curious about how to transform that dictionary of lists of numpy arrays into a readable dataframe (the fastest/most efficient way). – Bird_Muffin Feb 05 '20 at 18:48
  • What's the `dtype` of the "source data as array"? I'm seeing a mix of string, time, and float. – hpaulj Feb 05 '20 at 21:08
  • Does `rsi` return a single array or an array of arrays? – Stuart Feb 05 '20 at 21:15
  • @Stuart It takes 1 input array and outputs 1 array of values – Bird_Muffin Feb 05 '20 at 22:05
  • @hpaulj Yes, original input is string, datetime, and float – Bird_Muffin Feb 05 '20 at 22:09

1 Answer


To make the dictionary, you should use pandas DataFrame.groupby rather than processing into arrays and then iterating through them.

dictt = {ticker: ti.rsi(group["Close"], 14)
         for ticker, group in df.groupby("Ticker")}

group["Close"] will be a pandas series. If the rsi function doesn't accept this, then you might need to add something like group["Close"].to_numpy(dtype=float) to convert them into numpy arrays.

You can add the rsi values directly to the dataframe, one group at a time, like this:

for _, group in df.groupby("Ticker"):
    # rsi returns len(group) - 14 values, so skip the first 14 index labels
    df.loc[group.index[14:], "rsi"] = ti.rsi(group["Close"], 14)

You could also apply the function to the groups, but in this case you need to convert the result to a pandas Series each time so that the indices are retained:

def rsi(group, period):
    # wrap tulipy's output in a Series indexed by the rows after the warm-up
    return pd.Series(ti.rsi(group["Close"], period), index=group.index[period:])

df["rsi"] = df.groupby("Ticker").apply(rsi, 14).reset_index(level=0, drop=True)
Stuart
  • Yes, these suggestions work smoothly. Your first suggestion is considerably faster for calculations than what I had originally utilized. However, I cannot pass `group["Date"]` as a parameter in `ti.rsi()`, so I need to think of a way to efficiently join the dates without mismatching the order. – Bird_Muffin Feb 05 '20 at 21:09
  • See edits - I realized `rsi` doesn't take the date. With the final way, you can easily add the date to each group result. – Stuart Feb 05 '20 at 21:13
  • See further edits - as `rsi` returns only a single array the solution can be greatly simplified by adding the rsi to the existing dataframe. – Stuart Feb 05 '20 at 22:10
  • This suggestion works, but is marginally slower than applying the function to each group. Also, since the `rsi` function takes a period `n`, the resulting array has fewer values than the length of the index because the array doesn't retain the NaN values. Originally, I dealt with this via an inefficient method -- see the edit update in the original post. – Bird_Muffin Feb 05 '20 at 22:31
  • See further edits... you should be able to offset the index to put the rsi results in the right position. – Stuart Feb 05 '20 at 23:30