I am trying to do bulk calculations on multiple stock symbols over many time periods using numpy vectorization, but I am not sure whether my approach is the most efficient. I am using the "tulipy" technical indicator library to perform calculations on the "Close" price of different stocks.
Here is what the source data would look like as a dataframe:
Ticker Date Close
0 A 1999-11-18 31.473534
1 A 1999-11-19 28.880543
2 A 1999-11-22 31.473534
3 A 1999-11-23 28.612303
4 A 1999-11-24 29.372318
Here is reproducible code:
import pandas as pd
import numpy as np
import tulipy as ti

d = {'Ticker': ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A'], 'Date': ['1999-11-18', '1999-11-19', '1999-11-22', '1999-11-23', '1999-11-24','1999-11-26', '1999-11-29', '1999-11-30', '1999-12-01', '1999-12-02','1999-12-03', '1999-12-06', '1999-12-07', '1999-12-08', '1999-12-09'], 'Close': ['31.473534','28.880543','31.473534','28.612303','29.372318','29.461731','30.132332','30.177038','30.713520','31.562946','31.831188','32.725323','32.367668','32.322960','32.770027']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'])    # gives the Timestamp values shown below
df['Close'] = df['Close'].astype(float)    # the closes arrive as strings
Calculations are done on the floating-point "Close" values, hence the astype(float) conversion above.
Then, I am creating a numpy array from the queried results and finding the unique values (tickers) to iterate over for indexing purposes:
a = np.array([df['Ticker'], df['Date'], df['Close']])
x_unique = np.unique(a[0], return_index=True, return_counts=True)
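For reference, x_unique is the usual 3-tuple that np.unique returns with these keywords, always in the same order regardless of how the keywords are written: sorted unique tickers, their first-occurrence indices, and their counts. For the sample frame above:
tickers, first_idx, counts = x_unique
# tickers   -> array(['A'], dtype=object)
# first_idx -> array([0])
# counts    -> array([15])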
Here is what the source data would look like as a numpy array:
array([['A', Timestamp('1999-11-18 00:00:00'), 31.473533630371094],
['A', Timestamp('1999-11-19 00:00:00'), 28.880542755126953],
['A', Timestamp('1999-11-22 00:00:00'), 31.473533630371094],
Then, I am iterating over each unique ticker, calculating the Relative Strength Index (RSI) with the "tulipy" technical indicators library while maintaining the respective date for each price data point, and storing those values in a dictionary:
dictt = {}
start = 0  # running row offset into the flat array
d = 0      # position in the sorted arrays returned by np.unique
for i in x_unique[0]:
    # slice out this ticker's rows, compute a 14-period RSI on the
    # closes, and keep the matching dates alongside it
    dictt[i] = [ti.rsi(a[2, start:start+x_unique[2][d]].astype(float), 14),
                a[1, start:start+x_unique[2][d]]]
    start = start + x_unique[2][d]
    d += 1
All of this works fine and is quite quick, but I feel as if there is a more numpy-native way of iterating over each unique ticker than the "for i in x_unique[0]" loop, as this could become resource intensive once I implement 6,000+ tickers; one alternative I have sketched is shown after the sample output below. Here is the dictionary (dictt) sample output:
{'A': [array([54.98280996, 51.72842265, 53.80685853, ..., 71.42460267,
68.75692746, 65.20371964]),
array([Timestamp('1999-11-18 00:00:00'), Timestamp('1999-11-19 00:00:00'),
Timestamp('1999-11-22 00:00:00'), ...,
Timestamp('2019-11-27 00:00:00'), Timestamp('2019-11-29 00:00:00'),
Timestamp('2019-12-02 00:00:00')], dtype=object),
array([31.47353363, 30.89731344, 31.02536237, ..., 79.59926089,
79.85942439, 79.96844085]),
The next issue is how to get this into a pandas dataframe and/or a SQL database in the most efficient manner, because getting the data into a readable format is where most of the script's inefficiency lies. Here is the output of converting it to a pandas dataframe, which is not in the format I want:
pd.DataFrame({k: pd.Series(v) for k, v in dictt.items()})
A AAPL AEP AMAT ATVI
0 [54.98280995952094, 51.72842265345178, 53.8068... [64.49276119489626, 58.970440366273245, 54.374... [48.78048780487805, 38.63298662704309, 37.7864... [41.830064699495054, 45.295506922269944, 44.34... [45.23809231868367, 45.23809231868366, 58.8677...
1 [1999-11-18 00:00:00, 1999-11-19 00:00:00, 199... [1980-12-12 00:00:00, 1980-12-15 00:00:00, 198... [1970-01-02 00:00:00, 1970-01-05 00:00:00, 197... [1980-03-17 00:00:00, 1980-03-18 00:00:00, 198... [1993-10-25 00:00:00, 1993-10-26 00:00:00, 199...
2 [31.473533630371094, 30.897313435872395, 31.02... [0.5133928656578064, 0.5074404809210036, 0.494... [30.625, 30.708333333333332, 30.71759259259259... [0.0954861119389534, 0.09510030928585264, 0.09... [0.9375, 0.9212962918811374, 0.908693407788688...
I would prefer it to output like this:
rsi date ema8 ema20 ema50 ema100 ema200
A 54.9828 1999-11-18 31.4735 31.4735 31.4735 31.4735 31.4735
A 51.7284 1999-11-19 30.8973 31.2266 31.3718 31.4222 31.4477
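The closest I have come to that shape for the RSI piece is a pandas groupby sketch like the one below; rsi_frame is my own helper, the ema columns are omitted, and it assumes every ticker has more than 14 rows:
def rsi_frame(g):
    # tulipy's rsi returns len(input) - period values, so pad the
    # front with NaN to line each value up with its date
    out = g[['Date', 'Close']].copy()
    rsi = ti.rsi(g['Close'].to_numpy(), 14)
    out['rsi'] = np.concatenate([np.full(14, np.nan), rsi])
    return out

tidy = df.groupby('Ticker', sort=False, group_keys=True)[['Date', 'Close']].apply(rsi_frame)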
So my main questions are:
- Is there a better way to perform vectorized calculations on this dataset while maintaining the order/matching of stock symbols, dates, and their calculated values?
- Could I disregard storing all the data in a dictionary and keep everything in a numpy array?
- Can I directly store the values from the numpy array into a sql database or pandas dataframe that is tabular in nature?
UPDATE for joining dates and rsi on dataframe:
df_dictt = pd.DataFrame(dictt).T   # one row per ticker; column 0 = rsi, column 1 = dates
frames = []
for i in range(len(df_dictt)):
    # pd.DataFrame on a 1-D array names its single column 0, so both
    # renames map from 0 (the original {1: 'date'} rename was a no-op)
    dates = pd.DataFrame(df_dictt.iloc[i, 1]).rename(columns={0: 'date'})
    rsi = pd.DataFrame(df_dictt.iloc[i, 0]).rename(columns={0: 'rsi'})
    df_join = dates.join(rsi, how='left')
    df_join['rsi'] = df_join['rsi'].shift(14)  # rsi is 14 values shorter than the dates
    df_join['ticker'] = df_dictt.index[i]
    frames.append(df_join)
output = pd.concat(frames, ignore_index=False)  # DataFrame.append is removed in pandas 2.x
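For the SQL side (question 3), I assume pandas' to_sql is the direct route from here; a minimal sketch, where the SQLite file and table name are placeholders of mine:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///indicators.db')  # placeholder database
output.to_sql('rsi_values', engine, if_exists='replace', index=False)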