I am having a hard time coming up with the correct query to get the data I need, and I wonder whether it's something that can realistically be done using SQL or if I should deal with it in Python.
My first Table (api_results) looks like this
ticker | entry | date | change |
---|---|---|---|
AAPL | 2 | 2022-06-09 | 1.05 |
TSLA | 7 | 2022-09-09 | -0.34 |
And the second one (historical_data) looks like this, excluding irrelevant columns.
Ticker | Date | Change |
---|---|---|
AAPL | 2022-06-09 | 1.05 |
AAPL | 2022-07-09 | -0.34 |
AAPL | 2022-08-09 | 2.5 |
AAPL | 2022-09-09 | 1.12 |
AAPL | 2022-10-09 | ... |
AAPL | 2022-11-09 | ... |
MSFT | 2022-06-09 | ... |
MSFT | 2022-07-09 | ... |
MSFT | 2022-08-09 | ... |
MSFT | 2022-09-09 | ... |
MSFT | 2022-10-09 | ... |
MSFT | 2022-11-09 | ... |
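To make this reproducible, here is a minimal in-memory setup mirroring the two tables (the values shown as `...` above are replaced with made-up numbers, so treat those rows as placeholders):

```python
import sqlite3

# In-memory stand-in for the real database; the rows the question shows
# as "..." use made-up Change values here.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute(
    "CREATE TABLE api_results (ticker TEXT, entry INTEGER, date TEXT, change REAL)"
)
cursor.executemany(
    "INSERT INTO api_results VALUES (?, ?, ?, ?)",
    [("AAPL", 2, "2022-06-09", 1.05), ("TSLA", 7, "2022-09-09", -0.34)],
)

cursor.execute(
    "CREATE TABLE historical_data (Ticker TEXT, Date TEXT, Change REAL)"
)
cursor.executemany(
    "INSERT INTO historical_data VALUES (?, ?, ?)",
    [
        ("AAPL", "2022-06-09", 1.05), ("AAPL", "2022-07-09", -0.34),
        ("AAPL", "2022-08-09", 2.5), ("AAPL", "2022-09-09", 1.12),
        ("AAPL", "2022-10-09", 0.5), ("AAPL", "2022-11-09", -0.2),
        ("MSFT", "2022-06-09", 0.3), ("MSFT", "2022-07-09", 0.1),
        ("MSFT", "2022-08-09", -0.5), ("MSFT", "2022-09-09", 0.7),
        ("MSFT", "2022-10-09", 0.2), ("MSFT", "2022-11-09", 0.4),
    ],
)
connection.commit()
```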
I am trying to take the date in api_results.date for every ticker in api_results and compute, for each such ticker-date pair, the sum of changes over the following 10 days in historical_data. The non-dynamic query for a single pair is the following:
```sql
SELECT ticker, Date, SUM(Change) AS change FROM
(
    SELECT Change, ticker, Date FROM historical_data
    WHERE ticker = 'T' AND Date >= '2013-12-13 00:00:00'
    ORDER BY Date
    LIMIT 10
) AS first_ten_rows
```
Which returns
Ticker | Date | Change |
---|---|---|
T | 2013-12-13 | 3.76 |
The result is correct, but how can I do the same thing for every date-ticker pair in api_results? The idea is to apply this query to every row of the table. In Python I would do something like this:
```python
import sqlite3

import pandas as pd

from config import db_path

connection = sqlite3.connect(db_path)
cursor = connection.cursor()

historical_data = pd.read_sql("SELECT Date, Ticker, Change FROM historical_data", connection)
api_results = cursor.execute("SELECT ticker, date FROM api_results").fetchall()

data = []
for ticker, date in api_results:
    # Position of the row in historical_data matching this ticker/date pair
    index = historical_data.index[
        (historical_data["Ticker"] == ticker) & (historical_data["Date"] == date)
    ][0]
    # Sum Change over that row and the following nine
    data.append(historical_data.iloc[index : index + 10]["Change"].sum())
```
This appears to work, but it takes far too long: after two hours the loop was still running. Mind you, historical_data has over a million rows and 30 columns.
Should I just find a better way of doing it in Python, such as vectorizing a function over the historical data or building a data frame and shifting the Change column appropriately, or can this be done in SQL? Thank you for your time and help.
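For reference, the "shifting/vectorized" pandas version I have in mind would look roughly like this. It's only a sketch on toy data (made-up numbers stand in for the `...` values above, and I use MSFT instead of TSLA so both lookups hit the sample rows); a reversed rolling sum per ticker replaces the per-row loop:

```python
import pandas as pd

# Toy stand-ins for the real tables (column names from the question,
# Change values made up).
historical_data = pd.DataFrame({
    "Ticker": ["AAPL"] * 6 + ["MSFT"] * 6,
    "Date": ["2022-06-09", "2022-07-09", "2022-08-09",
             "2022-09-09", "2022-10-09", "2022-11-09"] * 2,
    "Change": [1.05, -0.34, 2.5, 1.12, 0.5, -0.2,
               0.3, 0.1, -0.5, 0.7, 0.2, 0.4],
})
api_results = pd.DataFrame({
    "ticker": ["AAPL", "MSFT"],
    "date": ["2022-06-09", "2022-08-09"],
})

# Forward-looking 10-row sum per ticker: sort, then reverse each group so a
# trailing rolling window becomes a leading one, then reverse back.
historical_data = historical_data.sort_values(["Ticker", "Date"])
historical_data["fwd_sum"] = (
    historical_data.groupby("Ticker")["Change"]
    .transform(lambda s: s[::-1].rolling(10, min_periods=1).sum()[::-1])
)

# One merge attaches the precomputed sums to every ticker-date pair.
result = api_results.merge(
    historical_data[["Ticker", "Date", "fwd_sum"]],
    left_on=["ticker", "date"],
    right_on=["Ticker", "Date"],
    how="left",
)
```

The merge replaces the Python-level loop entirely, which is what I assume "vectorizing" would buy me over iterating a million rows.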