1

I am having a hard time coming up with the correct query to get the data I need, and I wonder whether it's something that can realistically be done using SQL or if I should deal with it in Python.

My first Table (api_results) looks like this

ticker entry date change
AAPL 2 2022-06-09 1.05
TSLA 7 2022-09-09 -0.34

And the second one (historical_data) looks like this, excluding irrelevant columns.

Ticker Date Change
AAPL 2022-06-09 1.05
AAPL 2022-07-09 -0.34
AAPL 2022-08-09 2.5
AAPL 2022-09-09 1.12
AAPL 2022-10-09 ...
AAPL 2022-11-09 ...
MSFT 2022-06-09 ...
MSFT 2022-07-09 ...
MSFT 2022-08-09 ...
MSFT 2022-09-09 ...
MSFT 2022-10-09 ...
MSFT 2022-11-09 ...

I am trying to get the date in api_results.date for every ticker in api_results and see what the sum of changes in a 10-days span is for each ticker in api_results. The non-dynamic query is the following:

SELECT ticker, Date,  sum(change) as change FROM
(
    SELECT change, ticker, Date FROM historical_data
    WHERE ticker = 'T' AND date >= '2013-12-13 00:00:00'
    limit 10
)

Which returns

Ticker Date Change
T 2013-12-13 3.76

The result is correct, but how can I do the same thing for every date-ticker pair in api_results. The idea is to apply a function to every row of the table. What I would do in Python is this:

import sqlite3
from config import db_path
import pandas as pd

connection = sqlite3.connect(db_path)
cursor = connection.cursor() 

historical_data = pd.read_sql("SELECT Date, Ticker, Change from historical_data", connection)
 
api_results = cursor.execute("SELECT ticker, date from api_results").fetchall()

data = []

for ticker, date in api_results:
    index = list(historical_data['Date'].index(date))
    data.append(historical_data.iloc[index:index+10]['Change'].sum())

This appears to be working just fine, but it takes way too long. After two hours, the loop was still going. Mind you, that historical data has 1M+ rows of data and 30 columns.

Should I just find a better way of doing it in Python, like vectorizing a function over historical data, or even just building a data frame and shifting the change appropriately, or can this be done using SQL? Thank you for your time and help.

cdbullard
  • 143
  • 2
  • 10
  • This should be possible within SQLite using the window functions, see https://www.sqlite.org/windowfunctions.html. Conceptually, you need the rolling sum over your historical data change ordered by date and grouped by ticker from which you select the relevant rows with ticker and date from your API results. See https://stackoverflow.com/questions/5606560/how-do-i-calculate-a-running-sum-on-a-sqlite-query for an example usage of the window functions. – Carlos Horn Jun 11 '22 at 23:22

1 Answers1

3

The query I came up with is the following:

SELECT 
    t1.ticker, t1.date, t2.change_10
FROM 
    api_results AS t1
INNER JOIN (
    SELECT 
        ticker, 
        date, 
        SUM(change) OVER (
            PARTITION BY ticker
            ORDER BY date
            RANGE BETWEEN 0 PRECEDING AND 10 FOLLOWING
        ) AS change_10
    FROM 
        historical_data
    ORDER BY 
        ticker, date
) AS t2
ON t1.ticker = t2.ticker AND t1.date = t2.date

where sub-query builds the rolling sum over changes in the historical data for 10 following entries, as you do with index:index+10 which I called change_10. In order to get the relevant bits from the sub-query, I use an inner join on a query on api_results to get the needed ticker and date combinations.

Carlos Horn
  • 1,115
  • 4
  • 17
  • This works! Thanks a lot, I didn't know about the RANGE keyword in SQL, looks like I've got some more studying to do. – rickturner2001 Jun 12 '22 at 00:34
  • 1
    I think I made an indexing error, and it should be `RANGE BETWEEN CURRENT ROW AND 9 FOLLOWING` to build a window of 10. – Carlos Horn Jun 12 '22 at 00:52
  • Yeah so the structure of the query is exactly what I was looking for but the results seem to be a bit off. for instance my first 10 changes for ticker "t" starting at date "2013-12-13 00:00:00" are -0.11803251423189165 0.8862601256577118 -0.8784745559542451 1.5657267193521784 0.20360562430146167 -0.4354122917803349 1.1370315301767686 0.7494965538242138 0.6008550137148072 0.056884557745906505 which is 3. something but on the query it shows T | 2013-05-29 00:00:00 | -0.7462704798258724 – rickturner2001 Jun 12 '22 at 01:12
  • 1
    Found the problem thanks to the source you gave above. It's supposed to be ROWS BETWEEN instead of RANGE BETWEEN. Thanks again – rickturner2001 Jun 12 '22 at 01:30