5

I've been working with the WRDS/CRSP dataset (a stock price database maintained by UPenn for academic research). I've been downloading the data in Python and inserting it into my local MySQL database.

The data looks like this and has primary key on (quote_date, security_id):

quote_date  security_id tr              accum_index
10-Jan-86   10002       null            1000
13-Jan-86   10002       -0.026595745    973.4042548
14-Jan-86   10002       0.005464481     978.7234036
15-Jan-86   10002       -0.016304348    962.7659569
16-Jan-86   10002       0               962.7659569
17-Jan-86   10002       0               962.7659569
20-Jan-86   10002       0               962.7659569
21-Jan-86   10002       0.005524862     968.0851061
22-Jan-86   10002       -0.005494506    962.765957
23-Jan-86   10002       0               962.765957
24-Jan-86   10002       -0.005524862    957.4468078
27-Jan-86   10002       0.005555556     962.7659569
28-Jan-86   10002       0               962.7659569
29-Jan-86   10002       0               962.7659569
30-Jan-86   10002       0               962.7659569
31-Jan-86   10002       0.027624309     989.3617013
3-Feb-86    10002       0.016129032     1005.319148
4-Feb-86    10002       0.042328041     1047.872338
5-Feb-86    10002       0.04568528      1095.744679

I need to calculate the accum_index column which is basically an index of the total return of the stock and is calculated as follows:

accum_index_t = accum_index_{t-1} * (1 + tr_t)

The table has 80m rows. I've wrote some code to iterating through every security_id and calculate a cumulative product, like so:

select @sid := min(security_id)
from stock_prices;

create temporary table prices (
    quote_date datetime,
    security_id int,
    tr double null,
    accum_index double null,
    PRIMARY KEY (quote_date, security_id)
);

while @sid is not null
do

    select 'security_id', @sid;
    select @accum := null;

    insert into prices
    select quote_date, security_id, tr, accum_index
    from stock_prices
    where security_id = @sid
    order by quote_date asc;

    update prices
    set accum_index = (@accum := ifnull(@accum * (1 + tr), 1000.0));

    update stock_prices p use index(PRIMARY), prices a use index(PRIMARY)
    set p.accum_index = a.accum_index
    where p.security_id = a.security_id
    and p.quote_date = a.quote_date;

    select @sid := min(security_id)
    from stock_prices
    where security_id > @sid;

    delete from prices;

end while;

drop table prices;

But this is too slow, it's taking about a minute per security on my laptop and it will take years to calculate this series. Is there a way to vectorise this?

Cheers, Steve

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
swmfg
  • 1,279
  • 1
  • 10
  • 18
  • Can you show us several rows of sample data, along with the expected output? And please omit columns in the input which are irrelvant to your question. – Tim Biegeleisen Feb 25 '18 at 04:41
  • Yep, added more samples. Thanks – swmfg Feb 25 '18 at 04:53
  • See https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query – Strawberry Feb 25 '18 at 08:08
  • 1
    Thanks, I got around it by rewriting my code and doing everything in Python, but I will keep that (MCVE) in mind. – swmfg Feb 26 '18 at 04:07
  • 1
    WRDS also makes these data available in PostgreSQL on the cloud. You can connect with your credentials to `wrds-pgdata.wharton.upenn.edu` on port `9737`. – FlipperPA Feb 26 '18 at 11:50
  • Thanks. I've also got around to setting up an unix account. – swmfg Feb 27 '18 at 13:19

2 Answers2

8

If you're using MySQL 8, you could use window functions to create the cumulative product. Unfortunately, there is no PROD() aggregate / window function in any SQL database I'm aware of, but you can emulate it using EXP(SUM(LOG(factor))):

SELECT
  quote_date,
  security_id,
  tr,
  1000 * (EXP(SUM(LOG(1 + COALESCE(tr, 0)))
    OVER (PARTITION BY security_id ORDER BY quote_date)))
    AS accum_index
FROM stock_prices

dbfiddle here.

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
  • I was working on a similar problem and attempted to recreate the example output to ensure I understood it and could not replicate it. It turns out that the 1000 value should be on the outside: `1000 *EXP(SUM(LOG( (1 + COALESCE(tr, 1)))) OVER (PARTITION BY security_id ORDER BY quote_date))` – Chris Aug 16 '21 at 16:43
  • Thanks, @Chris, you're right. More than that, there's also an off-by-1 error. I've fixed this and added a dbfiddle – Lukas Eder Aug 18 '21 at 12:16
  • @LukasEder The fiddle does not seem to produce the correct output? (e.g. 1986-01-13 value is 946.8 vs 973.4). Also the fiddle uses `LN` while the answer above uses `LOG` – Dylan Dec 28 '21 at 21:58
  • 1
    @Dylan: Thanks. In MySQL, `LOG(x)` is the same as `LN(x)`. I've updated the dbfiddle and code to fix the off-by-1 error, which I had fixed wrongly before. – Lukas Eder Dec 29 '21 at 09:47
0

If you're using MySQL 5, you can emulate this function multiplying current with last tr line by line. After that we take the accumulated value of the last line.

tr is percentual value, right now? So lets add 1 to each tr.

The first stored value will be neutral 1.

Try this:

SET @variation = 1;
SET @row_number = 0;

SELECT accumulateTr
FROM
    (SELECT
        @row_number := (@row_number + 1) AS rowNumber,
        @variation := (1 + variation) * @variation AS accumulateTr
     FROM
        prices) accumulatedTrs
ORDER BY rowNumber DESC
LIMIT 1;