
I am trying to use python pandas to compute:

10-day and 30-day cumulative % performance in (stock ticker RTH "minus" stock ticker SPY) after a certain performance threshold in stock ticker USO occurs (>=10% in a 5-day window)

Here is my code:

import pandas as pd
import datetime
import pandas_datareader.data as web
from pandas import Series, DataFrame

start = datetime.datetime(2012, 4, 1)
end = datetime.datetime.now()

dfcomp = web.DataReader(['USO', 'RTH', 'SPY'], 'yahoo', start=start, end=end)['Adj Close']

dfcomp_daily_returns = dfcomp.pct_change()

dfcomp_daily_returns = dfcomp_daily_returns.dropna().copy()

dfcomp_daily_returns.head()

Symbols USO RTH SPY
Date            
2012-04-03  -0.009243   -0.004758   -0.004089
2012-04-04  -0.020676   -0.007411   -0.009911
2012-04-05  0.010814    0.003372    -0.000501
2012-04-09  -0.007387   -0.006961   -0.011231
2012-04-10  -0.011804   -0.018613   -0.016785

(screenshot: excel_sheet11, showing a TRUE/FALSE trigger column and the highlighted output cells)

I added several more rows so it might be easier to work with if someone can help:

Symbols USO RTH SPY
Date            
2012-04-03  -0.009243   -0.004758   -0.004089
2012-04-04  -0.020676   -0.007411   -0.009911
2012-04-05  0.010814    0.003372    -0.000501
2012-04-09  -0.007387   -0.006961   -0.011231
2012-04-10  -0.011804   -0.018612   -0.016785
2012-04-11  0.012984    0.010345    0.008095
2012-04-12  0.011023    0.010970    0.013065
2012-04-13  -0.007353   -0.004823   -0.011888
2012-04-16  0.000766    0.004362    -0.000656
2012-04-17  0.011741    0.015440    0.014812
2012-04-18  -0.014884   -0.000951   -0.003379
2012-04-19  -0.002305   -0.006183   -0.006421
2012-04-20  0.011037    0.002632    0.001670
2012-04-23  -0.009139   -0.015513   -0.008409
2012-04-24  0.003587    -0.004364   0.003802
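For reproducibility, the sample rows above can be loaded straight into a DataFrame without a live `pandas_datareader`/Yahoo call (a sketch using only the values pasted above):

```python
import io
import pandas as pd

# Recreate the sample daily returns pasted above so the problem can be
# reproduced offline, without pandas_datareader.
data = """Date,USO,RTH,SPY
2012-04-03,-0.009243,-0.004758,-0.004089
2012-04-04,-0.020676,-0.007411,-0.009911
2012-04-05,0.010814,0.003372,-0.000501
2012-04-09,-0.007387,-0.006961,-0.011231
2012-04-10,-0.011804,-0.018612,-0.016785
2012-04-11,0.012984,0.010345,0.008095
2012-04-12,0.011023,0.010970,0.013065
2012-04-13,-0.007353,-0.004823,-0.011888
2012-04-16,0.000766,0.004362,-0.000656
2012-04-17,0.011741,0.015440,0.014812
2012-04-18,-0.014884,-0.000951,-0.003379
2012-04-19,-0.002305,-0.006183,-0.006421
2012-04-20,0.011037,0.002632,0.001670
2012-04-23,-0.009139,-0.015513,-0.008409
2012-04-24,0.003587,-0.004364,0.003802"""

dfcomp_daily_returns = pd.read_csv(
    io.StringIO(data), index_col="Date", parse_dates=True
)
```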
– new2coding
  • Not sure I understand your computational logic, can you illustrate with the desired output given your input? – itprorh66 Apr 08 '22 at 17:17
  • Thanks so much for responding, itprorh66! Basically, what I want to solve for is: over the last 10 years, when USO moves up by 10% or more in a 5-day trading window, what is the resulting 10-day and 30-day cumulative % performance of RTH net of SPY? – new2coding Apr 08 '22 at 17:30
  • itprorh66 et al. - I just added a snapshot from Excel ("excel_sheet11") to show what I am trying to solve for. Basically, when you get a "TRUE" value in that right column, I want to add up all those subsequent yellow boxes and output a % value. That would be 10 days of (RTH minus SPY) cumulative % performance starting the day AFTER "TRUE" occurs. I'd also like to get the 30-day cumulative % performance for when "TRUE" occurs. Please let me know if I am still not being clear. Thanks so much for the help! – new2coding Apr 09 '22 at 00:11
  • It would be nice if you could provide the example data to reproduce your problem (in general, for posting here, try to present the data in your post rather than in a picture). I think it is clear now what you want to achieve. – Rabinzel Apr 09 '22 at 00:20
  • Rabinzel - thank you, and I agree with you. I tried to paste the Excel values into the above, but it wouldn't allow me because I am new. I would add those two right-most columns - shown in the Excel snapshot - to my pandas DataFrame, but I do not know how, and I've tried all day to figure it out. I added some more rows to the dataframe above for ease of working with it, if someone cares to take a stab. Again, thank you very much! – new2coding Apr 09 '22 at 01:09

1 Answer


I think this is a solution to your question. Note that I copied your code up to dropna(), and have also used import numpy as np. You don't need to use from pandas import Series, DataFrame, especially as you have already used import pandas as pd.

The main computations use rolling, apply and where.

import numpy as np

# 5-day cumulative % return of USO (compounded)
dfcomp_daily_returns["5_day_cum_%"] = dfcomp_daily_returns["USO"].rolling(5).apply(lambda x: np.prod(1 + x) - 1)
# RTH minus SPY
dfcomp_daily_returns["RTH-SPY"] = dfcomp_daily_returns["RTH"] - dfcomp_daily_returns["SPY"]
# 10-day forward cumulative %, kept only where the 5-day USO move exceeds 10%
dfcomp_daily_returns["output_10"] = (
    dfcomp_daily_returns["RTH-SPY"].rolling(10).apply(lambda x: np.prod(1 + x) - 1)
    .shift(-10).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
)
# 30-day forward cumulative %, same condition
dfcomp_daily_returns["output_30"] = (
    dfcomp_daily_returns["RTH-SPY"].rolling(30).apply(lambda x: np.prod(1 + x) - 1)
    .shift(-30).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
)

I won't print the output, given that there are thousands of rows, and the occurrences of ["5_day_cum_%"] > 0.1 are irregular.

How this code works:

  • The 5_day_cum_% is calculated over a rolling 5-day window, compounding the daily returns via the product of (1 + x), minus 1.
  • RTH-SPY is column RTH "minus" column SPY.
  • The output columns calculate the rolling compounded product of RTH-SPY, then use .shift() to make the window forward-looking (it is not possible to use .rolling() to roll forwards; this idea came from Daniel Manso here). Finally, .where() keeps these values only where 5_day_cum_% > 0.1 (i.e. 10%), returning np.nan otherwise.
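As a self-contained illustration of the rolling-product / shift / where pattern on a tiny synthetic series (short 3-day trigger and 2-day forward window so the numbers are easy to follow; not real market data):

```python
import numpy as np
import pandas as pd

# Synthetic daily returns: a 12% jump on day 3 triggers the condition.
s = pd.Series([0.0, 0.0, 0.12, 0.0, 0.0, 0.01, 0.01, 0.02, 0.0, 0.0])

# 3-day cumulative % via a rolling compounded product.
cum3 = s.rolling(3).apply(lambda x: np.prod(1 + x) - 1)

# Forward-looking 2-day cumulative %: compute the trailing rolling product,
# shift it back by the window length so each row sees the NEXT 2 days,
# and keep values only where the trigger (>10% over 3 days) fired.
fwd2 = (
    s.rolling(2).apply(lambda x: np.prod(1 + x) - 1)
     .shift(-2)
     .where(cum3 > 0.1, np.nan)
)
```

Here `cum3` exceeds 0.1 on the day of the jump and the two days after, so `fwd2` holds the subsequent 2-day compounded return on exactly those rows and NaN elsewhere.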

Additions from comments

From your additions in the comments, here are two options for each (one using .where again, the other using standard boolean row filtering; I'm not sure if that has an official name). In both cases, the standard filtering is shorter.

A list of all the dates:

# Option 1: pd.where    
list(dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"]).index)
# Option 2: standard pandas filtering
list(dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1].index)

A dataframe of only those with 5-day return greater than 10%:

# Option 1: pd.where
dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"])[["5_day_cum_%", "output_10", "output_30"]]
# Option 2: standard pandas row filtering
dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1][["5_day_cum_%", "output_10", "output_30"]]
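On a toy frame, the two options can be checked to produce identical results (a sketch; the column values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with the columns produced by the answer above (illustrative values).
df = pd.DataFrame({
    "5_day_cum_%": [0.12, np.nan, 0.05, 0.15],
    "output_10": [0.01, np.nan, 0.02, 0.03],
    "output_30": [0.04, np.nan, 0.05, 0.06],
})

# Option 1: .where masks non-qualifying rows to NaN, then dropna removes them.
opt1 = df.where(df["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"])

# Option 2: boolean row filtering keeps qualifying rows directly.
opt2 = df[df["5_day_cum_%"] > 0.1]
```

Both keep only the rows where the 5-day cumulative return exceeds 10%.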
– Rawson
  • S Rawson - wow, amazing you are (if you are into Star Wars ;) )! It looks great. How would I now have pandas 1) output the actual dates when the 10% move in a 5-day period occurred, and 2) give that 5-day aggregation value, along with the accompanying 3) 10-day and 4) 30-day outputs? – new2coding Apr 09 '22 at 13:19
  • Glad it works I am! I have added these additions in the answer above. The second code includes all three columns (5-day, 10-day and 30-day). If you change the column selection at the end to just one, you can choose what you want to see. – Rawson Apr 09 '22 at 15:10
  • As you have the list already, from the additions, if you enclosed the list with `len()`, i.e. `len(list(dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1].index))`, you would get the number of occurrences (53). `len()` is the length, and the length of the list of dates will be the number of occurrences. – Rawson Apr 10 '22 at 21:13
  • S Rawson - Amazing again! I've been practicing what you have shown me. TYVM! What about if I wanted to output: 1) the actual number of dates where rolling 5-day USO > 10% (i.e. the number of occurrences) - some count function, I suppose; 2) the average of those sums; 3) the average of those 10-day rolling sums; 4) the average of those 30-day rolling sums. And I assume that if I wanted to perform all these computations where USO 5-day rolling is >= -10%, it would just be (dfcomp_daily_returns["5_day_cum_%"] > -(0.1), np.nan) vs. (dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan). What's the right way to think.. – new2coding Apr 10 '22 at 21:16
  • ..about how Python treats negative numbers? – new2coding Apr 10 '22 at 21:17
  • `(dfcomp_daily_returns["5_day_cum_%"] >= -0.1, np.nan)` works; there is no need for the `()` around the `0.1`. You also want the `>=` for greater than or equal to. I assume you do mean better than or equal to -10%, not worse than or equal to (which would be `<=`). – Rawson Apr 10 '22 at 21:24
  • To get an average of a column: `dfcomp_daily_returns["output_30"].mean()`. You can change the column name to suit your purposes, or even use `[["output_30", "output_10"]]` to get both at once. – Rawson Apr 10 '22 at 21:24
  • By "the average of those sums", I assume you mean of the rolling 5-day? In which case, just use that column name in the code in the comment above. – Rawson Apr 10 '22 at 21:26
  • Ah yes, I want "worse than or equal to" -10%, i.e. -10%, -11%... I want to see when USO has a large move lower (-10%, or in excess of -10% to the downside). – new2coding Apr 10 '22 at 21:36
  • Then you will want to change that to `<= -0.1`, for "less than or equal to -10%". – Rawson Apr 10 '22 at 21:38
  • If you think this answer answers your question, you can [accept this](https://stackoverflow.com/help/someone-answers), so others know that this has been solved if they come across it. – Rawson Apr 10 '22 at 21:47
  • Technical Python question: After I get the occurrences (53) for the 5-day rolling USO with `len(list(dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1].index))`, and I want to know the mean of those occurrences with `dfcomp_daily_returns["5_day_cum_%"].mean()`, how do I know for sure I'm getting the mean of those 53 occurrences and not the mean of the entire 5-day rolling USO column (which contains many rows where the 5-day rolling return is less than 10%)? I guess because of chronology? Is there a smart way in general to guard against querying the wrong set of numbers? – new2coding Apr 10 '22 at 21:57
  • Ah! You will need to do this: `dfcomp_daily_returns[dfcomp_daily_returns ["5_day_cum_%"] > 0.1]["5_day_cum_%"].mean()` to get only those with more than 10%. That was a good spot, I had forgotten that that column had more data than the output columns. Alternatively, you could use `dfcomp_daily_returns.dropna(subset=["output_30"])["5_day_cum_%"].mean()` as only those which meet the requirement have something in `output_30`. These should work, but I am on my phone now, so can't run these to try myself. – Rawson Apr 10 '22 at 22:11
  • The `dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1]["5_day_cum_%"].mean()` seems like it worked. The 53-occurrence average went from about 0.10% to 15%. You go on to say "Alternatively...", and I want to make sure I understand what this is exactly: `dfcomp_daily_returns.dropna(subset=["output_30"])["5_day_cum_%"].mean()`. We were trying to solve for the mean of the 53 rolling 5-day USO occurrences, so why is there "output_30" code in there? Guessing it is some sort of filtering for what we were trying to do in the code where we solved for 53. I don't want to miss your point! – new2coding Apr 10 '22 at 22:47
  • They should result in exactly the same answer. Because the dataframe only has data in column output_30 when the condition in met, if you dropna in the column you will only be left with rows where the condition is met. So yes, it is just another way of filtering the dataframe. I have found that it is very much normal to have more than one way of completing the same thing in Python! – Rawson Apr 11 '22 at 06:29
  • Tricky tricky. The 2 answers are actually different when I test it out. The reason being is because this dataframe's data runs up to the current day. As such, there have been many +10%/+10%+ days over the last month AND importantly for our conversation - within the last 30 days. Meaning - there are +10% USO 5-day rolling averages in the last 30 days (as recently as 3/23/22) where there haven't been 30 days of rolling data yet for (RTH-SPY). – new2coding Apr 12 '22 at 20:10
  • To close this out - How can I make sure that I am able to list the 53 instances of the 10 day rolling (RTH-SPY) with their accompanying outputs and then the average for the 53 instances. Let's do the same - but a separate computation - for the 30-day figure to keep the data/output clean. (After the above comment of mine - we know that the 30-Day rolling (RTH-USO) will be incorrect but that's OK.) – new2coding Apr 12 '22 at 20:17
  • Using the last block of code in the answer, if you change the `[["5_day_cum_%", "output_10", "output_30"]]` to `["output_10"]`, you will return all instances of the `5_day_cum_% > 0.1`. You can then add `.mean()` to this to find the average. And then the same with the `output_30`. – Rawson Apr 13 '22 at 07:23
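The computations discussed across these comments (event count, averages of the triggering 5-day moves and of the forward outputs, and the downside variant) can be sketched together as follows, using the column names from the answer; the frame here is a toy illustration, not real output:

```python
import pandas as pd

# Toy frame with the columns produced by the answer (illustrative values;
# in the real output, output_10/output_30 are NaN where the trigger fails).
df = pd.DataFrame({
    "5_day_cum_%": [0.12, 0.05, 0.15, -0.12, 0.02],
    "output_10":   [0.010, 0.004, 0.020, -0.030, 0.001],
    "output_30":   [0.030, 0.008, 0.050, -0.060, 0.002],
})

# Keep only the rows where the 5-day USO move exceeded 10%.
triggered = df[df["5_day_cum_%"] > 0.1]

n_events = len(triggered)                 # number of >10% occurrences
avg_5 = triggered["5_day_cum_%"].mean()   # mean of the triggering 5-day moves
avg_10 = triggered["output_10"].mean()    # mean 10-day forward performance
avg_30 = triggered["output_30"].mean()    # mean 30-day forward performance

# Downside variant from the comments: 5-day moves of -10% or worse.
crashes = df[df["5_day_cum_%"] <= -0.1]
```

Filtering first and then calling `.mean()` on the filtered frame is what guarantees the averages cover only the triggering rows, which was the concern raised above.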