-1

I need your opinions on how to solve this matter

I have a small data subset from a large data set.

My subset and my original data set follow this format:

here is a picture of how it looks

here it is written out:

Stock   Date    Price
Pfizer  18-Aug-2009 18.8
Yahoo   19-Aug-2012 27.1
Facebook    20-Aug-2014 77.14
Amazon  21-Aug-2014 683.66
Merck   22-Aug-2005 22.9
Ford    23-Aug-2003 20.1

Let’s say my subset, is a set that I specifically selected.

I want to take this subset and let’s name it (MoneyFA). I want to set the prices in the MoneyFA to be T0. And let’s say the highest price in the previous year is T-1, and the year before that is t-2. And the price the year after is T1, and the year after is T2.

Now, how do I program a program that takes my subset which has variable dates and then it starts searching against the main data set (which has tons of data and let's name it Moneypool) for T values. And let say our window is 10 years before and 10 years after T0. (Keep in mind, a year might have several prices in the main data set for a certain company. We need it to select the highest price in that year)

I want it to give me a table or matrix similar to this (if there’s no value in the main data set, it could be kept empty or NA) :

Here is a picture of how it looks

Here it is written out:

Stock   T-10    T-9 T-8 T-7 T-6 T-5 T-4 T-3 T-2 T-1 T0  T+1 T+2 T+3 T+4 T+5 T+6 T+7 T+8 T+9 T+10
Pfizer  37.12   24.9        19.23       27.01   21.05   17.12   19.88   17.9    18.8    20.91   23.4        24.9                        
Yahoo           9.8 18.9    19.7        12.5    22.6    15.8        27.1                                        
Facebook                                35.86   49.62   63.38   77.14   90.9                                    
Amazon              208.9   533.42  558.46  600.2   608.54  700.1   658.62  683.66  708.7   733.74  809.11                          
Merck   41.1    39.28   37.46   35.64   33.82   32  30.18   28.36   26.54   38.7    22.9    21.08   29.8                                
Ford    66.1    61.5    56.9    52.3    47.7    43.1    38.5    33.9    29.3    24.7    20.1    15.5    10.9    6.3 1.7 4.2 7.3 8.9 10.5    12.1    11.8
Aurèle
  • 12,545
  • 1
  • 31
  • 49
Nemo
  • 3
  • 3
  • Can you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Samuel Mar 27 '17 at 07:19

1 Answers1

0

Here is an example using data.table. The Year is extracted from the Date, then the max Price (per Stock and Year) is calculated on Years within 10 years from year_zero. Year is then expressed relative to year_zero (in T) and results are then cast in wide format:

library(data.table)
d <- structure(list(Stock = c("Pfizer", "Yahoo", "Facebook", "Amazon", 
"Merck", "Ford"), Date = c("18-Aug-2009", "19-Aug-2012", "20-Aug-2014", 
"21-Aug-2014", "22-Aug-2005", "23-Aug-2003"), Price = c(18.8, 
27.1, 77.14, 683.66, 22.9, 20.1)), .Names = c("Stock", "Date", 
"Price"), row.names = c(NA, -6L), class = "data.frame")

setDT(d)
year_zero <- 2010
# Extract Year from Date
d[, Year := as.integer(substr(Date, start = 8, 12))]
# Filter to include year_zero +/- 10 years and get max price per stock per year
d <- d[Year >= year_zero - 10 & Year <= year_zero + 10,
  .(Max_Price = max(Price, na.rm = TRUE)), by = .(Stock, Year)]
# Add T as interval to year_zero (and convert to factor in order to get all 
# values from -10 to 10
d[, "T" := factor(Year - year_zero, levels = seq(-10, 10), ordered = TRUE)]
# Cast to wide format (fill missing values with NA)
dcast(d, Stock~T, value.var = "Max_Price", drop = FALSE)
# Cast to wide format (fill missing values with "")
dcast(d, Stock~T, value.var = "Max_Price", drop = FALSE, fun.aggregate = paste0,
  fill = "")

The output of the first dcast is:

      Stock -10 -9 -8   -7 -6   -5 -4 -3 -2   -1  0  1    2  3      4  5  6  7  8  9 10
1:   Amazon  NA NA NA   NA NA   NA NA NA NA   NA NA NA   NA NA 683.66 NA NA NA NA NA NA
2: Facebook  NA NA NA   NA NA   NA NA NA NA   NA NA NA   NA NA  77.14 NA NA NA NA NA NA
3:     Ford  NA NA NA 20.1 NA   NA NA NA NA   NA NA NA   NA NA     NA NA NA NA NA NA NA
4:    Merck  NA NA NA   NA NA 22.9 NA NA NA   NA NA NA   NA NA     NA NA NA NA NA NA NA
5:   Pfizer  NA NA NA   NA NA   NA NA NA NA 18.8 NA NA   NA NA     NA NA NA NA NA NA NA
6:    Yahoo  NA NA NA   NA NA   NA NA NA NA   NA NA NA 27.1 NA     NA NA NA NA NA NA NA
  • This looks so good, the only thing is, T0 should be assigned by me. So the first picture actually contains T0 values. So it is not linked with a year. Thank you so much Kristoffer. – Nemo Mar 27 '17 at 09:17
  • You are welcome! You can use the year_zero variable to specify the year for T0 if that is what you want. If you do not want any data for T0 you can change the filtering to d[Year >= year_zero - 10 & Year <= year_zero + 10 & Year != year_zero, ....]. Then you can assign data to T0 yourself. – Kristoffer Winther Balling Mar 27 '17 at 09:25
  • You are brilliant, sir! I'll let you know how it goes. – Nemo Mar 27 '17 at 09:31
  • Hi Kristoffer! Unfortunately the model is not working well for me. The problem I'm facing is when I try to integrate it to the huge main data set which contains many companies other than the ones I'm really interested in. Also, I would like the T0 value to be the same as my initial small subset. Then I'd like it to go back to the main data set to fill out the two sides of the T0 value from the master data set. Any suggestions? – Nemo Mar 28 '17 at 06:15
  • Any idea on how to let it fill out the missing data from the master data file? which I have in another csv file? (it contains many different companies though) – Nemo Mar 29 '17 at 06:08
  • To exclude or include companies I would use a data.table filter, e.g. dt[company %in% c("Merck", "Pfizer" ...)]. If you have your T0 data in a separate data.table I would merge with the T-10 to T10 data using company as key. – Kristoffer Winther Balling Mar 29 '17 at 08:41
  • I have a question: How can I make T0 different for each company. So the model calculates Separate T0 for each company uniquely. ( I have the T0s for each company independently) For example let say for merck I have the year 2006 as my t0, for fb, I have the year 2011 as my t0. And so on. – Nemo Aug 15 '17 at 08:57
  • You say "You are welcome! You can use the year_zero variable to specify the year for T0 if that is what you want. If you do not want any data for T0 you can change the filtering to d[Year >= year_zero - 10 & Year <= year_zero + 10 & Year != year_zero, ....]. Then you can assign data to T0 yourself." How can I assign it myself? – Nemo Aug 15 '17 at 09:05