0

I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes do not have Nx observations on stock trading days and sometimes we have Nx observations on non-trading days. We want to place an "NA" where we do not have any Nx observations on trading days but eliminate Nx observations where we have them on non-trading day since without trading data for the same day, Nx observations are useless.

The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).

merge data frames to eliminate missing observations

The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):

# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame

# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame

# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)

The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.

I've read the R Documentation for merge.zoo regarding the use of "all": that its use "allows intersection, union and left and right joins to be expressed". But trying all combinations of the following use of "all" yielded the same results (as to why would be a secondary question).

z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"

While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)

Community
  • 1
  • 1
W Barker
  • 324
  • 2
  • 8

2 Answers2

2

The all argument of merge.zoo must be (quoting from the help file):

logical vector having the same length as the number of "zoo" objects to be merged 
(otherwise expanded)

and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).

merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))

The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments whereas merge.data.frame only handles two so the syntax had to be extended to handle that.

Also note that %Y should have been %y in the question's code.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • I wasn't able to answer @Henrik's good question about the need for the zoo package; I guess you have and I've learned something from it. I was just working on applying the earlier answer to the real problem that, unlike the simplified question, has dates spanning years; the merged dates were not in order, but are when I go back and apply your solution to my zoo objects. Also, %y rather than %Y: noted. Thanks. – W Barker Feb 26 '14 at 20:21
0

I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.

merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)

#   Date_Stk ABC_Stk DEF_Stk GHI_Stk   ABC_Nx   DEF_Nx  GHI_Nx
# 1   1/2/13   65.73   42.98   32.18 51.42857 35.23809 7.14256
# 2   1/3/13   66.85   42.92   31.73       NA       NA      NA
# 3   1/4/13   66.92   43.47   32.43 51.67565 36.66667 8.44573
# 4   1/7/13   66.60   43.16   32.13 58.57143 27.23150 5.99239
# 5   1/8/13   66.07   43.71   32.18 58.99564 26.94331 6.10034 
Henrik
  • 65,555
  • 14
  • 143
  • 159
  • Thanks very much. Why doesn't the zoo date index appear on your solution? Also just for my own learning, I substituted "all.y = TRUE". The NA's appear in the expected places, but the dates are out of order. Any thoughts as to why? Thanks again. – W Barker Feb 26 '14 at 15:01
  • You are welcome! I don't understand what you mean by " Why doesn't the zoo date index appear". I use your original data frames, not `zoo` objects. About the order (or lack thereof): merge sometimes messes up the order, despite you try to use the `sort` argument. If your 'merging variables' have the same name in the two data sets (e.g. you could could name "Date_Stk" and "Date_Nx" both 'date' for example - they are both...eeeh...dates), you may try `join` in package `plyr`. From `?join`: "Unlike `merge`, preserves the order of x no matter what join type is used." – Henrik Feb 26 '14 at 15:13
  • Thanks @Henrik. While the other solution seems to work fine, I'm glad to have your additional comments since I'm still pretty far down on the R learning curve. – W Barker Feb 27 '14 at 16:52