Hard-to-understand behavior of the ccf function in the stats package

Question

The point of this question is to show that ccf is giving wrong answers.

I am writing a Shiny app. In one tab I want to plot the cross-covariance function using the ccf function from the stats package.

However, I found weird behavior in this function:

x <- rnorm(100)
y <- lag(x,-5) + rnorm(100)
ccf(y, x, ylab='CCovF', type='covariance')

yields the correct cross-covariance function plot:

However, when I changed the type of y, the plot was wrong:

y <- as.numeric(y)
ccf(y, x, ylab='CCovF', type='covariance')

Does anyone have any idea what is happening? What is causing this behavior, and how can I remedy it?

In the app, the y input will not have the tsp attribute; it will be just a plain numeric vector.

This is actually the most important part of the question: the y input will be a plain numeric vector, not a time-series object.

I tried using lag() on y to make it regain its attribute, but the function still didn't work:

ccf(lag(y,0), x, type = "covariance")


If this function only gives a correct answer when the y series is intentionally constructed as a lag of the x series, and not when it is naturally a lag of the x series, then it is of no use in real life.

Because the point is to take the y vector as is. Let's say that I have two time series, 1 and 2, stored in numeric variables x and y, and y lags x by 5 steps. It is naturally like this; it was not induced. I only used these x and y to show that there is an error in ccf. Does that make sense? — user9396820, Aug 15 '18 at 17:55
I do not want to change the y vector. Let's just suppose I created a y vector manually and it happened to lag x by 5 steps. The ccf function would not catch it, which makes this function obsolete. CCF should give the correct plot regardless of the type/attributes of the input. What is the point of it giving good results only if one series is intentionally lagged? — user9396820, Aug 15 '18 at 17:59
It can't catch it if there is no information it could use to guess what the lag is. — Ben Bolker, Aug 15 '18 at 20:11

Ben Bolker · Accepted Answer · 2018-08-15T20:37:57.993

Consider what R "knows" about these vectors:

set.seed(101)
x <- rnorm(100)
y <- lag(x,-5)+rnorm(100)
y_num <- as.numeric(y)

As @李哲源 points out in the comments, adding the lag() element makes y a time-series object, with a known lag:

str(y)
##  num [1:100] -0.058 -0.0397 1.4585 1.3871 1.0575 ...
## - attr(*, "tsp")= num [1:3] 6 105 1

The tsp attribute is a vector consisting of a starting value (6, i.e. a lag of 5), an ending value, and the frequency (see ?tsp).

str(x)
##  num [1:100] -0.326 0.552 -0.675 0.214 0.311 ...

In contrast, x is just a numeric vector. In the absence of a tsp attribute, R has no way of knowing what its lag is, and so it will assume that it starts at time 1.

When you convert y to numeric, it loses its tsp attribute, so R no longer knows what its lag is. The only sensible guess is that it starts at time 1 as well (i.e., lag 0).

str(y_num)
## num [1:100] -0.058 -0.0397 1.4585 1.3871 1.0575 ...
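
A quick way to check this for yourself (a minimal sketch; `x_lagged` is just an illustrative name, and `stats::lag` is spelled out in case another package such as dplyr masks `lag()`):

```r
set.seed(101)
x <- rnorm(100)
x_lagged <- stats::lag(x, -5)

## lag() only attaches a tsp attribute; the values are untouched
tsp(x_lagged)
## [1]   6 105   1
identical(as.numeric(x_lagged), x)
## [1] TRUE

## as.numeric() strips the attribute again
is.null(tsp(as.numeric(x_lagged)))
## [1] TRUE
```

So after `as.numeric()` the two vectors line up element by element, and there is nothing left in the data to indicate a shift.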

If you have external information about the relative lag of your x and y variables, you must tell R what it is. You could:

add/restore the tsp attribute, e.g. tsp(y_num) <- c(6,105,1). In general you could use

tsp(y_num) <- c(1+lag_val,length(y_num)+lag_val, 1)

use lag() as you suggest above, but with the known degree of lagging: ccf(x, lag(y_num,5), type="covariance") works fine
Sub-optimally, pad the time series with zeros, e.g. ccf(c(rep(0,5),y),x), but this will slightly change the CCF calculation. (You can't pad with NA values, because ccf() uses na.action = na.fail by default.)

Otherwise the only thing R can do is assume that all vectors start at time 1.
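
Here is a rough sketch of the first option, for the case in your app where y arrives as a plain numeric vector. The names `lag_val`, `y_ts` and `peak_lag` are made up for illustration, and `lag_val` is the external information you have to supply yourself:

```r
set.seed(101)
x <- rnorm(100)
y_num <- as.numeric(stats::lag(x, -5) + rnorm(100))  # plain numeric, no tsp

lag_val <- 5   # assumed to be known from outside the data
y_ts <- y_num
tsp(y_ts) <- c(1 + lag_val, length(y_num) + lag_val, 1)

## helper: lag at which the estimated cross-covariance is largest
peak_lag <- function(cc) cc$lag[which.max(cc$acf)]

cc_plain <- ccf(y_num, x, type = "covariance", plot = FALSE)
cc_ts    <- ccf(y_ts,  x, type = "covariance", plot = FALSE)

peak_lag(cc_plain)  # 0: both vectors are assumed to start at time 1
peak_lag(cc_ts)     # 5: the restored tsp attribute shifts y by 5 steps
```

Using `plot = FALSE` makes it easy to inspect the estimated lags numerically; in the app you would drop it (or call `plot()` on the result) to draw the figure.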

I'm going to take another try at this (maybe this should be a separate answer); I think your simulation, and in particular lag(), does not work the way you think it does. lag() does not change the actual values in the vector, it simply informs R that the starting time of the time series is different. Instead, let's simulate more explicitly by making two triangle-wave patterns that are out of phase with each other:

set.seed(101)
x <- rep(c(1:5,5:1),10) + rnorm(100,sd=0.5)
y <- rep(c(5:1,1:5),10) + rnorm(100,sd=0.5)
matplot(cbind(x,y),type="l",col=1:2,lty=1)

Now try ccf():

ccf(x, y, type="covariance", lag.max=10)

It works fine ...

Alternatively you could do something like

 ## pad at the beginning
 y <- c(rep(0,5),x) + rnorm(length(x)+5)
 ## pad at the end
 x <- c(x,rnorm(5))

to simulate.
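
Putting that padding idea together as a self-contained sketch (the names `n`, `shift` and `cc` are illustrative, not from your code): here the values of y really are displaced relative to x, so ccf() finds the lag even though both inputs are plain numeric vectors.

```r
set.seed(101)
n <- 100
shift <- 5

x <- rnorm(n + shift)
## y[t] = x[t - shift] + noise for t > shift; the first `shift` values are padding
y <- c(rep(0, shift), x[seq_len(n)]) + rnorm(n + shift)

is.null(tsp(x)) && is.null(tsp(y))  # TRUE: no time-series attributes involved

cc <- ccf(y, x, type = "covariance", plot = FALSE)
cc$lag[which.max(cc$acf)]           # 5
```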

OK, I understand, but the point is to catch the dependency, so R should be able to catch this without me having to specify it. — user9396820, Aug 15 '18 at 18:51
How do you think R should catch it? Unless you have padding and/or a `tsp` attribute there is **no information in the data that R could possibly use to guess what the lag is**. When you specify `lag(y,0)` you are telling R that there is zero lag, so it's just doing what you asked ... — Ben Bolker, Aug 15 '18 at 20:08
