I'm having some trouble understanding the purpose of causal convolutions. Suppose I'm doing time-series classification with a convolutional network that has a single layer with kernel=2, stride=1, and dilation=1 (i.e. no dilation). Isn't that the same thing as shifting my output back by 1 step?
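Here's a minimal sketch of what I mean (assuming a PyTorch-style 1-D convolution; the tensors and seed are just illustrative): a "causal" conv implemented by left padding produces the same values as an ordinary conv whose output is shifted by one step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 10)   # (batch, channels, time)
w = torch.randn(1, 1, 2)    # kernel_size=2, stride=1, dilation=1

# Causal: pad (kernel_size - 1) zeros on the left, so output t depends only on x_{t-1}, x_t
y_causal = F.conv1d(F.pad(x, (1, 0)), w)

# Non-causal (no padding): output t depends on x_t and x_{t+1}
y_plain = F.conv1d(x, w)

# Apart from the first (zero-padded) step, the causal output is just the
# plain output shifted one step later in time.
print(torch.allclose(y_causal[..., 1:], y_plain))  # True
```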
For larger networks it would be a little more involved: you'd have to take the parameters of all the layers into account to work out the resulting receptive field and apply the proper output shift (see the formula below). But it seems to me that if there is some leak of future information, you could always account for it by shifting the output back.
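If I have this right, for a stack of $L$ layers with stride 1, kernel sizes $k_i$ and dilations $d_i$, the receptive field is

$$R = 1 + \sum_{i=1}^{L} (k_i - 1)\, d_i,$$

so the shift I'd apply to the output is $R - 1$ steps.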
For example, if at time step $t_2$ a non-causal CNN sees $x_0, x_1, x_2, x_3, x_4$, then you'd use the target associated with $t_4$, i.e. $y_4$.
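In code, that pairing is just a slice of the targets (hypothetical names, assuming a receptive field of 5 as in the example above):

```python
receptive_field = 5
y = list(range(100))                       # stand-in targets y_0, y_1, ...
shifted_targets = y[receptive_field - 1:]  # output index 0 <-> y_4, index 1 <-> y_5, ...
print(shifted_targets[:3])                 # [4, 5, 6]
```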
Edit: I've seen diagrams for causal CNNs where all the arrows are right-aligned. I get that it's meant to illustrate that $y_t$ aligns with $x_t$, but couldn't you just as easily draw them like this: