
I think I am missing something rather fundamental with cross correlation. I have two timeseries, x and y (pandas series, with DateTime index, equal length). I would like to check time alignment - e.g. make sure the uppy-downy bits in both timeseries occur at roughly the same time, and shift them into alignment if they are out. For this, I used scipy.signal.correlate to find the lag where correlation between the two timeseries is highest. I used the following (minimal example) code, based on guidance from https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html and https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlation_lags.html

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates
from scipy import signal


def time_alignment_check(x, y):
    """Plot x and y over time plus their cross-correlation, annotated with the lag of maximum correlation."""
    # correlate works on the raw value arrays, so drop missing values first
    x = x.dropna()
    y = y.dropna()

    # full cross-correlation and the lag associated with each output sample
    corr = signal.correlate(x, y)
    lags = signal.correlation_lags(len(x), len(y))
    corr /= np.max(corr)  # normalise so the peak is 1

    fig, (ax_data, ax_corr) = plt.subplots(2, 1, figsize=(7, 8))

    # plot the two series against time, each on its own y-axis
    ln1 = ax_data.plot(x, 'r', label='x')
    ax2 = ax_data.twinx()
    ln2 = ax2.plot(y, 'b', label='y')
    ax_data.set_xlabel('Time')
    ax_data.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))

    # single combined legend for both y-axes
    lns = ln1 + ln2
    labs = [l.get_label() for l in lns]
    ax_data.legend(lns, labs, loc=0)

    # plot the cross-correlation and annotate the lag at which it peaks
    ax_corr.plot(lags, corr)
    ax_corr.annotate(f'max. corr. where lag={lags[np.argmax(corr)]}',
                     xy=(0.35, 0.1), xycoords='axes fraction', fontsize=10, color='k',
                     bbox=dict(facecolor='white', alpha=0.8, ec='k'))
    ax_corr.set_title('Cross-correlated signal')
    ax_corr.set_xlabel('Lag')
    ax_corr.set_xlim([-200, 200])

    ax_data.margins(0, 0.1)
    ax_corr.margins(0, 0.1)
    fig.tight_layout()

    plt.show()
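To make the example self-contained, here is a sketch of how the function might be exercised; the synthetic 1 Hz series and the 26-sample offset below are made up for illustration and are not my actual data.

import numpy as np
import pandas as pd

# two synthetic series on a shared 1 Hz DateTime index; y is the same
# underlying signal offset by 26 samples (signal and offset are illustrative)
idx = pd.date_range('2023-01-01 13:00:00', periods=3600, freq='s')
rng = np.random.default_rng(0)
s = rng.standard_normal(3626)
x = pd.Series(s[:3600], index=idx)
y = pd.Series(s[26:], index=idx)

time_alignment_check(x, y)  # the annotated peak should sit at lag=26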

Running the analysis on the entire timeseries (image 1) yields a lag of 26 seconds, so I shifted y to the right by 26 seconds and re-ran the analysis. I expected the lag to then be 0, but it isn't; it still comes out as 26. Why? When I run the analysis on a smaller chunk/transect of the data, e.g. a 40-minute chunk from 13:00:00 to 13:40:00 (image 2), the lag is 1 second. Shifting y in the smaller chunk by n seconds again does not change the reported lag.

The question Signal correlation shift and lag correct only if arrays subtracted by mean suggests subtracting the mean from the data, which, for a much shorter, less variable dataset (such as the one below), does give the correct lag.

import numpy as np
import pandas as pd

y = pd.Series([1., 1., 1., 1., 2., 2., 1., 1., 1., 1.], index=range(0, 10))
x = pd.Series([1., 1., 1., 1., 1., 1., 1., 2., 2., 1.], index=range(0, 10))
x -= np.mean(x)
y -= np.mean(y)
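Continuing that snippet, a quick sketch (not part of the original data, just the toy series above) showing the lag the demeaned series produce:

from scipy import signal

corr = signal.correlate(x, y)
lags = signal.correlation_lags(len(x), len(y))
print(lags[np.argmax(corr)])  # 3 -- x's bump occurs 3 samples after y's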

Subtracting the mean, however, still yields incorrect results for my data. What am I missing here?

Thanks in advance!

[Image 1: whole timeseries]

[Image 2: timeseries from 13:00:00 to 13:40:00]

1 Answer


Figured it out. For x and y as pandas Series, it comes down to how the data is shifted. If the shift (Series.shift) is applied with freq='s' (seconds), you get an incorrect lag even though, confusingly (for me!), the plot of x and y displays the correct shift by n seconds. This actually makes sense once you read the docs properly: "If freq is specified then the index values are shifted but the data is not realigned." With freq=None, the correct results are obtained.
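A minimal sketch of that index-versus-data distinction (the series and values below are purely illustrative):

import pandas as pd

idx = pd.date_range('2023-01-01 13:00:00', periods=6, freq='s')
y = pd.Series([0., 1., 4., 1., 0., 0.], index=idx)

# freq='s': only the timestamps move; the value array handed to
# scipy.signal.correlate (which never sees the index) is unchanged
print(y.shift(2, freq='s').to_numpy())   # [0. 1. 4. 1. 0. 0.]
print(y.shift(2, freq='s').index[0])     # 2023-01-01 13:00:02

# freq=None: the values themselves move against the original index,
# which is what the correlation actually responds to
print(y.shift(2).to_numpy())             # [nan nan  0.  1.  4.  1.]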