2

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:

p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]

...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.

Are there well known methods for processing such data?

Mastiff
  • Is there something wrong with just taking the mean or median of the intervals? – lxop Aug 27 '20 at 20:19
  • No, that's my go-to for the simplest cases, like when you know there is one and only one unchanging repetition rate in the whole data. Sometimes there is nothing in the data, or it changes, or the phasing changes, e.g. [1, 2, 3, 4, 4.5, 5.5, 6.5, 7.5]. – Mastiff Aug 27 '20 at 20:34
  • https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.acf.html If you need a working code example, let me know. I used this to produce a graph where you can easily see a spike at/around the period (a minimal sketch follows this comment thread). – Willem Hendriks Sep 02 '20 at 07:00
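A minimal sketch of the ACF approach suggested in the last comment, assuming the events are first binned onto a regular grid (the bin width below is an assumption, not from the comment):

import numpy as np
from statsmodels.tsa.stattools import acf

# Event times from the question; bin them so the ACF can be applied.
p = [0, 1.1, 1.9, 3, 3.9, 6.1]
bin_width = 0.1  # assumed; fine enough to resolve a period near 1.0
edges = np.arange(0, max(p) + bin_width, bin_width)
counts, _ = np.histogram(p, bins=edges)

# Autocorrelation of the binned counts; a spike around lag ~10 bins
# (10 * 0.1 = 1.0 time units) would indicate the period.
r = acf(counts, nlags=len(counts) - 1)
print(np.argmax(r[1:]) + 1)  # lag (in bins) of the strongest non-zero-lag peak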

2 Answers

1

A least-squares algorithm may do the trick, if correctly initialized; a clustering method can provide that initialization.

An FFT represents the signal as a sum of sine waves, and the amplitude at each frequency can be viewed as the result of a least-squares fit of that sinusoid to the signal. Hence, if the signal is unevenly sampled, solving the same least-squares problem still makes sense as a way to estimate the Fourier transform; applied to an evenly sampled signal, it reduces to the same result.
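One concrete instance for a point process is to evaluate its periodogram directly on an arbitrary frequency grid, which is closely related to the Lomb-Scargle least-squares spectrum. A minimal sketch, assuming the events are treated as unit impulses, with a frequency grid chosen purely for illustration:

import numpy as np

# Event times from the question, treated as unit impulses (an assumption).
t = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])

# Angular frequencies to scan; the grid and its lower bound (chosen to stay
# above the large lobe around omega = 0) are assumptions.
omega = np.linspace(2.0, 20.0, 4000)

# Periodogram of the point process: |sum_k exp(-i * omega * t_k)|^2.
spec = np.abs(np.exp(-1j * np.outer(omega, t)).sum(axis=1)) ** 2

best_period = 2 * np.pi / omega[np.argmax(spec)]
print(best_period)  # expected to land near 1.0 for this data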

As your signal is discrete, you may want to fit it with a Dirac comb (or a sum of Dirac combs). It seems more sound to minimize the sum of squared distances from each event to the nearest tooth of the comb. This is a non-linear optimization problem, where each comb is described by its period and offset, and it can be solved by means of the Levenberg-Marquardt algorithm. Below is a Python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as described in How to compute standard deviation errors with scipy.optimize.least_squares; it is also covered in the documentation of curve_fit() and in Getting standard errors on fitted parameters using the optimize.leastsq method in python.

Nevertheless, half the period, a third of the period, and so on would also fit, and multiples of the period are local minima, so these must be avoided by refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between event times can be clustered, the cluster featuring the smallest center being the one corresponding to the expected period. As proposed in Clustering values by their proximity in python (machine learning?), the clustering function sklearn.cluster.MeanShift() is applied.

Notice that the procedure can be extended to multidimensional data to look for periodic patterns, or to mixed periodic patterns featuring different fundamental periods.

import numpy as np
import scipy
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth

print(scipy.__version__)

ticks = [0, 1.1, 1.9, 3, 3.9, 6.1]


def crudeEstimate():
    # Look for the period by clustering the pairwise differences between event times.
    diffs = np.zeros(len(ticks) * (len(ticks) - 1) // 2)
    k = 0
    for i in range(len(ticks)):
        for j in range(i):
            diffs[k] = ticks[i] - ticks[j]
            k = k + 1
    # See https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
    X = np.column_stack((diffs, np.zeros(len(diffs))))
    bandwidth = estimate_bandwidth(X, quantile=1.0 / len(ticks))
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_
    print(cluster_centers)
    labels_unique = np.unique(labels)
    n_clusters_ = len(labels_unique)

    for k in range(n_clusters_):
        my_members = labels == k
        print("cluster {0}: {1}".format(k, X[my_members, 0]))
    # The cluster with the smallest center is taken as the fundamental period.
    estimated_period = np.min(cluster_centers[:, 0])
    return estimated_period


def disttoDiracComb(x):
    # Residual vector: distance from each event to the nearest tooth of the comb(s).
    # x holds one (period, offset) pair per comb.
    residual = np.zeros(len(ticks))
    for i in range(len(ticks)):
        mindist = np.inf
        for j in range(len(x) // 2):
            period = x[2 * j]
            offset = x[2 * j + 1]
            index = np.floor((ticks[i] - offset) / period)
            currdist = ticks[i] - (index * period + offset)
            # The nearest tooth may be the next one up.
            if currdist > 0.5 * period:
                currdist = period - currdist
            if currdist < mindist:
                mindist = currdist
        residual[i] = mindist
    return residual


estimated_period = crudeEstimate()
print('crude estimate by clustering :', estimated_period)

# Refine by non-linear least squares (Levenberg-Marquardt), starting from the crude estimate.
xp = np.array([estimated_period, 0.0])
p, pcov, infodict, mesg, ier = leastsq(disttoDiracComb, x0=xp, ftol=1e-18, full_output=True)

# See https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p) ** 2).sum() / (len(ticks) - len(p))
pcov = pcov * s_sq
perr = np.sqrt(np.diag(pcov))

print('estimated period is ', p[0], ' +/- ', 1.96 * perr[0])
print('estimated offset is ', p[1], ' +/- ', 1.96 * perr[1])

Applied to your sample, it prints:

crude estimate by clustering : 0.975
estimated period is  1.0042857141346768  +/-  0.04035792507868619
estimated offset is  -0.011428571139828817  +/-  0.13385206912205957
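Since disttoDiracComb already accepts several (period, offset) pairs, extending the fit to a mixture of two combs only requires widening the initial vector. A hedged sketch (the data and initial guesses below are illustrative assumptions, not from the question):

# Illustrative mixture of two periodic processes (assumed data).
# disttoDiracComb reads the module-level ticks, so it now sees the new data.
ticks = [0, 0.3, 1.1, 1.8, 1.9, 3, 3.3, 3.9, 4.8, 6.1, 6.3]

# One (period, offset) pair per comb; the initial guesses are assumptions.
xp = np.array([1.0, 0.0,   # first comb: period ~1.0, offset ~0.0
               1.5, 0.3])  # second comb: period ~1.5, offset ~0.3
p, pcov, infodict, mesg, ier = leastsq(disttoDiracComb, x0=xp, ftol=1e-18,
                                       full_output=True)
print(p)  # refined (period, offset) pairs for the two combs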
francis
0

It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
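For that simple case, a minimal sketch over the intervals (the variance print anticipates the point about variability made below):

import numpy as np

p = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])
intervals = np.diff(p)       # successive inter-event intervals

print(np.mean(intervals))    # average interval
print(np.median(intervals))  # more robust to the occasional missed detection
print(np.var(intervals))     # spread, an indication of variability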

If you expect that the interval could be changing, then you need some idea of how fast it changes, and you can compute a windowed moving average of the intervals. Knowing how fast it changes lets you select the window size appropriately: a larger window gives a smoother result, while a smaller window is more responsive to a faster-changing rate.
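A minimal sketch of such a windowed moving average over the intervals (the window size is an assumption to tune):

import numpy as np

p = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])
intervals = np.diff(p)

window = 3  # assumed window size; larger = smoother, smaller = more responsive
kernel = np.ones(window) / window
smoothed = np.convolve(intervals, kernel, mode='valid')
print(smoothed)  # one local interval estimate per window position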

If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.

Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.

lxop
  • I take your point, but analogously, if I'm looking for periodic components in a signal using an FFT, peaks become visible. There is a thresholding problem to manage, but peaks can exist far enough above the noise to be virtually impossible to arise by random chance. I'd think something similar exists here. Even in the presence of missing data, false detections and noise, strong periodicities should eventually "pop" if strong enough and analyzed properly. – Mastiff Aug 27 '20 at 22:47
  • Sure. In that case, you can consider the variance. Again it depends on whether you're expecting a time-varying period, but in the simple case of a static period, you can simply decide on a variance threshold below which you are willing to say that the period is real. For a time-varying period, you can calculate variance in the window, and you again just need to decide on a threshold for whether it is meaningful or not. – lxop Aug 27 '20 at 23:00
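A hedged sketch of that variance-threshold idea for the static-period case, with the threshold value as an assumption:

import numpy as np

p = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])
intervals = np.diff(p)

threshold = 0.05  # assumed: maximum interval variance to accept the period as real
if np.var(intervals) < threshold:
    print('period looks real:', np.mean(intervals))
else:
    print('intervals too spread out to call the data periodic')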