5

I have a rather odd question. I am currently working with time-series data and have several peaks in my dataset. This data is gathered using a neutron density log machine and it describes an event recorded by the sensor continuously for some time. The peaks in the data correspond to some interesting interval as this machine goes down the borehole. So, the peak are important. However, it's not just the peaks that are important. It's the entire interval (or at least as I describe it; see my attached figure). Now my question is that is there a signal processing (preferably in Python) method I can use or look into which can allow me to bin this signal into different intervals, each corresponding to local minima/maxima?

My initial approach was using burst detection algorithms as described by Kleinberg 2002, but I had no success with that, so I want to get other people's opinions on this.

This is original data:

graph of proportion of target events vs. time

This is what I want to do:

annotated graph

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
jon_weiser
  • 61
  • 4
  • Could you use ['scipy.signal.hilbert'](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.hilbert.html) to get the upper envelope of your signal and rebin it consequently? – gehbiszumeis Oct 02 '19 at 06:42
  • Welcome to SO, which is about *specific coding* questions and not a recommendation or discussion forum; your question is way too broad, please do take some time to read [How to Ask](https://stackoverflow.com/help/how-to-ask) and [What topics can I ask about here?](https://stackoverflow.com/help/on-topic). – desertnaut Oct 02 '19 at 13:16
  • 2
    My question is about coding. Thank you gehbiszumeis. – jon_weiser Oct 02 '19 at 14:31

2 Answers2

4

As the original data is not available I generated some test data. The approach is as follows. If the data is a set of plateaus with jumps, checking the standard deviation on a moving window, will give peaks at jumps. With a threshold isolate the peaks. Get an estimate for the jump by looking at the data after applying a moving average. This provides a good starting point for fitting. As fit function I use once again my favourite tanh.

The result looks as follows:

import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter
from itertools import groupby
from scipy.optimize import leastsq

def moving_average( data, windowsize, **kwargs ):
    mode=kwargs.pop('mode', 'valid')
    weight = np.ones( windowsize )/ float( windowsize )
    return np.convolve( data, weight, mode=mode)

def moving_sigma( data, windowsize ):
    l = len( data )
    sout = list()
    for i in range( l - windowsize + 1 ):
        sout += [ np.std( data[ i : i + windowsize ] ) ]
    return np.array( sout )

def m_step( x, s, x0List, ampList, off ):
    assert len( x0List ) == len( ampList )
    out = [ a * 0.5 * ( 1 + np.tanh( s * (x - x0) ) ) for a, x0 in zip( ampList, x0List )]
    return sum( out ) + off

def residuals( params, xdata, ydata ):
    assert not len(params) % 2
    off = params[0]
    s = params[1]
    rest = params[2:]
    n = len( rest )
    x0 = rest[:n/2]
    am = rest[n/2:]
    diff = [ y - m_step( x,s, x0, am, off ) for x, y in zip( xdata, ydata ) ]
    return diff 

### generate data
np.random.seed(776)
a=0
data = list()
for i in range(15):
    a = 50 * ( 1 - 2 * np.random.random() )
    for _ in range ( np.random.randint( 5, 35 ) ):
        data += [a]

xx =  np.arange( len( data ), dtype=np.float )

noisydata = list()
for x in data:
    noisydata += [ x + np.random.normal( scale=5 ) ]

###define problem specific parameters
myWindow = 10
thresh = 8
cutoffscale = 1.1

### data treatment
avdata = moving_average( noisydata, myWindow )
avx = moving_average( xx, myWindow )
sdata = moving_sigma( noisydata, myWindow )

### getting start values for the fit
seq = [x for x in np.where( sdata > thresh )[0] ]
# from https://stackoverflow.com/a/3149493/803359
jumps = [ map( itemgetter(1), g ) for k, g in groupby( enumerate( seq ), lambda ( i, x ) : i - x ) ]
xjumps = [ [ avx[ k ] for k in xList ] for xList in jumps ]
means = [ np.mean( x ) for x in xjumps ]
### the delta is too small as one only looks above thresh so lets increase it by guessing
deltas = [ ( avdata[ x[-1] ] - avdata[ x[0] ] ) * cutoffscale for x in jumps ]

guess = [ avdata[0], 2] + means + deltas

### fitting
sol, err = leastsq( residuals, guess, args=( xx, noisydata ), maxfev=10000 )
fittedoff = sol[0]
fitteds = sol[1]
fittedx0 = sol[ 2 : 2 + len( means ) ] 
fittedam = sol[ 2 + len( means ) : ] 

### plotting
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
# ~ ax.plot( data )
ax.plot( xx, noisydata, label='noisy data' )
ax.plot( avx, avdata, label='moving average' )
ax.plot( avx, sdata, label='window sigma' )

for j in means:
    ax.axvline( j, c='#606060', linestyle=':' )

yy = [ m_step( x, 2, means, deltas, avdata[0] ) for x in xx]
bestyy = [ m_step( x, fitteds, fittedx0, fittedam, fittedoff ) for x in xx ]
ax.plot( xx, yy, label='guess' )
ax.plot( xx, bestyy, label='fit' )

plt.legend( loc=0 )
plt.show()

Which gives the following graph:

enter image description here

Some smaller jumps are not detected, but it should be clear that this is expected. Moreover, the entire process is not very fast. Due to the threshold, the initial guess get worse to larger x the more jumps are present. Finally, a sequence of small jumps is not detected such that a resulting slope in the data is fitted as one large plateau with an obvious large error. One could introduce an additional check for this and add positions for jumps to fit accordingly.

Also notice that the choice of the window size determines down to which distance two jumps will be detected as individual ones.

mikuszefski
  • 3,943
  • 1
  • 25
  • 38
0

I extracted data from the scatterplot for analysis, and here is my first cut at the problem with example code to build on. What this does is fit a third-order polynomial to all of the data as a baseline, and then look for data points at a cutoff distance above or below that baseline curve in the portion of the code labeled "rectangularize section". I hope this might at least suggest a path to solution of the problem. One enhancement that came to mind was to implement a minimum width in the "rectangularize" code.

plot

import numpy, matplotlib
import matplotlib.pyplot as plt


##########################################################
# data load section
f = open('/home/zunzun/neutron_refl_bore/rawdata.dat')
rawdata = f.read()
f.close()

xData = []
yData = []
for line in rawdata.split('\n'):
    if line:
        spl = line.split()
        xData.append(float(spl[0]))
        yData.append(float(spl[1]))
xData = numpy.array(xData)
yData = numpy.array(yData)


##########################################################
#rectangularize section
cutoffLimit = 1.6
polynomialOrder = 3 # for baseline curve
modelParameters = numpy.polyfit(xData, yData, polynomialOrder) 
modelPredictions = numpy.polyval(modelParameters, xData)
modelDifference = yData - modelPredictions
maxDiff = max(modelDifference)
minDiff = min(modelDifference)

aboveLimit = (modelDifference > cutoffLimit) * maxDiff
belowLimit = (modelDifference < -cutoffLimit) * minDiff
rectanglarized = modelPredictions +  + belowLimit + aboveLimit

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = numpy.polyval(modelParameters, xModel)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    # now show rectangularized
    axes.plot(xData, rectanglarized)

    axes.set_title('Data Outside Polynomial Baseline +/- Cutoff') # add a title
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
James Phillips
  • 4,526
  • 3
  • 13
  • 11