I have a data set of positions (e.g. the x- or y-position of a movable object). The object moves over time, let's say linearly. The distance between consecutive positions lies within a certain range (e.g. 1 ± 2.0 std). Now, due to data artifacts, jumps may occur: for example, due to overflow, some positions may jump to a completely different value that is clearly out of the ordinary.
I would like to identify the elements in my positions array that are affected by these artifacts.
Consider the following positions which grow linearly with some noise:
import numpy as np
linear_movement = np.arange(0, 100, 1)
noise = np.random.normal(loc = 0.0, scale = 2.0, size = linear_movement.size)
positions = linear_movement + noise
positions[78] = positions[78]+385
Here position 78 is affected by an artifact.
Since 'positions' is not distributed around a fixed value, and the data can vary over the course of the movement such that positions which look like outliers early on are reached regularly later (e.g. if I went from 0 to 1000 according to np.arange(0, 1000, 1)), I can't simply filter out positions based on a median plus some offset (as e.g. here: https://stackoverflow.com/a/16562028 ).
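To make that concrete, here is a quick sketch of how a median-based cutoff misfires on a drifting signal (the offset of 100 is an arbitrary choice of mine):

```python
import numpy as np

np.random.seed(0)  # for reproducibility
positions = np.arange(0, 1000, 1) + np.random.normal(loc=0.0, scale=2.0, size=1000)

# median-based filter: flag everything further than some offset from the median
median = np.median(positions)               # roughly 500 for this data
flagged = np.abs(positions - median) > 100  # offset of 100 chosen arbitrarily
# hundreds of perfectly valid positions near the start and end get flagged
print(flagged.sum())
```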
I would rather look at the distance between consecutive positions and use that to identify outliers:
distance = np.diff(positions)
First problem (which I suppose I could code around in a dirty way if there were only single outliers):
A single outlier in the original positions array produces 2 outliers in the distance array.
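A minimal illustration of that doubling (the threshold of 20 is just a guess, well above normal step sizes but far below the jump):

```python
import numpy as np

np.random.seed(0)  # for reproducibility
positions = np.arange(0, 100, 1) + np.random.normal(loc=0.0, scale=2.0, size=100)
positions[78] += 385  # single artifact

distance = np.diff(positions)
# the one artifact at position index 78 shows up as two spikes in the diff:
# index 77 (jumping away) and index 78 (jumping back)
spikes = np.flatnonzero(np.abs(distance) > 20)
print(spikes)  # [77 78]
```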
Moreover, when there are e.g. 4 consecutive outliers, the distances in between those positions look perfectly normal:
import numpy as np
import matplotlib.pyplot as plt
linear_movement = np.arange(0, 100, 1)
noise = np.random.normal(loc = 0.0, scale = 2.0, size = linear_movement.size)
positions = linear_movement + noise
positions[78:82] = positions[78:82] + 385
# draw
plt.figure()
plt.plot(positions)
distance = np.diff(positions)
distance.astype(int)
Output:
array([ 0, 1, 1, 1, 2, -3, 3, -2, 5, 0, 0,
0, 1, 1, -1, 3, -3, 4, 1, 1, 0, 0,
1, 1, -1, 4, -4, 1, 1, 4, 0, 2, 0,
0, 1, 1, 2, 0, 0, 0, 3, -3, 3, 2,
0, 0, 0, 2, 2, 1, -3, 5, 0, 3, -1,
0, 2, -2, 2, 3, 1, -3, 0, 4, 0, 6,
0, -3, 2, 3, -3, 3, -1, 1, 4, -1, 3,
382, 0, 2, -3, -377, 0, 0, 3, 0, 2, 0,
0, 1, -2, 3, 0, 0, 2, 2, 5, -4, 4])
Things I have noted:
- At every second "big number" in the distance array, the positions return to "normal" (apart from special cases where the positions array starts or ends with outliers).
- When there are multiple consecutive outliers, the distances between the outliers themselves are inconspicuous, which makes identifying them harder.
Is there a smart way, or even a pre-built function, that would take care of something like this? In my experience I often make the problem much more complicated than it really is ...
I could think of noting down the indices of the big numbers, taking every second element (and every second + 1) of those indices, and slicing the positions array accordingly... but that seems messy and would again need special cases for starting and ending with outliers.
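A rough sketch of that pairing idea, assuming each artifact run begins and ends with one big jump (find_artifact_indices and the threshold are my own made-up name/value; a run that starts right at position 0 is still not handled, as noted above):

```python
import numpy as np

def find_artifact_indices(positions, threshold=20.0):
    """Return indices of positions lying between paired large jumps.

    Assumes artifacts come in runs that start with one big jump and
    end with an opposite big jump; the threshold is a guess.
    """
    distance = np.diff(positions)
    jumps = np.flatnonzero(np.abs(distance) > threshold)
    outliers = np.zeros(positions.size, dtype=bool)
    # pair up the jump indices: positions start+1 .. end are the artifact run
    for start, end in zip(jumps[::2], jumps[1::2]):
        outliers[start + 1:end + 1] = True
    if jumps.size % 2:  # unmatched trailing jump: run extends to the end
        outliers[jumps[-1] + 1:] = True
    return np.flatnonzero(outliers)
```

For the 4-consecutive-outliers example above this would give indices 78 through 81, and for a single outlier just index 78, since the pairing covers everything strictly between the outgoing and returning jump.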
Best