I have a set of sensors measuring the same temporal data. At any given time step, each sensor outputs either a 0 or a 1. A sensor will never output two 1s sequentially.

How can we find a best estimate of the true event times, given the available sensors?

For example, say four sensors each output a 1 at the following time indices:

A = [ 178,  511,  843, 1180, 1512, 1733]
B = [ 514,  846, 1182, 1515, 1736, 1937]
C = [ 182,  516,  848, 1517, 1738, 1939]
D = [ 179,  513,  845, 1181, 1513, 1735, 1936, 2124]

From visual inspection, I can see that:

  • A lost a value at the tail of the list
  • B lost a value at the head of the list
  • C lost a value in the middle of the list
  • D has an extra value at the tail of the list

# the None locations are not known to the consensus algorithm
a = [  178,  511,  843, 1180, 1512, 1733, None]
b = [ None,  514,  846, 1182, 1515, 1736, 1937]
c = [  182,  516,  848, None, 1517, 1738, 1939]
d = [  179,  513,  845, 1181, 1513, 1735, 1936] # 2124 removed

# Consensus: Average over columns with `None` removed
# rounded to the nearest integer
s = consensus((A,B,C,D))
s = [  180,  514,  846, 1181, 1514, 1736, 1937]
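
For reference, here is a minimal sketch of that column-wise averaging step, assuming the readings have already been aligned into equal-length rows (the alignment itself is the problem I'm asking about; `column_consensus` is just an illustrative helper, not an existing API):

def column_consensus(aligned):
    # average each column, skipping the missing (None) readings,
    # and round to the nearest integer
    result = []
    for column in zip(*aligned):
        values = [v for v in column if v is not None]
        result.append(round(sum(values) / len(values)))
    return result

print(column_consensus((a, b, c, d)))
# [180, 514, 846, 1181, 1514, 1736, 1937]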

If we were to have two additional sensors E and F with the following values:

E = [ 2130 ]
F = [ 2121 ]
# these two sensors only have the one tail value
# therefore sensor D's extra reading is now part of consensus.
# All other values are unchanged.
s = consensus((A,B,C,D,E,F))
s = [  180,  514,  846, 1181, 1514, 1736, 1937, 2125]

Is there a way to solve this problem that is better than O(n^2)?

Alexander
  • Is there a predictable range of variation around these temporal values? I'm thinking: if another sensor gave a positive reading at 1000, would that be a new value, or would it fall into one of the ~850 or ~1180 groups? If the gaps between groups and the tightness within them are clear, you could test for groupings fairly easily. – katardin Mar 12 '20 at 21:32
  • @katardin Other than no back-to-back outputs from the same sensor, no further information about variation is provided. If `E=[1000]` and we ran `consensus((A,B,C,D,E))`, I would expect this new reading to fall into the ~850 bin. – Alexander Mar 12 '20 at 21:38
  • This appears to be a problem in clustering. I suggest that you merge the lists and look for clusters of values. Group and average the clustered values. Cluster tolerance is your decision. – Prune Mar 12 '20 at 21:40
  • @Prune One challenge with using cluster algorithms is I do not know the number of clusters beforehand. – Alexander Mar 12 '20 at 22:01
  • As I said, you need to design the cluster tolerance. Quantity of clusters is a requirement for only some of the easiest ones. – Prune Mar 12 '20 at 22:05
  • I'll refer you to [this answer](https://stackoverflow.com/questions/11513484/1d-number-array-clustering), where they suggest [Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation), which works for a 1-D variable and for which you do not have to specify the number of bins. The number of bins is an issue here because it doesn't seem like you can infer the actual number of events from your data. – katardin Mar 12 '20 at 21:47
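
For what it's worth, the gap-based grouping suggested in these comments fits in a few lines, assuming a hand-picked tolerance (the `tol=50` below is arbitrary, not derived from the data):

def gap_cluster_consensus(sensors, tol=50):
    # merge and sort all readings, then start a new cluster
    # whenever the gap to the previous reading exceeds tol
    merged = sorted(v for sensor in sensors for v in sensor)
    clusters = [[merged[0]]]
    for v in merged[1:]:
        if v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    # consensus value = rounded mean of each cluster
    return [round(sum(c) / len(c)) for c in clusters]

print(gap_cluster_consensus((A, B, C, D)))
# [180, 514, 846, 1181, 1514, 1736, 1937, 2124]

The sort dominates, so this runs in O(n log n) rather than O(n^2). Note that it keeps D's lone 2124 reading as its own cluster: without sensors E and F, nothing distinguishes it from a real event seen by only one sensor.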

1 Answer


Thank you to the two users in the comments who led me to a working solution.

EDIT: I hesitate to mark this as the definitive answer, because we lose the information that each sensor is unique when we concatenate all the readings into one array. I also suspect an iterative or dynamic-programming approach that tracks the distance to the nearest value per sensor could work.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.signal import find_peaks

# A, B, C, D are the reading lists from the question
concat = A + B + C + D
X = np.array(concat)[:, np.newaxis]

# evaluation grid; peak locations are quantized to this grid
# (points ~2.3 apart here), so they can be off by a step or two
X_plot = np.linspace(0, 1.1 * X.max(), 1000)[:, np.newaxis]

# kernel density estimate over the concatenated readings
kde = KernelDensity(bandwidth=2).fit(X)
log_dens = kde.score_samples(X_plot)  # log densities
dens = np.exp(log_dens)

# each local maximum of the density is one consensus event
peaks, _ = find_peaks(dens)

plt.plot(X_plot[:, 0], dens)
plt.plot(X_plot[peaks], dens[peaks], "X")
plt.show()

print(tuple(int(i) for i in X_plot[peaks].squeeze()))
# (180, 514, 846, 1181, 1513, 1735, 1936, 2123)

[Plot: KDE of the concatenated readings, with the detected consensus peaks marked]
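
Since the printed peaks are quantized to the `X_plot` grid, one possible refinement (my own sketch, not required by the KDE approach) is to snap each raw reading to its nearest peak and average within each group, recovering exact means:

# snap each reading to its nearest density peak, then average
# each group to undo the X_plot grid quantization
peak_locs = X_plot[peaks].squeeze()
groups = [[] for _ in peak_locs]
for v in concat:
    groups[int(np.argmin(np.abs(peak_locs - v)))].append(v)

print([round(sum(g) / len(g)) for g in groups])
# [180, 514, 846, 1181, 1514, 1736, 1937, 2124]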

Alexander