I have a set of data:
(1438672131.185164, 377961152)
(1438672132.264816, 377961421)
(1438672133.333846, 377961690)
(1438672134.388937, 377961954)
(1438672135.449144, 377962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 377963285)
(1438672139.44115, 377963388)
What I need to figure out is how to group them. Until now I've used a super-duper simple approach, just by diffing two of the second part of the tuple and if the diff was bigger than a certain pre-defined threshold I'd put them into different groups. But it's yielded only unsatisfactory results.
But theoretically I imagine, that it should be possible to determine wether a value of the second part of the tuple belongs to the same group or not, by fitting them on one or multiple line, because I know about the first part of the tuple that it's strictly monotenous because it's a timestamp (time.time()
) and I know that all data sets that result will be close to linear. Let's say the tuple is (y, x)
. There are only three options:
- Either all data fits the same equation
y = mx + c
- Or there is only a differing offset
c
or - there is an offset
c
and a differentm
The above set would be one group only. The following group would resolve in three groups:
(1438672131.185164, 377961152)
(1438672132.264816, 961421)
(1438672133.333846, 477961690)
(1438672134.388937, 377961954)
(1438672135.449144, 962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 477963285)
(1438672139.44115, 963388)
group1:
(1438672131.185164, 377961152)
(1438672134.388937, 377961954)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
group2:
(1438672132.264816, 961421)
(1438672135.449144, 962220)
(1438672139.44115, 963388)
group3:
(1438672133.333846, 477961690)
(1438672139.069998, 477963285)
Is there a module or otherwise simple solution that will solve this problem? I've found least-squares in numpy and scipy, but I'm not quite sure how to properly use or apply them. If there is another way besides linear functions I'm happy to hear about them as well!
EDIT 2 It is a two dimensional problem unfortunately, not one-dimensional. For example
(1439005464, 477961152)
should (if assuming for this data a relationship approximately 1:300) would still be first group.