I have a set of data:

(1438672131.185164, 377961152)
(1438672132.264816, 377961421)
(1438672133.333846, 377961690)
(1438672134.388937, 377961954)
(1438672135.449144, 377962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 377963285)
(1438672139.44115, 377963388)

What I need to figure out is how to group them. Until now I've used a super-duper simple approach: just diffing the second parts of two tuples, and if the diff was bigger than a certain pre-defined threshold I'd put them into different groups. But that has yielded only unsatisfactory results.

Theoretically, though, I imagine it should be possible to determine whether a value of the second part of the tuple belongs to the same group or not by fitting the points to one or more lines, because I know that the first part of the tuple is strictly monotonically increasing (it's a timestamp from time.time()) and I know that all resulting data sets will be close to linear. Let's say the tuple is (y, x). There are only three options:

  • Either all data fits the same equation y = mx + c,
  • or the groups differ only in their offset c,
  • or the groups differ in both offset c and slope m.
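As a sanity check on the first case, here is a minimal sketch (assuming numpy is available): fit y = mx + c to the whole set with np.polyfit and look at the residuals; if every residual is small, the set forms one group.

```python
import numpy as np

# The (timestamp, value) pairs from the first set above.
data = [
    (1438672131.185164, 377961152),
    (1438672132.264816, 377961421),
    (1438672133.333846, 377961690),
    (1438672134.388937, 377961954),
    (1438672135.449144, 377962220),
    (1438672136.540044, 377962483),
    (1438672137.172971, 377962763),
    (1438672138.24253, 377962915),
    (1438672138.652991, 377963185),
    (1438672139.069998, 377963285),
    (1438672139.44115, 377963388),
]

y = np.array([t for t, _ in data])  # timestamps (the "y" above)
x = np.array([v for _, v in data])  # counter values (the "x" above)

# Degree-1 polynomial fit: returns (m, c) for y = m*x + c.
m, c = np.polyfit(x, y, 1)
residuals = y - (m * x + c)

print(m)                         # slope in seconds per count
print(np.abs(residuals).max())   # worst deviation from the line
```

For this set the worst residual stays well below a second, which is consistent with all eleven points lying on a single line.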

The above set would be one group only. The following set would resolve into three groups:

(1438672131.185164, 377961152)
(1438672132.264816, 961421)
(1438672133.333846, 477961690)
(1438672134.388937, 377961954)
(1438672135.449144, 962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 477963285)
(1438672139.44115, 963388)

group1:

(1438672131.185164, 377961152)
(1438672134.388937, 377961954)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)

group2:

(1438672132.264816, 961421)
(1438672135.449144, 962220)
(1438672139.44115, 963388)

group3:

(1438672133.333846, 477961690)
(1438672139.069998, 477963285)

Is there a module or another simple solution that will solve this problem? I've found least-squares routines in numpy and scipy, but I'm not quite sure how to use or apply them properly. If there is another way besides linear functions, I'm happy to hear about that as well!

EDIT 2: It is unfortunately a two-dimensional problem, not a one-dimensional one. For example,

(1439005464, 477961152)

should (assuming a relationship of approximately 1:300 for this data) still belong to the first group.
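One possible direction, sketched under assumptions: if a rough slope m is known (here taken as the approximately 1:300 relationship, i.e. 1/300 seconds per count), every point can be reduced to its offset c = y - m*x, which turns this back into the one-dimensional gap clustering from the linked question. The function name group_by_offset and the gap value are made up for illustration.

```python
import numpy as np

def group_by_offset(points, m, gap=1000.0):
    """Group (y, x) points that lie near parallel lines y = m*x + c.

    Each point is reduced to its offset c = y - m*x; offsets within
    `gap` of their sorted neighbour end up in the same group.  `m` and
    `gap` are assumptions: m comes from the known rough rate, and gap
    must exceed the fit noise but stay below the smallest offset
    difference between distinct lines.
    """
    offsets = np.array([y - m * x for y, x in points])
    order = np.argsort(offsets)
    groups = [[points[order[0]]]]
    for prev, cur in zip(order, order[1:]):
        if offsets[cur] - offsets[prev] > gap:
            groups.append([])
        groups[-1].append(points[cur])
    return groups

# The three-group example from the question, plus the EDIT 2 point:
data = [
    (1438672131.185164, 377961152),
    (1438672132.264816, 961421),
    (1438672133.333846, 477961690),
    (1438672134.388937, 377961954),
    (1438672135.449144, 962220),
    (1438672136.540044, 377962483),
    (1438672137.172971, 377962763),
    (1438672138.24253, 377962915),
    (1438672138.652991, 377963185),
    (1438672139.069998, 477963285),
    (1438672139.44115, 963388),
    (1439005464, 477961152),  # EDIT 2 point, expected in group 1
]

groups = group_by_offset(data, m=1 / 300)
print(len(groups))              # 3
print([len(g) for g in groups]) # [2, 7, 3]
```

With m = 1/300 the EDIT 2 point gets nearly the same offset as the first group despite its much later timestamp, so it lands in that group. The result is sensitive to m: an inaccurate slope makes offsets drift for points far apart in time, so m should come from a proper least-squares fit rather than a guess.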

user857990
  • What is the sense of grouping your data like this? Why do you want to use a linear function? What do you mean by "unsatisfactory results"? – miindlek Aug 04 '15 at 08:32
  • Reason for the linear function: because they are likely going to match. Unsatisfactory results: if you only diff, you might diff measured results that were measured too far apart in time, and then the grouping won't work anymore. What is the sense? I need them grouped. – user857990 Aug 04 '15 at 08:47
  • How should the data be grouped? Give some examples of data items which should be in the same group. Do you only need equally sized datasets? – miindlek Aug 04 '15 at 08:50
  • Use a linear fit with a threshold. Start by fitting the first few points. When the next point comes, compare its distance to the line against a defined threshold. If it is over it, start a new line with the next two points. If it is under it, readjust the fit with the last data point. – yevgeniy Aug 04 '15 at 09:15
  • I edited the post. Does this make things clearer? Sorry, the first example alone was not obvious at all. – user857990 Aug 04 '15 at 09:33
  • I guess this is exactly what you want: http://stackoverflow.com/questions/11513484/1d-number-array-clustering – miindlek Aug 04 '15 at 12:59
  • Looks good, but it is unfortunately two-dimensional. Did another edit to give another example. – user857990 Aug 04 '15 at 13:17
  • @yevgeniy – doesn't solve the issue of EDIT 2. What you describe is what I used before. – user857990 Aug 04 '15 at 13:47
  • OK, so if it is two-dimensional, you should search for a clustering algorithm that fits your needs, for example k-means. – miindlek Aug 04 '15 at 15:29
  • That is the answer I'm looking for. I don't know how to solve this. – user857990 Aug 04 '15 at 19:25
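For reference, the sequential fit with a threshold that yevgeniy describes could be sketched roughly like this (the threshold value and the restart rule are guesses at what was meant, not code from the thread):

```python
import numpy as np

def sequential_fit_groups(points, threshold=1.0):
    """Sketch of the comment's suggestion: keep a line fitted to the
    current group; when the next point's vertical distance from that
    line exceeds `threshold` (in seconds, since y is a timestamp),
    start a new group.  `threshold` is a made-up tuning knob.
    """
    groups = [list(points[:2])]
    for y, x in points[2:]:
        current = groups[-1]
        if len(current) >= 2:
            ys = np.array([p[0] for p in current])
            xs = np.array([p[1] for p in current])
            m, c = np.polyfit(xs, ys, 1)  # refit on the current group
            if abs(y - (m * x + c)) > threshold:
                groups.append([(y, x)])   # too far off: start a new line
                continue
        current.append((y, x))
    return groups

# The original single-group set stays together:
data = [
    (1438672131.185164, 377961152),
    (1438672132.264816, 377961421),
    (1438672133.333846, 377961690),
    (1438672134.388937, 377961954),
    (1438672135.449144, 377962220),
    (1438672136.540044, 377962483),
    (1438672137.172971, 377962763),
    (1438672138.24253, 377962915),
    (1438672138.652991, 377963185),
    (1438672139.069998, 377963285),
    (1438672139.44115, 377963388),
]

print(len(sequential_fit_groups(data)))  # 1
```

As noted in the comments, this only splits the stream at change points; it cannot assign interleaved points back to earlier lines, which is the EDIT 2 objection.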

0 Answers