Logic for grouping similar parameters

Question

I am trying to figure out the best logic for grouping parameters within a certain tolerance. It's easier to explain with an example...

Task1: parameter1=140
Task2: parameter1=137
Task3: parameter1=142
Task4: parameter1=139
Task5: parameter1=143

If I want to group tasks if they are within 2 of each other, I think I need to do several passes. For the example, the desired result would be this: Task4 covers Task1, Task2, and Task4 Task3 covers Task3 and Task5

There are multiple possibilities because Task1 could also cover 3 and 4 but then 2 and 5 would be two additional tasks that are by themselves. Basically, I would want the fewest number of tasks that are within 2 of each other.

I am currently trying to do this in excel VBA but I may port the code to php later. I really just don't know where to start because it seems pretty complex.

score 1 · Accepted Answer · answered May 24 '14 at 04:10

You'l need a clustering algorithm i'd assume. Consider the following parameters-

Task1: parameter1=140
Task2: parameter1=142
Task3: parameter1=144
Task4: parameter1=146
Task5: parameter1=148

Depending on your logic the clustering will get weird here. If you simply check each number for numbers near it all of these will be clustered. But does 140 and 148 deserve to be in the same cluster? Try kmeans clustering. There will be some gray area but the result will be relatively accurate.

http://en.wikipedia.org/wiki/K-means_clustering

score 1 · Answer 2 · edited May 23 '17 at 12:27

1

You can group tasks in a single pass if you decide the group boundaries before looking at the tasks. Here's a simple example using buckets of width 4, based on your goal to group tasks within +/-2 of each other:

Dim bucket As Integer

For Each parameter In parameters
    bucket = Round(parameter / 4, 0)

    ' ... do something now that you know what bucket the task is in
Next parameter

If the groups provided by fixed buckets don't fit the data closely enough for your needs, you will need to use an algorithm that makes multiple passes. Since the data in your example is one-dimensional, you can (and should!) use simpler techniques than k-means clustering.

A good next place to look might be Literate Jenks Natural Breaks and How The Idea Of Code is Lost, with a very well commented Jenks Natural Breaks Optimization in JavaScript.

edited May 23 '17 at 12:27

Community

1
1

answered May 24 '14 at 05:12

Will Angley

1,392
7
11

I didn't think it mattered because I would just use the suggested method twice but it actually is two-dimensional. I was trying to make it a simple question to ask. There is a parameter2 that would have similar constraints. The tasks would be grouped by parameter1 of values within 2 of each other and also by parameter2 with values within 2 of each other. Maybe k-means would be the way to go. – dreed75 May 24 '14 at 13:32
In this case, the dimensionality matters; you can *sort* one-dimensional data in a way that you can't for two or more dimensions. Try k-means, and consider using an existing implementation like SciPy's [kmeans](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans) instead of writing your own. This will let you see much more quickly if it gives the results you expect. – Will Angley May 24 '14 at 15:49

Logic for grouping similar parameters

2 Answers2