0

I believe that the question Is there a good way to do this type of mining? could be solved using linear programming techniques. But I am completely new to this and do not know the best way to frame this as a minimization.

Would the following approach be OK?

  • Have a continuous variable for each row and column which is the "length" spanned by all members in that row/column
  • Have a variable for each "point" (each black dot) that indicates whether it is a member of the row or column group
  • Minimize the sum of the first variables

And is there a better way of doing this? Is it possible to somehow frame this as a pure constraint problem (ie without the minimisation)? Do I have my terminology correct? Thanks!

Community
  • 1
  • 1
andrew cooke
  • 45,717
  • 10
  • 93
  • 143

2 Answers2

1

Yes, you could definitely use linear programming for this, but it is hard and I think you have to define your problem more precisely. I have too many questions for a comment, I hope you don't mind I write this as an answer...

Your points can be either in the "column group" or in the "row group". From your proposition above, I understand that you know the number of column groups and row groups in advance?

So you know your groups composition, you just want to find a repartition of the points in those groups in order to minimise the sum of the costs, determined by:

  • The vertical width of the horizontal clusters (c(H) = max (i,j in H) |yi - yj|)
  • The horizontal width of the vertical clusters (c(V) = max (i,j in V) |xi - xj|)

With H an horizontal cluster, V a vertical cluster, and the total cost will be:

c(H1) + c(H2) + ... + c(Hn) + c(V1) + c(V2) + ... + c(Vp)

with n (number of horizontal clusters) and p (number of vertical clusters) known in advance. Is this correct?

For the horizontal groups, you say you can't have "holes". I would represent this as a constraint of your problem, if you can quantify the size of the holes. For instance:

for each i in C, ( min (j in C) |xi - xj|  ) < r

will insure that you don't have a gap of more than r in the horizontal cluster C. Is this what you want? Is r a fixed number?

Is this the complete problem, or do you have other constraints (minimal number of points per group, or something)?

Do you need an exact minimal solution, or a "good" solution would be enough?

Finally, for the technical part, since your previous post was tagged 'python' and this one is not, do you have to use python to solve the model?

Nicolas Grebille
  • 1,332
  • 8
  • 15
  • hi, thanks! the original question wasn't mine, so i don't know all the details - i was just curious. however, i have since been looking at packages like lp_solve and they don't seem to handle things like `min()` or loops. what would be a good (free) way to solve this? i would prefer python or java personally, but would also like to know what would be "global best".... ps the bounty was about to expire, so i went ahead and marked this best answer. also (IMPORTANT) i think the number of clusters is NOT known in advance, and that seems to make this harder.... – andrew cooke Aug 30 '11 at 00:08
  • Thks for the bounty ! Yes, it is (much) harder if you don't know the number of clusters because you can't add a binary variable for each cluster/point pair (true if the point is in the cluster), which is the more natural way to model this... – Nicolas Grebille Aug 30 '11 at 00:14
  • You can't use a min inside the constraints, but you often can use one additional variable: for instance, `min (x in C) f(x)` can be represented by a variable `y` such as: `for each x in C, f(x) >= y` in some cases (e.g. when you `y` have a positive coefficient in the objective of a minimisation problem) – Nicolas Grebille Aug 30 '11 at 00:17
  • For the "loops", those are `for each`, i.e. you can add *all* the constraints for each point: again if you have a binary variable `x(i,C)` such as `x(i,C) = 1 iff i is in C`, the constraint: `for each i in C, f(x) < K` is the same as: `for each point, x(i,C)*f(x) < K`, who can be linearised if `f` is linear: you don't have to use another tool, I think the model above can be transformed in a linear problem without too much pain (for a fixed number of clusters) – Nicolas Grebille Aug 30 '11 at 00:25
  • My ideas on how to handle this with a variable number of groups are at http://www.acooke.org/cute/MixedInteg0.html - but that's as far as I got (my laptop OS just died so I'm a bit busy trying to fix that right now...). Any pointers on some system that is both flexible enough and well documented enough for me to experiment further would be appreciated... – andrew cooke Aug 30 '11 at 00:29
  • It depends... If you are in the academic word, you can have a free license for Cplex thanks to the ibm academic initiative. I don't think it is your case however (from your blog), but if you can write C++ I whould definitely give a shot at coin-or (www.coin-or.org), wich looks like a very good open-source project for linear and mixed-integer solvers. – Nicolas Grebille Aug 30 '11 at 00:46
  • Awesome, thanks (no longer an academic - and never a CS one). – andrew cooke Aug 30 '11 at 00:47
  • PS just thinking about this again let me solve the original problem! if i haven't screwed up then there is an embarrassingly simple solution - http://stackoverflow.com/questions/7076349/is-there-a-good-way-to-do-this-type-of-mining/7237972#7237972 – andrew cooke Aug 30 '11 at 02:07
0

I finally worked out how to represent this question in a linear form. There's a full description in my answer at Is there a good way to do this type of mining? but here is a quick summary:

  • Use binary (0/1) variables for each neighbouring pair on a row, F_i. This will be 1 when the pair are in the same group and 0 otherwise.

  • Use constants S_i to describe the number of spaces between each pair of points.

  • Minimise the sum of two terms:

    • The sum of 1 - F-i. Minimising this pushes pairs together into larger groups.

    • The sum of F_i * S_i. Minimising this separates paris with large spacings.

By altering the relative weighting of the two terms you can change the importance of spacing between horizontal groups.

This relies on an asymmetry in the problem, in which horizontal groups are sensitive to spacing but vertical groups are not.

Community
  • 1
  • 1
andrew cooke
  • 45,717
  • 10
  • 93
  • 143