1

I have a set of elements, which is for example

x= [250,255,273,180,400,309,257,368,349,248,401,178,149,189,46,277,293,149,298,223]

I want to group these into n number of groups A,B,C... such that sum of all group variances is minimized. Each group need not have same number of elements.

I would like a optimization approach in python or R.

Yogesh
  • 1,384
  • 1
  • 12
  • 16
  • 1
    I guess you need to provide more information, e..g,: 1. if all groups are of equal size; 2. variable number of groups or fixed number for the minimal sum....so on and so forth – ThomasIsCoding Nov 27 '19 at 05:03
  • @ThomasIsCoding Edited, fixed number if groups. say 4 or 5. – Yogesh Nov 27 '19 at 05:20
  • If each group should be equal sized then do you not just want the list ordered and split into n number of sub lists? I see you've updated your question... – Iain Shelvington Nov 27 '19 at 05:21
  • 2
    Related, possible inspiration: [1 dimensional clustering](https://stackoverflow.com/a/11516590/903061) – Gregor Thomas Nov 27 '19 at 05:24
  • You can try library(mclust); x=c(250,255,273,180,400,309,257,368,349,248,401,178,149,189,46,277,293,149,298,223); Mclust(x,G=2:10,modelNames="E") – StupidWolf Nov 27 '19 at 08:14
  • You could try a Local Search. For R, perhaps this tutorial https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3391756 helps. – Enrico Schumann Nov 28 '19 at 07:49

1 Answers1

0

I would sort the numbers into increasing order and then use dynamic programming to work out where to place the boundaries between groups of contiguous elements. For example, if the only constraint is that every number must be in exactly one group, work from left to right. At each stage, for i=1..n work out the set of boundaries that produces minimum variance computed among the elements seen so far for i groups. For i=1 there is no choice. For i>1 consider every possible location for the boundary of the last group, and look up the previously computed answer for the best allocation of items before this boundary into i-1 groups, and use the figure previously computed here to work out the contribution of the variance of the previous i-1 groups.

(I haven't done the algebra, but I believe that if you have groups A and B where mean(A) < mean(B) but there are elements a in A and b in B such that a > b, you can reduce the variance by swapping these between groups. So the lower variance must come from groups that are contiguous when the elements are written out in sorted order).

mcdowella
  • 19,301
  • 2
  • 19
  • 25