3

In Mathematica I have a list of point coordinates

size = 50;
points = Table[{RandomInteger[{0, size}], RandomInteger[{0, size}]}, {i, 1, n}];

and a list of cluster indices these points belong to

clusterIndices = {1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1};

what is the easiest way to split the points into two separate lists based on the clusterIndices values?

EDIT: The solution I came up with:

pointIndices =
  Map[#[[2]] &,
    GatherBy[MapIndexed[{#1, #2[[1]]} &, clusterIndices], First],
    {2}];
pointsByCluster = Map[Part[points, #] &, pointIndices];

It there a better way to do this?

Max
  • 19,654
  • 13
  • 84
  • 122

6 Answers6

5

As @High Performance Mark and @Nicholas Wilson said, I'd start with combining the two lists together via Transpose or Thread. In this case,

In[1]:= Transpose[{clusterIndices, points}]==Thread[{clusterIndices, points}]
Out[1]:= True

At one point, I looked at which was faster, and I think Thread is marginally faster. But, it only really matters when you are using very long lists.

@High Performance Mark makes a good point in suggesting Select. But, it would only allow you to pull a single cluster out at a time. The code for selecting cluster 1 is as follows:

Select[Transpose[{clusterIndices, points}], #[[1]]==1& ][[All, All, 2]]

Since you seem to want to generate all clusters, I'd suggest doing the following:

GatherBy[Transpose[{clusterIndices, points}], #[[1]]& ][[All, All, 2]]

which has the advantage of being a one liner and the only tricky part was in selecting the correct Part of the resulting list. The trick in determining how many All terms are necessary is to note that

Transpose[{clusterIndices, points}][[All,2]]

is required to get the points back out of the transposed list. But, the "clustered" list has one additional level, hence the second All.

It should be noted that the second parameter in GatherBy is a function that accepts one parameter, and it can be interchanged with any function you wish to use. As such, it is very useful. However, if you'd like to transform your data as your gathering it, I'd look at Reap and Sow.

Edit: Reap and Sow are somewhat under used, and fairly powerful. They're somewhat confusing to use, but I suspect GatherBy is implemented using them internally. For instance,

Reap[ Sow[#[[2]], #[[1]] ]& /@ Transpose[{clusterIndices, points}], _, #2& ]

does the same thing as my previous code without the hassle of stripping off the indices from the points. Essentially, Sow tags each point with its index, then Reap gathers all of the tags (_ for the 2nd parameter) and outputs only the points. Personally, I use this instead of GatherBy, and I've encoded it into a function which I load, as follows:

SelectEquivalents[x_List,f_:Identity, g_:Identity, h_:(#2&)]:=
   Reap[Sow[g[#],{f[#]}]&/@x, _, h][[2]];

Note: this code is a modified form of what was in the help files in 5.x. But, the 6.0 and 7.0 help files removed a lot of the useful examples, and this was one of them.

rcollyer
  • 10,475
  • 4
  • 48
  • 75
  • +1 This is a much better answer than @Mark Fisher's, and a couple of the others. I'd throw in @Michael Pilat's suggestion of SplitBy too. – High Performance Mark Apr 23 '10 at 13:57
  • Very detailed explanations.Thanks! – Max Apr 23 '10 at 15:54
  • @Max: you're welcome. Another advantage of `Reap` and `Sow`, is that you can `Sow` multiple tags, i.e. if your datum fits into multiple categories, you can group your data as such. To do this, remove the curly braces around `f` in `SelectEquivalents` and have `f` return a list of the tags that the datum falls into. – rcollyer Apr 23 '10 at 16:27
  • To the person who down voted this, why? It is common courtesy to let the person know what about the answer warrants the down vote. Any one going to own up to this? – rcollyer Apr 19 '11 at 01:10
5

Here's a succinct way to do this using the new SplitBy function in version 7.0 that should be pretty fast:

SplitBy[Transpose[{points, clusterIndices}], Last][[All, All, 1]]

If you aren't using 7.0, you can implement this as:

Split[Transpose[{points, clusterIndices}], Last[#]==Last[#2]& ][[All, All, 1]]

Update

Sorry, I didn't see that you only wanted two groups, which I think of as clustering, not splitting. Here's some code for that:

FindClusters[Thread[Rule[clusterIndices, points]]]
Michael Pilat
  • 6,480
  • 27
  • 30
  • There were a lot of list utilities, like this one, that were added in 7.0 that I've never used due to my reliance on `Reap` and `Sow`. But, this is a good one that I'll have to remember. – rcollyer Apr 23 '10 at 14:24
  • At first I've tried to use SplitBy myself, but this doesn't work. Try to execute SplitBy[{1, 1, 1, 2, 2, 1, 1, 2, 2 }] and you will get {{1, 1, 1}, {2, 2}, {1, 1}, {2, 2}} with the length of 4, while I have only two clusters and it need to be 2. – Max Apr 23 '10 at 15:38
  • You still have the problem of clusterIndices = {2,1,...}. A sort is needed to be safe and ensure that the first gathered list represents the first cluster index. – Nicholas Wilson Apr 23 '10 at 19:41
  • Sorry, I misunderstood that you wanted clustering vs. splitting based on the list into smaller clusters at the transition points. Answer updated for doing that too... – Michael Pilat Apr 23 '10 at 19:51
  • Or yes, you could use `GatherBy`, but `FindClusters` is probably faster. – Michael Pilat Apr 23 '10 at 19:56
  • +1 For most legible answer, although the Reap Sow method is much more interesting. – Timo Apr 23 '10 at 20:04
  • +1, for illustrating two things about v 7.0 I did not know. `FindClusters` is definitely an interesting answer. But, looking through the docs, it may cluster indices that are too similar to each other, as the examples under both the `DistanceFunction` and `Method` options illustrates in help. That is not to say it is not useful, and it works fine in this case. But, you have to be judicious in its use. – rcollyer Apr 23 '10 at 20:30
  • @Nicholas: a sort is only necessary if he is using more than 2 indices. With only 2 indices, all he has to do is look at the first index to determine the order in which they will appear. Alternatively, if he uses `Reap` and `Sow`, he can use `{#1, #2}&` for the third param in `Reap`. Then the first element of the list is the index, and all of the points for that index are in the second element. If they are then needed in a specific order, they can then be sorted much more quickly. – rcollyer Apr 25 '10 at 03:38
4

How about this?

points[[
    Flatten[Position[clusterIndices, #]]
    ]] & /@
 Union[clusterIndices]
Mr.Wizard
  • 24,179
  • 5
  • 44
  • 125
Mark Fisher
  • 124
  • 1
  • Interesting, what made you think of this? – Davorak Apr 23 '10 at 01:28
  • Dunno. I use this kind of thing often (position lists to extract what I want). – Mark Fisher Apr 23 '10 at 02:36
  • Insteresting and very different from what others suggest.Thanks – Max Apr 23 '10 at 15:53
  • I suspect that the sort (implicit in `Union`) and the number of list traversals (via both `Position` and `Part`) would cause this to be rather inefficient. However, for a short list, it is definitely an interesting use of `Position`. – rcollyer Apr 23 '10 at 15:58
1

I don't know about 'better', but the more usual way in functional languages would not be to add indices to label each element (your MapIndexed) but instead to just run along each list:

Map[#1[[2]] &, 
 Sort[GatherBy[
   Thread[ {#1, #2} &[clusterIndices, points]],
   #1[[1]] &], #1[[1]][[1]] < #2[[1]][[1]] &], {2}]

Most people brought up in Lisp/ML/etc will write the Thread function out instantly is the way to implement the zip ideas from those languages.

I added in the Sort because it looks like your implementation will run into trouble if clusterIndices = {2[...,2],1,...}. On the other hand, I would still need to add in a line to fix the problem that if clusterIndices has a 3 but no 2, the output indices will be wrong. It is not clear from your fragment how you are intending to retrieve things though.

I reckon you will find list processing much easier if you refresh yourself with a hobby project like building a simple CAS in a language like Haskell where the syntax is so much more suited to functional list processing than Mathematica.

Nicholas Wilson
  • 9,435
  • 1
  • 41
  • 80
  • The part where you combine the two lists can be simplified to `Thread[{clusterindices, points}]`, as the pure function portion (`{#1, #2}`) is just wasted key strokes in this case. – rcollyer Apr 23 '10 at 13:28
1

If I think of something simpler I will add to the post.

Map[#[[1]] &, GatherBy[Thread[{points, clusterIndices}], #[[2]] &], {2}]
Davorak
  • 7,362
  • 1
  • 38
  • 48
1

My first step would be to execute

Transpose[{clusterIndices, points}]

and my next step would depend on what you want to do with that; Select comes to mind.

High Performance Mark
  • 77,191
  • 7
  • 105
  • 161