10

I am having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k elements) data set of the form T={{x1,y1,z1},{x2,y2,z2},....} and I want to bin it into a 2D array of around 100x100 bins, with the bin value being given by the sum of the Z values that fall into each bin.

Currently I am iterating through each element of the table, using Select to pick out which bin it is supposed to be in based on lists of bin boundaries, and adding the z value to a list of values occupying that bin. At the end I map Total onto the list of bins, summing their contents (I do this because I sometimes want to do other things, like maximize).
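Schematically, what I am doing looks something like this (a simplified sketch, not my exact code; here xBounds and yBounds stand for the lists of bin boundaries and T is the data):

(* simplified sketch of the slow approach described above *)
binContents = Table[{}, {Length[xBounds] - 1}, {Length[yBounds] - 1}];
Do[
  Module[{x = T[[i, 1]], y = T[[i, 2]], z = T[[i, 3]], ix, iy},
   ix = First@Select[Range[Length[xBounds] - 1], xBounds[[#]] <= x < xBounds[[# + 1]] &];
   iy = First@Select[Range[Length[yBounds] - 1], yBounds[[#]] <= y < yBounds[[# + 1]] &];
   AppendTo[binContents[[ix, iy]], z]],
  {i, Length[T]}];
binTotals = Map[Total, binContents, {2}];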

I have tried using Gather and other such functions to do this, but the above method was ridiculously faster, though perhaps I am using Gather poorly. Anyway, it still takes a few minutes to do the sorting by my method, and I feel like Mathematica can do better. Does anyone have a nice efficient algorithm handy?

Ben Farmer
  • 1
    Please post the code you are already using, otherwise it is hard to know if a solution with e.g. `Gather` is actually an improvement. – Mr.Wizard Nov 18 '11 at 07:02
  • Let me see if I have this right: you are binning Z values by their corresponding X and Y values, correct? – Mr.Wizard Nov 18 '11 at 07:08
  • Are `x,y,z` reals or integers? If `z` is an integer, there may be simpler ways: `BinCounts[Join @@ (ConstantArray[{#1, #2}, #3] & @@@ data)]` – Szabolcs Nov 18 '11 at 08:58
  • @Szabolcs, I do not understand your comment above. – Mr.Wizard Nov 18 '11 at 09:13
  • @Mr.Wizard I mean that *if* the meaning of `z` is a "count" of something, then we might as well multiply each entry `z` times and use the built-in and fast `BinCounts` function. – Szabolcs Nov 18 '11 at 09:42
  • @Szabolcs, (1) what if Z values are large? (2) is `BinCounts` really faster? It used to be very slow. In v7 it is better, but still not superlative. How is it in v8? – Mr.Wizard Nov 18 '11 at 09:59
  • @Mr.Wizard If `z` is large then this is simply a bad solution!! I knew the old, package-provided `BinCounts` used to be pretty slow, but I thought it got fixed in the new version ... I remember I had a custom implementation of 2D bin counting (from the v5 era) that was much faster for my use case than the one from the package, but I can't find it now. Do you have your own optimized version to test against the built-in v8 `BinCounts`? I'm curious about this. – Szabolcs Nov 18 '11 at 10:08
  • @Szabolcs, no, nothing robust, and not 2D. It would be interesting to explore this, but not right now. – Mr.Wizard Nov 18 '11 at 10:16

4 Answers

12

Here is a method based on Szabolcs's post that is about an order of magnitude faster.

data = RandomReal[5, {500000, 3}];
(*500k values*)
zvalues = data[[All, 3]];

epsilon = 1*^-10;(*prevent 101 index*)
(*rescale and round (x,y) coordinates to index pairs in the 1..100 range*)
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]}, 
    SparseArray[
     gb[[All, 1, 1]] -> 
      Total[gb[[All, All, 2]], {2}]]]; // AbsoluteTiming

Gives about {2.012217, Null}

AbsoluteTiming[
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
 res3 = SparseArray[indexes -> zvalues];
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 0}];
 ]

Gives about {0.195228, Null}

res3 == res2
True

"TreatRepeatedEntries" -> 1 adds duplicate positions up.

  • 2
    could you expand a bit on "adds duplicate positions up"? I am having trouble understanding what it does. – acl Nov 20 '11 at 17:54
  • 2
    If you have rules {2,3}->1, {2,3}->10 and, say, {2,3}->2, the value at mat[[2,3]] will be 1+10+2; "TreatRepeatedEntries" means to add up the values if position entries are repeated. The default behavior of SparseArray is to not do that. "TreatRepeatedEntries"->1 activates that feature - with this you can write very efficient code. –  Nov 20 '11 at 18:07
  • Why are these kinds of things undocumented? You showed SparseArray properties before, and I believe I asked if there were other similar useful undocumented features, but this was not forthcoming. Since these are not in the official documentation, but you are willing to divulge them, would you please consider putting together an index of these hidden but powerful methods? – Mr.Wizard Nov 21 '11 at 01:32
  • You want unfinished options to be documented? No, this is not going to happen. And clearly ->number is not a finished design. But even if that were ->True/False, it still would not capture the full generality conceivable. –  Nov 21 '11 at 10:36
  • 3
    I did show this on several occasions, at least once on MathGroup: http://forums.wolfram.com/mathgroup/archive/2010/Dec/msg00335.html. –  Nov 21 '11 at 10:44
  • 4
    I only noticed this reply now. I wish this were an option to SparseArray instead of a system option. – Szabolcs Nov 25 '11 at 08:53
5

I intend to do a rewrite of the code below because of Szabolcs' readability concerns. Until then, know that if your bins are regular, and you can use Round, Floor, or Ceiling (with a second argument) in place of Nearest, the code below will be much faster. On my system, it tests faster than the GatherBy solution also posted.


Assuming I understand your requirements, I propose:

data = RandomReal[100, {75, 3}];

bins = {0, 20, 40, 60, 80, 100};

Reap[
  Sow[{#3, #2}, bins ~Nearest~ #] & @@@ data,
  bins,
  Reap[Sow[#, bins ~Nearest~ #2] & @@@ #2, bins, Tr@#2 &][[2]] &
][[2]] ~Flatten~ 1 ~Total~ {3} // MatrixForm

Refactored:

f[bins_] := Reap[Sow[{##2}, bins ~Nearest~ #]& @@@ #, bins, #2][[2]] &

bin2D[data_, X_, Y_] := f[X][data, f[Y][#2, #2~Total~2 &] &] ~Flatten~ 1 ~Total~ {3}

Use:

bin2D[data, xbins, ybins]
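For the regular-bin case mentioned at the top, here is a rough sketch of the Floor variant (assuming bins of width 20 covering [0, 100), and reusing the GatherBy/SparseArray pattern from the other answers; it is not a drop-in replacement for bin2D):

width = 20;
keys = Clip[Floor[data[[All, {1, 2}]], width]/width + 1, {1, 100/width}]; (* integer index pairs *)
gb = GatherBy[Transpose[{keys, data[[All, 3]]}], First];
binnedRegular = SparseArray[gb[[All, 1, 1]] -> Total[gb[[All, All, 2]], {2}], {100/width, 100/width}]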
Mr.Wizard
  • @Szabolcs, can you suggest how I might make it more readable? – Mr.Wizard Nov 18 '11 at 08:55
  • I will once I figure out why it doesn't give the same result as mine ... those numbers in `bins`, they are the *centres* of the bins, right? – Szabolcs Nov 18 '11 at 09:03
  • @Szabolcs, As you can see I am using Nearest, so you can deduce that part yourself, but if I am not mistaken, yes. Also, it is entirely possible that it is simply dysfunctional. :o) – Mr.Wizard Nov 18 '11 at 09:14
  • 2
    My first comment was just a cry of frustration, not directly aimed at you :-) But seriously, I do think it's objectively hard to read (despite "easy to read" being a subjective and cultural thing, depending on what we're used to). Take for example the line `Reap[Sow[#1, bins~Nearest~#2] & @@@ #2, bins, Tr@#2 &]`. There are three occurrences of `#2`, each of which are arguments of a different function. There's the misleading use of `Tr` as a shorthand for `Total`. Unorthodox use of inline notation. – Szabolcs Nov 18 '11 at 09:34
  • All these put together, plus the need to consult the docs on features I use less often (like the second arg of `Total`), are just too overwhelming. Of course the refactoring helps. I still think that deciphering this will pose a considerable difficulty for anyone who's not an experienced user. – Szabolcs Nov 18 '11 at 09:37
  • These comments are not about the method, but about the notation. Breaking it down to pieces always helps (you already did that). My only suggestion is to describe what `f` does, including both of its arguments, perhaps through a small out-of-context example. – Szabolcs Nov 18 '11 at 09:39
  • By all means, do keep the method, just please make it less effort for us to understand what's going on :-) – Szabolcs Nov 18 '11 at 09:44
  • @Szabolcs I shall. You did not respond to my question about irregular bins. Do you think that based on the question it is reasonable to assume that bins are regular? – Mr.Wizard Nov 18 '11 at 10:00
  • The OP said "array of around 100x100 bins", which *suggests* regular ones, but he also said "based on lists of bin boundaries" which suggests irregular ones. We have solutions for both, let's wait until he comes back and tells us what he needs. As I understood, the bin-boundary list was an implementation detail, and not a required input. Otherwise I think the fastest regular-bin method would be one that can be `Compile`d to efficient C code. – Szabolcs Nov 18 '11 at 10:11
  • 2
    I'd like to suggest a change. Using `Nearest` is fine, but it must do a lot of preprocessing every time it is used. I'd suggest wrapping everything in `f` with `With[{nn=Nearest[bins]}, ...]` and use `nn[#]` instead of `bins ~Nearest~ #`. This does all the preprocessing required once, and on my machine over 10^6 points it shaves off nearly 7 seconds. – rcollyer Nov 18 '11 at 13:49
  • On my machine, using `NearestFunction` this takes 10.8 secs (17.6 secs, without `NearestFunction`). Szabolcs's solutions take 2.56 secs and 1.99 secs, respectively. So, a distinct improvement, but still not in the same class. – rcollyer Nov 18 '11 at 16:15
  • @rcollyer That's a good suggestion. I forgot about that function. If you use e.g. `Floor[_, 20]` instead of Nearest, how do the timings compare? – Mr.Wizard Nov 18 '11 at 22:43
  • Now running v.8.0.4, 22.96 secs for original form, 11.21 secs for my form, and 7.33 secs for `Floor` form. – rcollyer Nov 19 '11 at 02:35
  • Slower on version 8 then? I have seen other timings go up from v7 to v8, which is one of the reasons I still use v7. – Mr.Wizard Nov 19 '11 at 02:57
  • @Mr.Wizard How could I convert your solution into one which calculates the mean value instead of the sum? – rmma May 21 '14 at 13:12
4

Here's my approach:

data = RandomReal[5, {500000, 3}]; (* 500k values *)

zvalues = data[[All, 3]];

epsilon = 1*^-10; (* prevent 101 index *)

(* rescale and round (x,y) coordinates to index pairs in the 1..100 range *)    
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

(* approach 1: create bin-matrix first, then fill up elements by adding  zvalues *)
res1 = Module[
    {result = ConstantArray[0, {100, 100}]},
    Do[
      AddTo[result[[##]], zvalues[[i]]] & @@ indexes[[i]], 
      {i, Length[indexes]}
    ];
    result
    ]; // Timing

(* approach 2: gather zvalues by indexes, add them up, convert them to a matrix *)
res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]},
    SparseArray[gb[[All, 1, 1]] -> (Total /@ gb[[All, All, 2]])]
    ]; // Timing

res1 == res2

These two approaches (res1 & res2) can handle 100k and 200k elements per second, respectively, on this machine. Is this sufficiently fast, or do you need to run this whole program in a loop?

Szabolcs
  • Is there a reason you are using the long form of `AddTo` and `Part`? – Mr.Wizard Nov 18 '11 at 08:56
  • I fixed `Part`. With `AddTo`, it just avoids some parentheses. I tend to use this form when I put `Set` inside a `Function`, but I don't have strong feelings about the style. – Szabolcs Nov 18 '11 at 09:02
  • This appears to be much faster than what I did, but can you easily handle irregular bins? (I used `Nearest` for this reason.) – Mr.Wizard Nov 18 '11 at 09:08
  • 1
    Total /@ gb[[All, All, 2]] is equivalent to Total[gb[[All, All, 2]], {2}]. –  Nov 20 '11 at 17:26
3

Here's my approach using the function SelectEquivalents defined in "What is in your Mathematica tool bag?", which is perfect for a problem like this one.
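For reference, a definition along these lines (an approximate reproduction; see the linked question for the up-to-date version):

Options[SelectEquivalents] = {TagElement -> Identity, TransformElement -> Identity,
   TransformResults -> (#2 &), MapLevel -> 1, TagPattern -> _, FinalFunction -> Identity};

SelectEquivalents[x_List, OptionsPattern[]] :=
 With[{tagElement = OptionValue@TagElement, transformElement = OptionValue@TransformElement,
   transformResults = OptionValue@TransformResults, mapLevel = OptionValue@MapLevel,
   tagPattern = OptionValue@TagPattern, finalFunction = OptionValue@FinalFunction},
  finalFunction[
   Reap[
     Map[Sow[transformElement@#, {tagElement@#}] &, x, {mapLevel}],
     tagPattern,
     transformResults][[2]]]]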

data = RandomReal[100, {75, 3}];
bins = Range[0, 100, 20];
binMiddles = (Most@bins + Rest@bins)/2;
nearest = Nearest[binMiddles];

SelectEquivalents[
   data
   ,
   TagElement -> ({First@nearest[#[[1]]], First@nearest[#[[2]]]} &)
   ,
   TransformElement -> (#[[3]] &)
   ,
   TransformResults -> (Total[#2] &)
   ,
   TagPattern -> Flatten[Outer[List, binMiddles, binMiddles], 1]
   , 
   FinalFunction -> (Partition[Flatten[# /. {} -> 0], Length[binMiddles]] &)
]

If you want to group according to more than two dimensions, you can use the following function in FinalFunction to reshape the flat result list to the desired dimensions (I don't remember where I found it).

InverseFlatten[l_,dimensions_]:= Fold[Partition[#, #2] &, l, Most[Reverse[dimensions]]];
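For example (dimensions chosen arbitrarily):

InverseFlatten[Range[24], {2, 3, 4}] // Dimensions
(* {2, 3, 4} *)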
faysou
  • As I mentioned on Mr.Wizard's solution, you'll want to use `NearestFunction`s instead of reinvoking `Nearest` every time. In other words, calculate `Nearest[xbins]` and `Nearest[ybins]` first, and you get a non-trivial speed-up. – rcollyer Nov 18 '11 at 14:01
  • Also, why `TransformResults -> (Append[#1, Total[#2]] &)` instead of `TransformResults -> ({#1, Total[#2]} &)`? – rcollyer Nov 18 '11 at 14:03
  • 2
    Because #1 is a list and I want a "plotable" result. – faysou Nov 18 '11 at 14:36
  • Absolutely. I should have seen that myself. – rcollyer Nov 18 '11 at 14:44
  • The OP is talking about lists of bin *boundaries*. You are using lists of bin *midpoints*. – Sjoerd C. de Vries Nov 18 '11 at 21:42
  • @Sjoerd I made the same mistake. But, the OP has not returned to clarify the question, so I still do not know if we are dealing with regular or irregular bins. – Mr.Wizard Nov 18 '11 at 22:44
  • @Mr.Wizard I noticed, but your list of comments was already pretty long ;-) BTW did I tell you – Sjoerd C. de Vries Nov 18 '11 at 23:02
  • @Mr.Wizard Hah! Got you pondering there, didn't I? Got to get some sleep first. Maybe tomorrow ;-) – Sjoerd C. de Vries Nov 18 '11 at 23:21
  • @Sjoerd I corrected my answer to match the bin definition. Also I managed to use the idea of Mr.Wizard to have a multi-dimensional grouping using the pattern argument of Reap, thus I use one Reap instead of several like Mr.Wizard. I already had this problem in the PivotTable question I had asked, I'm happy to have found how to do it using SelectEquivalents as I tend to often use this function. – faysou Nov 18 '11 at 23:31