12

Let's say I have a 2D accumulator array in java int[][] array. The array could look like this:

(x and z axes represent indexes in the array, y axis represents values - these are images of an int[56][56] with values from 0 ~ 4500) array sample 1

or

array sample 1

What I need to do is find peaks in the array - there are 2 peaks in the first one and 8 peaks in the second array. These peaks are always 'obvious' (there's always a gap between peaks), but they don't have to be similar like on these images, they can be more or less random - these images are not based on the real data, just samples. The real array can have size like 5000x5000 with peaks from thousands to several hundred thousands... The algorithm has to be universal, I don't know how big the array or peaks can be, I also don't know how many peaks there are. But I do know some sort of threshold - that the peaks can't be smaller than a given value.

The problem is, that one peak can consist of several smaller peaks nearby (first image), the height can be quite random and also the size can be significantly different within one array (size - I mean the number of units it takes in the array - one peak can consist from 6 units and other from 90). It also has to be fast (all done in 1 iteration), the array can be really big.

Any help is appreciated - I don't expect code from you, just the right idea :) Thanks!


edit: You asked about the domain - but it's quite complicated and imho it can't help with the problem. It's actually an array of ArrayLists with 3D points, like ArrayList< Point3D >[][] and the value in question is the size of the ArrayList. Each peak contains points that belong to one cluster (plane, in this case) - this array is a result of an algorithm, that segments a pointcloud . I need to find the highest value in the peak so I can fit the points from the 'biggest' arraylist to a plane, compute some parameters from it and than properly cluster most of the points from the peak.
Jaa-c
  • 5,017
  • 4
  • 34
  • 64
  • 1
    What is the definitive separator between peaks? That is, when do two peaks belong together (as in ex. 1) and when do you want them to be separate (as in ex. 2)? – DerMike Mar 06 '12 at 16:00
  • @JamesMontagne: I've tried some bruteforce algorithms that do not work well, I just think there could be some sort clever solution I don't see :) – Jaa-c Mar 06 '12 at 16:07
  • It does not have an exact definition, that's kinda the point to define a peak and a separator... Peak is simply defined by several values in neighborhood cells, that are bigger than the values around. The gap between can be quite random, but there should be always several cells with low values (lets say 1000x smaller) between peaks. – Jaa-c Mar 06 '12 at 16:11
  • Are there speed constraints? 25 million items is a lot, but not so large it couldn't be brute-forced in a few seconds. Or are you simply asking *how* to brute-force it? – BlueRaja - Danny Pflughoeft Mar 06 '12 at 16:12
  • @BlueRaja-DannyPflughoeft: I'm asking how to find all the correct peaks... It's not that easy as go through all the values... – Jaa-c Mar 06 '12 at 16:22
  • @Jaa-c: Er, sure it is. See my answer. – BlueRaja - Danny Pflughoeft Mar 06 '12 at 16:33
  • If you definitely want all the peaks, take a look at my answer. But as it is a brute-force algorithm, it is very sensitive to the exact parameters (minimum distance of peaks, number of elements over the minimum treshold and so on), so it may not work in practice. – biziclop Mar 06 '12 at 16:34
  • @Jaa-c My answer would be O(n), you really can't get better without potentially missing some peak. – NominSim Mar 06 '12 at 16:53

3 Answers3

7

He's not interested in estimating the global maximum using some sort of optimization heuristic - he just wants to find the maximum values within each of a number of separate clusters.

These peaks are always 'obvious' (there's always a gap between peaks)

Based on your images, I assume you mean there's always some 0-values separating clusters? If that's the case, you can use a simple flood-fill to identify the clusters. You can also keep track of each cluster's maximum while doing the flood-fill, so you both identify the clusters and find their maximum simultaneously.

This is also as fast as you can get, without relying on heuristics (which could return the wrong answer), since the maximum of each cluster could potentially be any value in the cluster, so you have to check them all at least once.


Note that this will iterate through every item in the array. This is also necessary, since (from the information you've given us) it's potentially possible for any single item in the array to be its own cluster (which would also make it a peak). With around 25 million items in the array, this should only take a few seconds on a modern computer.

BlueRaja - Danny Pflughoeft
  • 84,206
  • 33
  • 197
  • 283
  • 1
    That's more or less what I need - identify the clusters and then find the maximum within the cluster. And yes, I have some sort of 0-values, I guess I can find some treshold that would work. I'll check the method you mentioned, it seems interesting. – Jaa-c Mar 06 '12 at 16:46
  • If you look at this as an image-processing problem, then this is thresholding (with a threshold condition of >0), followed by [connected component colouring](http://en.wikipedia.org/wiki/Connected-component_labeling). That algorithm is, i think, simpler than a flood fill. I think a simple modification of the disjoint-set structure would let you track the maximum value in the sets; you might even be able to dispense with the set structure if the maximum is all you need. – Tom Anderson Mar 06 '12 at 17:04
  • @Tom: I disagree that that is simpler than flood-fill, but both would work, they are both extremely simple, and both would be about the same speed. – BlueRaja - Danny Pflughoeft Mar 06 '12 at 17:44
  • @BlueRaja-DannyPflughoeft: So if I understand you correctly, you suggest to iterate through the array and if a value > treshold is found, than use floodfill to identify the cluster, right? – Jaa-c Mar 06 '12 at 18:25
  • @Jaa-c: Right. You should also keep track of which clusters have already been identified, so two values > threshold from the same cluster don't make you unnecessarily do a flood-fill twice. – BlueRaja - Danny Pflughoeft Mar 06 '12 at 18:31
  • @BlueRaja-DannyPflughoeft: I'm not sure if there is an efficient way how to exclude already found clusters from the current search... But I guess I can handle the rest... I'm gonna try it and I'll get back. Thanks. – Jaa-c Mar 06 '12 at 19:08
  • @Jaa-c: Create a bit-array with as many entries as the main-array. Every time you visit entry `(x,y)` in the main array (either while iterating or while flood-filling), set entry `(x,y)` to `true` in the bit-array. Check the bit-array before doing any flood-filling, to see if you've visited that entry before. – BlueRaja - Danny Pflughoeft Mar 06 '12 at 19:09
  • @BlueRaja-DannyPflughoeft: that's an option, but if the initial array is big, it takes a lot memory to allocate one more array with the same size... I'll see what I can do, I can try to use BitSet or something like that... – Jaa-c Mar 06 '12 at 19:36
  • @Jaa-c: Not the same size, just with `mainArray.Length` number of bits. So for a 25-million item array, the bit-array would be about 3 MB – BlueRaja - Danny Pflughoeft Mar 06 '12 at 19:39
  • @BlueRaja-DannyPflughoeft: I'm just stupid and can't count to 8. Thanks! – Jaa-c Mar 06 '12 at 19:45
  • Flood-fill involves managing a queue. But then connected components involves managing a union-find structure. So, no, perhaps it's not simpler. – Tom Anderson Mar 07 '12 at 00:06
2

This might not be an optimal solution, but since the problem sounds somewhat fluid too, I'll write it down.

  1. Construct a list of all the values (and coordinates) that are over your minimum treshold.
  2. Sort it in descending order of height.
  3. The first element will be the biggest peak, add it to the peak list.
  4. Then descend down the list, if the current element is further than the minimum distance from all the existing peaks, add it to the peak list.

This is a linear description but all the steps (except 3) can be trivially parallelised. In step 4 you can also use a coverage map: a 2D array of booleans that show which coordinates have been "covered" by a nearby peak.

(Caveat emptor: once you refine the criteria, this solution might become completely unfeasible, but in general it works.)

biziclop
  • 48,926
  • 12
  • 77
  • 104
  • This has 2 problems - firts is how to define the 'minimum distance' in step 4. In the first image, it would have to be much larger than in the second array. Second, this would be quite slow. – Jaa-c Mar 06 '12 at 16:28
  • @Jaa-c Yes, it is slow. But how slow it is depends on your data and an exact formulation of the problem. Finding hundreds of thousands of peaks isn't going to be quick, whichever way you're doing it. But I don't really understand the minimum distance problem. Surely the "minimum distance between peaks" is a parameter that you have to set. – biziclop Mar 06 '12 at 16:47
  • It's true, that there will not be many peaks, but they can consist from a large number of values higher than threshold. And I simply don't know, how I would set the minimum distance. In the first image, there are several local maximums in one peak, but the distance between them is about the same as the distance between 3 closest peeks in the second case. – Jaa-c Mar 06 '12 at 16:56
  • @Jaa-c Okay, reading through more recent comments I realised that by "peak" you mean both an island and the peak itself. In that case go with a variation of flood fill. I thought you can have two peaks on one island if they're far enough from each other. – biziclop Mar 06 '12 at 16:59
  • That's right, I'm sorry for the confusion. I'm reading about the flood fill right now, seems promising... – Jaa-c Mar 06 '12 at 17:05
1

Simulated annealing, or hill climbing are what immediately comes to mind. These algorithms though will not guarantee that all peaks are found.

However if your "peaks" are separated by values of 0 as the gap, maybe a connected components analysis would help. You would label a region as "connected" if it is connected with values greater than 0(or if you have a certain threshold, label regions as connected that are over that threshold), then your number of components would be your number of peaks. You could also then do another pass of the array to find the max of each component.

I should note that connected components can be done in linear time, and finding the peak values can also be done in linear time.

NominSim
  • 8,447
  • 3
  • 28
  • 38
  • I'm not the downvoter, but i'd give -1 for hill climbing (because i don't think it straightforwardly helps you identify all the peaks, without also finding sub-peaks that should be ignored) and +1 for connected components analysis. – Tom Anderson Mar 06 '12 at 17:25
  • 2
    @TomAnderson Yeah, that's why I didn't answer with hill climbing, but rather mentioned that that is what comes to mind with this problem. – NominSim Mar 06 '12 at 17:27
  • I don't see a downvote - maybe someone just retracted their vote. – BlueRaja - Danny Pflughoeft Mar 06 '12 at 17:48
  • @NominSim: This is a valid answer, I used the flood-filling method, which is quite similar. +1 at least... – Jaa-c Mar 07 '12 at 01:08