0

Setup:

Visualize a large array of numbers where each number represents a height of a bar on a bar graph.

Ex: [5, 4, 3, 7, 2, 3, 1, 12]

       █
       █
       █
       █
       █
   █   █
   █   █
█  █   █
██ █   █
████ █ █
██████ █
████████

Analyze:

This is the bar graph of the previous numbers. What I need to find is the area containted in the graph in number of open (or unfilled) units.

Solution (Pt.1):

To go about this I made an algorithm to calculate all the peaks in the array.

This returns : [5, 7, 3, 12] as well as another list with the indices of each entry, [0,3,5,7]

To us, there are only three important peaks to find the area. The 5, the 7, and the 12. We can then break it down like this.

The amount of open area in between the 5 and 7 is (general rule):

(([Index Of Larger] - [Index Of Smaller] - [1])*[SmallerValue]) - [Values Of All In B/W]

So the area of the first section would be (2*5) - (4+3) or 10-7 or 3. THis makes sense because if you look at the graph you see there is an empty L shaped section that you could fit 3 units of say, water in without it overflowing. If you repeat this with the second section you get its correct area as well.

My issue is developing an algorithm to go from ALL PEAKS to the IMPORTANT PEAKS.

Misleading:

In this case it is extremely easy to see how that could be done. You simply write an algo to find that the 3 is smaller than the 7 and the 12 so get rid of it and return a refined version of the peaks.

However it is not always that simple.

More Difficult Example:

I have an array:

[5, 4, 3, 7, 2, 3, 1, 12, 9, 10, 5, 3, 6, 8, 5, 6, 4, 7, 6, 9, 4, 11, 11, 4, 1, 2, 1]

Running it through a basic custom peak finding algorithm O(N) It returns:

[5, 7, 3, 12, 10, 8, 6, 7, 9, 11, 11, 4, 2]

In this example, we see the same problem in the first part of this question, however, after the 12 in this peak list, a human can easily see that the next most important peak to look at are the two 11s, the 4, and 2. So I need a way to go from:

[5, 7, 3, 12, 10, 8, 6, 7, 9, 11, 11, 4, 2]

To:

[5, 7, 12, 11, 11, 4, 2]

The above array is a list of the 'important' peaks that are necessary to find the area and again visualize the open blocks as if they'd contain water or something so that they are limited to the lowest immediate peak before overflowing.

To better visualize this more full, second example I have a picture of the graph and all of its peaks and data points here.

Thank you.

Dominic Farolino
  • 1,362
  • 1
  • 20
  • 40
  • So just to be clear, you have an O(NxM) algorithm, where N is the number of bars and M is the maximum bar height. But you want an O(N) algorithm. Is that correct? – senderle Jan 03 '15 at 20:34
  • No, I am not worried about the bar height. The only thing I am concerned with is going from a list with `All the peaks` to a list with `All the important peaks`. I hope that makes sense. – Dominic Farolino Jan 03 '15 at 20:35
  • It doesn't make sense. Isn't all the unfilled units just `max(N)*len(N)-sum(N)` where `N` is the list? – Mark Tolonen Jan 03 '15 at 20:36
  • For example, `[5, 7, 3, 12, 10, 8, 6, 7, 9, 11, 11, 4, 2]`...I need an algorithm smart enough to say at the 3 doesn't matter because its lower than 7 and 12, and the 10, 8, 7, 6 and 9 don't matter because they fall in between 12 and the first 11 – Dominic Farolino Jan 03 '15 at 20:37
  • Well, I see a very obvious algorithm that is O(NxM) that gives correct results. Just scan each row, marking the first bar that appears in the row. Then when a second bar appears, count each of the cells between them. Then mark the second bar, and if a third bar appears, count each of the cells between those. And so on. Stop when there's only one bar left. (So really it's the next-to-maximum bar height.) Is that not a good enough solution? – senderle Jan 03 '15 at 20:47
  • Actually, @MarkTolonen solution wouldn't work... – Dominic Farolino Jan 03 '15 at 20:58
  • 1
    @senderle The inefficiency is unacceptable for this problem as the max value could be in the thousands – Dominic Farolino Jan 03 '15 at 21:01
  • @DomFarolino in [1, 2, 4] does 2 matter? – Alleo Jan 03 '15 at 21:17
  • @Alleo if that is a full array then nothing matters because all of the water would just fall of the left and it couldn't hold anything. – Dominic Farolino Jan 03 '15 at 21:26

3 Answers3

1

You could solve this problem by considering two values:

  1. Max peak so far, starting from the left
  2. Max peak so far, starting from the right

And don't accept a peak if it is inferior to both, because it will be under water.

Rémi
  • 3,705
  • 1
  • 28
  • 39
1

I think this handles all conditions but all the maximum calculations will slow it down for large data sets. I used IPython Notebook to graph it. It is basically @Rémi's idea:

For any data point:

  1. Take the maximum point to the left and the maximum point to the right. a. If at the ends assume zero.
  2. Take the minimum of the two maximum points.
  3. If the datapoint is below that minimum, it is underwater and return the difference else zero.

It could be optimized by computing the left maximum as it scans to the right, and computing the right maximums for each position ahead of time in a single pass from right to left.

The algorithm as is took about 4.1 seconds to do 10,000 data points on my system.

The unfilled area (yellow) will be sum(C):

%matplotlib inline
import matplotlib.pyplot as plt
import random

def contribution(L,i):
    max_left = 0 if i==0 else max(L[:i])
    max_right = 0 if i==len(L)-1 else max(L[i+1:])
    lower = min(max_left,max_right)
    return 0 if lower < L[i] else lower - L[i]

N = [random.randint(0,12) for i in range(50)]
C = [contribution(N,i) for i in range(len(N))]

ind = list(range(len(N))) # the x locations for the groups
width = 1                 # the width of the bars: can also be len(x) sequence

p1 = plt.bar(ind, N, width, color='r')
p2 = plt.bar(ind, C, width, color='y',bottom=N)

enter image description here

Edit

Here's a faster version that implements the optimization I mentioned above. It computes one million datapoints in 1.33 seconds, but uses a smaller amount for graphing below. I don't see how it could be done in one pass, given that a cell needs to know the maximum to its left and right and there could be multiple points equal to the maximum in either direction.

%matplotlib inline
import matplotlib.pyplot as plt
import random

def right_maximums(L):
    '''Given list L, compute [max(L[i+1:] for i in range(len(L)-1)]+[0] more efficiently.
    
    This gives the maximum cell to the right of the current cell.
    Example: [1,2,3,4,5,4,3,2,1] -> [5,5,5,5,4,3,2,1,0]
    '''
    N = [0]
    for i,v in enumerate(L[:0:-1]):
        N.append(max(N[i],v))
    return N[::-1]

def contribution(N):
    '''In a bar graph of data N, compute how much "water" a data valley, assuming water
    spills off the sides of the bar graph.
    '''
    rmaxs = right_maximums(N) # compute maximums to the right of a data point in advance.
    lmax = 0 # compute maximums to the left as we go.
    C = []
    for i,v in enumerate(N):
         # find the lower of the left and right maximum.
        lower = min(lmax,rmaxs[i])
        # if the data point is higher than the maximums, it won't hold water,
        # else it holds the difference between the lower maximum and its value.
        C.append(0 if lower < v else lower - v)
        lmax = max(lmax,v)
    return C

N = [random.randrange(0,50) for i in range(50)]
C = contribution(N)

ind = list(range(len(N))) # the x locations for the groups
width = 1                 # the width of the bars: can also be len(x) sequence

p1 = plt.bar(ind, N, width, color='r')
p2 = plt.bar(ind, C, width, color='y',bottom=N)
Community
  • 1
  • 1
Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
  • Yeah, this seems promising but the slicing really slows it down I bet. There must be a way to do it without slicing. Seems like there's a pretty elegant 1-pass solution in here. (Deleted some obsolete comments, BTW.) – senderle Jan 04 '15 at 04:56
  • 1
    @senderle the duplicate posted in the question comments does it in one pass, but just computes the total, not per column. – Mark Tolonen Jan 04 '15 at 05:06
  • this is _slow_ since this is O(N^2), while there is obvious O(n) solution pointed by Eyal. And see my comment for his answer. – Alleo Jan 04 '15 at 09:37
  • @Alleo, yes the code is O(N^2), but the algorithm as described is the basis for a one-pass O(N), O(1) space solution that's better even than the one posted by J.S. Sebastian at the duplicate -- which I didn't see before. Regretfully, I now have to wield the mjollnir... – senderle Jan 04 '15 at 17:30
  • See [here](http://stackoverflow.com/a/27768575/577088) for what I mean about a one-pass solution that's better (in the sense of using only O(1) space) than the one already posted at the duplicate. It can process ten million datapoints in 3 seconds. That's what I _thought_ you were describing, though after looking at your code more closely, you seem to be doing something else entirely. – senderle Jan 04 '15 at 18:45
  • I updated with a faster version that performs the optimization I mentioned. `numpy` would make it even faster. I don't think it can be done correctly in one pass, that's why I graph it to make sure it handles random data correctly. – Mark Tolonen Jan 05 '15 at 07:49
0

It can be done in 3 passes:

  public static int areaContained(int[] arr) {
    int[] maxL = new int[arr.length];
    int[] maxR = new int[arr.length];

    int max = 0;
    for (int i = 0; i < arr.length; i++) {
      max = Math.max(arr[i], max);
      maxL[i] = max;
    }

    max = 0;
    for (int i = arr.length - 1; i >= 0; i--) {
      max = Math.max(arr[i], max);
      maxR[i] = max;
    }

    int total = 0;
    for (int i = 0; i < arr.length; i++) {
      int areaI = Math.min(maxL[i], maxR[i]) - arr[i];
      if (areaI > 0)
        total += areaI;
    }

    return total;
  }

The main idea is that the contribution of bar i is determined by the combination of arr[i], max value after i, and max value before i.

Eyal Schneider
  • 22,166
  • 5
  • 47
  • 78
  • python one-liner for that: `numpy.minimum(numpy.maximum.accumulate(x), numpy.maximum.accumulate(x[::-1])[::-1]).sum() - numpy.sum(x)` :) – Alleo Jan 04 '15 at 00:26