
What is the fastest way to calculate an x-second rolling average of an array in Ruby?

I have two arrays of data from a bicycle ride. Each entry in time is the second at which the corresponding speed value was read during the ride. You'll notice that the readings were not taken every second, so I don't believe I can simply advance the rolling window by one element each time.

speed = [0, 15, 17, 19, 18, 22, 24, 28, 22, 17, 16, 14, 15, 15, 15, 0, 15, 19, 21, 25, 26, 24, 24] 
time = [0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 22, 23, 25, 26, 27] 

I have tried something like the following (it calculates a rolling 5-second average and puts it into an array), but it's pretty slow for large arrays and multiple intervals: it takes about 8 minutes to calculate all the interval lengths (1..3600) for a 1-hour bike ride.

# simple arithmetic mean helper used below
def mean(values)
    values.reduce(:+) / values.length.to_f
end

duration = time.max

interval_average = []
time_hash = Hash[time.map.with_index.to_a] # maps each recorded second to its index in speed

roll_start = 0
roll_stop = 5

for i in 1..(duration+1) do
    begin
        # indices of the readings at the window edges; nil when nothing
        # was recorded at that exact second
        start = time_hash[roll_start]
        stop = time_hash[roll_stop]

        rolling_array = speed[start..stop]

        avg_value = mean(rolling_array)

        interval_average.push(avg_value)
    rescue
        # a missing boundary second makes this window fail; push 0 (see note below)
        interval_average.push(0)
    end

    roll_start += 1
    roll_stop += 1
end

In my own code I'm rescuing those exceptions and pushing 0 instead (as above), since I'm really just interested in finding the max of the x-second averages in the end.

  • `speed[start..stop]` is going to allocate a sub-array, which is probably causing some substantial GC thrash. Your goal should probably be to eliminate allocations where possible; reuse of intermediate arrays will produce substantial benefits. – Chris Heald Apr 07 '15 at 20:54
  • @ChrisHeald I doubt that allocations are the culprit here. `arr = 10_000_000.times.to_a; Benchmark.measure { 1_000_000.times { arr[100..-2] } }.real #=> 0.17680915212258697` – Ismael Abreu Apr 07 '15 at 21:05
  • Start by profiling your code to see where the time is going (e.g. ruby-prof). – Frederick Cheung Apr 07 '15 at 21:45
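
A minimal ruby-prof invocation along the lines of that last comment might look like the sketch below. It assumes the ruby-prof gem is installed, and rolling_averages_for is just a hypothetical wrapper around the loop shown above (stand it in for whatever block you want to measure):

require 'ruby-prof'

# profile one pass of the rolling-average calculation to see where the time goes
result = RubyProf.profile do
  rolling_averages_for(speed, time, 5) # hypothetical wrapper around the loop above
end

RubyProf::FlatPrinter.new(result).print(STDOUT)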

1 Answer

I'm not sure about its performance, but here's another approach that you can test for finding the maximum of rolling averages over some fixed length of time.

speed = [0, 15, 17, 19, 18, 22, 24, 28, 22, 17, 16, 14, 15, 15, 15, 0, 15, 19, 21, 25, 26, 24, 24] 
time = [0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 22, 23, 25, 26, 27] 

interval_length = 5 # seconds

speed.zip(time)                                                     # 1
     .each_cons(interval_length)                                    # 2
     .select { |i| i.last.last - i.first.last == interval_length }  # 3
     .map { |i| i.map(&:first).reduce(:+) / interval_length.to_f }  # 4
     .max                                                           # 5

Breaking it down into components with intermediate results:

1) Pair each speed reading with the time it was taken.

# => [[0, 0], [15, 1], [17, 2], [19, 3], [18, 5], [22, 6], [24, 7], [28, 8], [22, 10], [17, 11], [16, 12], [14, 13], [15, 15], [15, 16], [15, 17], [0, 18], [15, 20], [19, 21], [21, 22], [25, 23], [26, 25], [24, 26], [24, 27]]

2) Section off the above into consecutive runs of interval_length readings, in this case 5. This gives you an Enumerator, but calling to_a on it shows that the intermediate result looks like this:

# => [[[0, 0], [15, 1], [17, 2], [19, 3], [18, 5]], [[15, 1], [17, 2], [19, 3], [18, 5], [22, 6]], [[17, 2], [19, 3], [18, 5], [22, 6], [24, 7]], [[19, 3], [18, 5], [22, 6], [24, 7], [28, 8]], [[18, 5], [22, 6], [24, 7], [28, 8], [22, 10]], [[22, 6], [24, 7], [28, 8], [22, 10], [17, 11]], [[24, 7], [28, 8], [22, 10], [17, 11], [16, 12]], [[28, 8], [22, 10], [17, 11], [16, 12], [14, 13]], [[22, 10], [17, 11], [16, 12], [14, 13], [15, 15]], [[17, 11], [16, 12], [14, 13], [15, 15], [15, 16]], [[16, 12], [14, 13], [15, 15], [15, 16], [15, 17]], [[14, 13], [15, 15], [15, 16], [15, 17], [0, 18]], [[15, 15], [15, 16], [15, 17], [0, 18], [15, 20]], [[15, 16], [15, 17], [0, 18], [15, 20], [19, 21]], [[15, 17], [0, 18], [15, 20], [19, 21], [21, 22]], [[0, 18], [15, 20], [19, 21], [21, 22], [25, 23]], [[15, 20], [19, 21], [21, 22], [25, 23], [26, 25]], [[19, 21], [21, 22], [25, 23], [26, 25], [24, 26]], [[21, 22], [25, 23], [26, 25], [24, 26], [24, 27]]]

3) Since you don't have a reading for every second, some of these chunks may cover time spans that are not really interval_length seconds long, so let's restrict our calculations to the ones that are. For 5 seconds, it happens that no data needs to be dropped and the intermediate result is the same as step 2.
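
For instance, checking the first chunk from step 2 against that condition (just a small illustrative snippet, not part of the pipeline above):

chunk = [[0, 0], [15, 1], [17, 2], [19, 3], [18, 5]]

# time of the last reading minus time of the first reading
chunk.last.last - chunk.first.last
# => 5, equal to interval_length, so this chunk is kept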

4) Now we can take the rolling average:

 # => [13.8, 18.2, 20.0, 22.2, 22.8, 22.6, 21.4, 19.4, 16.8, 15.4, 15.0, 11.8, 12.0, 12.8, 14.0, 16.0, 21.2, 23.0, 24.0]

5) And the maximum thereof:

# => 24.0

Again, I'm not sure how this will fare on a really large data set, but it might be worth a try.
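
If you do want to see how it behaves on a bigger input, one option is to wrap the chain in a method and feed it synthetic data via the standard Benchmark library. The sketch below is only a rough harness under assumed conditions (a reading every 1 or 2 seconds for an hour, random speeds); the method body is just the pipeline from above:

require 'benchmark'

# the pipeline above, wrapped so it can be reused for different interval lengths
def max_rolling_average(speed, time, interval_length)
  speed.zip(time)
       .each_cons(interval_length)
       .select { |i| i.last.last - i.first.last == interval_length }
       .map { |i| i.map(&:first).reduce(:+) / interval_length.to_f }
       .max
end

# synthetic hour-long ride: one reading every 1 or 2 seconds, random speeds
sim_time = [0]
sim_time << sim_time.last + rand(1..2) while sim_time.last < 3600
sim_speed = sim_time.map { rand(0..40) }

# time a sweep over a range of interval lengths
puts Benchmark.measure { (2..60).each { |n| max_rolling_average(sim_speed, sim_time, n) } }.real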

O-I
  • This makes a lot of sense - it works for interval_lengths up to 8, but then breaks at 9 and above. I originally thought it could be due to the missing 9 second point, but it makes it past the missing 4 second point. Any idea why? It does seem so much faster. – user4740054 Apr 08 '15 at 00:38
  • Interestingly enough, for this data set sectioned off at 9, every actual interval length is 10 (0 to 10, 1 to 11, 2 to 12, etc.), so when we select, we get an empty array. Let me see if I can tweak it a bit. – O-I Apr 08 '15 at 01:20
  • The more that I look at this, the more I think it probably won't work. For example - the 10 second average speed starting at second 1 needs to look at the speed points from 1 second to 11 seconds (10 seconds, but only 9 data points in this case). I'm trying to move away from finding the index of those points since that seems to be what's slowing things down but that might not be possible. – user4740054 Apr 08 '15 at 03:16
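
To make the behaviour described in these comments concrete (just an illustration against the original sample data, not a fix): with interval_length set to 9, every run of 9 consecutive readings happens to span 10 seconds of clock time, so the select step filters out every chunk and the final max returns nil.

interval_length = 9

spans = speed.zip(time)
             .each_cons(interval_length)
             .map { |i| i.last.last - i.first.last }

spans.uniq
# => [10] (no run of 9 readings spans exactly 9 seconds, so the pipeline yields nil)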