2

If I have a list of numbers, such as `l = [3, 5, 3, 6, 47, 89]`, I can calculate the minimum, maximum, and average with the following Python code:

minimum = min(l)
maximum = max(l)
avg = sum(l) / len(l)

Since each of these iterates over the entire list, this is slow for large lists and takes a lot of code. Is there a Python module that can calculate all of these values together?

Sirish
  • Under the hood, Python lists are arrays, with O(1) access. So it's not particularly slow to iterate over them. Just saying. – unwind Mar 04 '14 at 15:00
  • possible duplicate of [numpy: function for simultaneous max() and min()](http://stackoverflow.com/questions/12200580/numpy-function-for-simultaneous-max-and-min) – YXD Mar 04 '14 at 15:00
  • Any c++ programmer would puke at the idea of traversing 3 times. Which is why we have `std::minmax` now and similar goodies. – Bathsheba Mar 04 '14 at 15:02
  • 2
    @unwind: How can unordered arrays have O(1) for finding the minimum and maximum? – Sirish Mar 04 '14 at 15:03
  • 1
    @Sirish I didn't say that, I said that *iterating over a list* isn't too slow, since each element is accessed in O(1) time. – unwind Mar 04 '14 at 15:04
  • 1
    have you actually tried timing it to see how slow it is? optimizations that make sense in C, don't always make sense in python. – Corley Brigman Mar 04 '14 at 15:05
  • Sorry for the trigger happy duplicate marking. This is not a numpy question. – YXD Mar 04 '14 at 15:06
  • 2
    Interestingly, I've found that iterating once yourself is [slower](http://pastebin.com/Cj28nxeB) than letting the built-in functions iterate three times separately (a sketch of such a single-pass loop follows these comments). – Kevin Mar 04 '14 at 15:12
  • 3
    on the optimization analysis: traversing multiple times only really hurts you when traversing items is expensive, or you can reuse partial computations per entry. in this case, you have to do the same 2 comparisons + (up to) 3 assignments per item, and they are independent, whether you do a single traversal or multiple traversal. since lookups are O(1), traversals are not expensive, and the benefits from using built-ins (which are C primitives) greatly outweigh any possible benefits from a different algorithm implemented in pure python. – Corley Brigman Mar 04 '14 at 15:15
  • 1
    @CorleyBrigman You only have to do up to 2 assignments and 1 or 2 comparisons per item; once seeded with the first element of the list, no other value can simultaneously be a new min and a new max, so in a randomized list, you should be able to get away with less than 2 (on average) comparisons per element (and 1.5 assignments as well, as you only have to accumulate on each item, while roughly 1/2 the elements of the list should be neither a new min or max). In this case though, the advantages of builtins still appear to outweigh any optimizations from folding the operations together. – Silas Ray Mar 04 '14 at 15:40
  • 1
    @SilasRay - all that is true. though you probably don't save a lot of comparisons - on a truly randomly-distributed list, you'd get far less than 1/2 the elements being either a new min or max. on a sample run here of 10000 items randomly sampled from a space of about 10^7, an iterative solution only stepped the max 10 times. – Corley Brigman Mar 04 '14 at 15:49
  • 1
    @CorleyBrigman Interesting, though I guess now that I think about it more, that does make sense. I guess ultimately, what it comes down to is the nature of `l` from the OP. There might be optimizations if the data can be expected to follow trends, but not if it is really randomized. – Silas Ray Mar 04 '14 at 15:57
  • Using cython gave about an order of magnitude speedup, compared to using the 3 builtins. – M4rtini Mar 04 '14 at 16:07
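
For reference, a single-pass pure-Python loop of the kind Kevin timed might look like the sketch below (my reconstruction; the pastebin code may differ). Even though it touches each element only once, the interpreted loop body typically loses to `min`, `max`, and `sum`, which each iterate in C.

def single_pass_stats(values):
    # Seed min/max/total with the first element.
    it = iter(values)
    first = next(it)
    lo = hi = first
    total = first
    count = 1
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:  # no value can be both a new min and a new max
            hi = v
        total += v
        count += 1
    return lo, hi, total / count

l = [3, 5, 3, 6, 47, 89]
print(single_pass_stats(l))  # (3, 89, 25.5)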

2 Answers

3

If you have pandas installed, you can do something like this:

import numpy as np
import pandas
s = pandas.Series(np.random.normal(size=37))
stats = s.describe()

`stats` will be another Series that behaves like a dictionary:

print(stats)
count    37.000000
mean      0.072138
std       0.932000
min      -1.267888
25%      -0.688728
50%      -0.048624
75%       0.784244
max       2.501713
dtype: float64

stats['max']
2.501713
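
For the three values in the question specifically, you can unpack them in one line (`mean` is `describe()`'s name for the average), and so on for the other keys:

minimum, maximum, average = stats['min'], stats['max'], stats['mean']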

However, I don't recommend this unless you're simply striving for concise code. Here's why:

%%timeit
stats = s.describe()
# 100 loops, best of 3: 1.44 ms per loop

%%timeit
mymin = min(s)
mymax = max(s)
myavg = sum(s)/len(s)
# 10000 loops, best of 3: 89.5 µs per loop

I just can't imagine that your own implementation will be able to squeeze any more performance out of this than the built-in functions already do (barring some Cython voodoo, maybe).

Paul H
3

Cython function:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def minmaxAvg(list x):
    cdef int i
    cdef int _min, _max, total
    # Seed min/max with the first element.
    _min = x[0]
    _max = x[0]
    total = 0
    for i in x:
        if i < _min: _min = i
        elif i > _max: _max = i
        total += i
    # Divide as floats so the average isn't truncated to an integer.
    return _min, _max, total / float(len(x))
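
To actually call this, the function has to be compiled first. One minimal sketch is a setuptools script, assuming the code above is saved as minmaxavg.pyx (the filename is my assumption); build with `python setup.py build_ext --inplace`, then `from minmaxavg import minmaxAvg`:

# setup.py -- minimal build script for the Cython module above
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("minmaxavg.pyx"))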

Pure Python function to compare against:

def builtinfuncs(x):
    a = min(x)
    b = max(x)
    avg = sum(x) / len(x)
    return a,b,avg


In [16]: x = [random.randint(0,1000) for _ in range(10000)]

In [17]: %timeit minmaxAvg(x)
10000 loops, best of 3: 34 µs per loop

In [18]: %timeit builtinfuncs(x)
1000 loops, best of 3: 460 µs per loop

Disclaimer:
- The speed result from Cython will depend on your hardware.
- This is not as flexible or foolproof as using the builtins; for example, you would have to change the function to handle anything other than integers.
- Before going down this path, you should ask yourself whether this operation really is a big bottleneck in your application. It probably isn't.

M4rtini