
I'm completely new to numpy and unable to find a solution. I have a 2D list of floating-point numbers in Python like:

list1[0..8][0..2] 

Where e.g.:

print(list1[0][0])
> 0.1122233784

Now I want to find min and max values:

b1 = numpy.array(list1)
list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)

I need to do this about a million times in a loop.

It works correctly, but it's about 3x slower than my previous native python approach.

(1:15 min with numpy vs. 0:25 min native)

What am I doing wrong? I've read that the list conversion could be the problem, but I don't know how to do it better.

EDIT

As requested, some non-pseudo code, although in my script the list is created in another way.

import numpy
import random

def moonPositionNow():
    # Assume we read from a file, line by line:
    # nextChunk = readNextLine()
    # The file is built like this:
    # x-coord
    # y-coord
    # z-coord
    # x-coord
    # ...
    # But we don't have that data here, so as a **placeholder** we return a random number.
    nextChunk = random.random()
    return nextChunk

for w in range(1000000):        
    list1 = [[moonPositionNow() for i in range(3)] for j in range(9)]
    b1 = numpy.array(list1)
    list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
    list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)        

#Print out results   

Although the list creation may be a bottleneck here, I guarantee it is not the problem in the original code.

EDIT2:

Updated the example code to clarify: I don't need a numpy array of random numbers.

bortran
    Why not use the native approach? And could you provide some copiable test code for us to work with? – Gullydwarf Feb 05 '15 at 09:43
  • The purpose of using numpy was to speed up the code. The native code is quite optimized, yet I thought numpy could squeeze out a bit more. – bortran Feb 05 '15 at 10:08

2 Answers


Since your data is available as a Python list it seems reasonable to me that a native implementation (which likely calls some optimized C code) could be faster than converting to numpy first and then calling optimized C code.

You basically loop over your data twice: once for converting the python objects to numpy arrays, and once for computing the maximum or minimum. The native implementation (I assume it is something like calling min/max on the Python list) only needs to loop over the data once.

Furthermore, it seems that numpy's min/max functions are surprisingly slow: https://stackoverflow.com/a/12200671/3005167
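For completeness, here is a sketch of what I assume the native approach looks like: a single pass over the rows that tracks per-column min and max at the same time (the function name and sample data are made up for illustration):

```python
def minmax_columns(rows):
    # Seed min/max with the first row, then scan the remaining rows once.
    mins = list(rows[0])
    maxs = list(rows[0])
    for row in rows[1:]:
        for k, v in enumerate(row):
            if v < mins[k]:
                mins[k] = v
            elif v > maxs[k]:
                maxs[k] = v
    return mins, maxs

rows = [[0.3, 0.1, 0.5], [0.2, 0.4, 0.0], [0.9, 0.1, 0.2]]
mins, maxs = minmax_columns(rows)
# mins == [0.2, 0.1, 0.0], maxs == [0.9, 0.4, 0.5]
```

For a tiny 9×3 list this avoids both the array conversion and the two separate reduction passes, which is consistent with the timings in the question.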

MB-F
  • I guess you're right. I mean it's not even calling C code, I just calculate min and max of actually 2 lists (sorry, didn't mention) and all 3 dimensions in one loop, which I can also reduce to n-1 iterations by assigning the first value of the list to min/max. That vs. 2x conversion + 2x np.min/np.max for 3 dimensions. – bortran Feb 05 '15 at 12:55

The problem arises because you are passing a python list to a numpy function. The numpy function is significantly faster if you pass a numpy array as the argument.

import numpy as np

#Create numpy numbers
nptest = np.random.uniform(size=(10000, 10))
#Create a native python list
listtest = list(nptest)
#Compare performance
%timeit np.min(nptest, axis=0)
%timeit np.min(listtest, axis=0)

Output

1000 loops, best of 3: 394 µs per loop
100 loops, best of 3: 20 ms per loop

EDIT: Added example on how to evaluate a cost function over a grid.

The following evaluates a quadratic cost function over a grid and then takes the minimum along the first axis. In particular, np.meshgrid is your friend.

import numpy as np

def cost_function(x, y):
    return x ** 2 + y ** 2

x = np.linspace(-1, 1)
y = np.linspace(-1, 1)

def eval_python(x, y):
    #Nested comprehension so the result is a 2D matrix, matching the numpy version
    matrix = [[cost_function(_x, _y) for _x in x] for _y in y]
    return np.min(matrix, axis=0)

def eval_numpy(x, y):
    xx, yy = np.meshgrid(x, y)
    matrix = cost_function(xx, yy)
    return np.min(matrix, axis=0)

%timeit eval_python(x, y)
%timeit eval_numpy(x, y)

Output

100 loops, best of 3: 13.9 ms per loop
10000 loops, best of 3: 136 µs per loop

Finally, if you cannot cast your problem in this form, you can preallocate the memory and then fill in each element.

num_x, num_y = len(x), len(y)
matrix = np.empty((num_x, num_y))
for i in range(num_x):
    for j in range(num_y):
        #Evaluate at the grid values, not the indices
        matrix[i, j] = cost_function(x[i], y[j])
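For the asker's actual situation, where a function returns one coordinate per call, one way to skip the intermediate Python list entirely is `np.fromiter`, which fills a flat array straight from a generator and can then be reshaped. This is only a sketch, reusing the `moonPositionNow` placeholder from the question:

```python
import numpy as np
import random

def moonPositionNow():
    #Placeholder from the question: stands in for a chunk-by-chunk API call
    return random.random()

#Fill a flat array of 27 floats directly from the generator, then view it as 9 rows of 3
b1 = np.fromiter((moonPositionNow() for _ in range(27)), dtype=float, count=27).reshape(9, 3)
mins = b1.min(axis=0)
maxs = b1.max(axis=0)
```

Passing `count` lets numpy allocate the array up front instead of growing it, though for such a small array the min/max calls will still dominate.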
Till Hoffmann
  • Thanks, but the problem is that I have a native python list and need to work with that initially. So could a more efficient conversion be done? – bortran Feb 05 '15 at 10:13
  • Can you build a numpy array straight-away instead of building a python list first? – Till Hoffmann Feb 05 '15 at 10:15
  • If you take this as example, how do I do that? list1 = [[random.random() for i in range(3)] for j in range(9)] EDIT: but not with random, instead with say a myCustomFunction(). – bortran Feb 05 '15 at 10:17
  • I tried b1 = numpy.array([[myCustomFuction() for i in range(3)] for j in range(9)]) but that seemed to be even slower. – bortran Feb 05 '15 at 10:26
  • I'm really sorry, but I don't know how to apply this meshgrid method to my problem. My example function moonPositionNow() doesn't depend on parametric values like moonPositionNow(x, y). I also don't have access to moonPositionNow(), it will just return the needed static data every time it's called. – bortran Feb 05 '15 at 10:58
  • Another example above that does not make use of meshgrid. – Till Hoffmann Feb 05 '15 at 11:20
  • Thanks again for the effort, Till. I just tried the way with the empty matrix, it seems a bit faster than my first numpy approach but still takes almost twice the time of the native python approach. – bortran Feb 05 '15 at 11:54
  • How did you implement the native python approach? If you really are just reading data from a file, have a look at `np.loadtxt`. – Till Hoffmann Feb 05 '15 at 11:59
  • No I don't load data, it was just an example for the way the function works. It's a third-party API function that returns the data only that way, chunk by chunk. – bortran Feb 05 '15 at 12:12