
I have a function that works something like this:

import random
import numpy as np

def Function(x):
   a = random.random()
   b = random.random()
   c = OtherFunctionThatReturnsAThreeColumnArray()
   results = np.zeros((1,5))
   results[0,0] = a
   results[0,1] = b
   results[0,2] = c[-1,0]
   results[0,3] = c[-1,1]
   results[0,4] = c[-1,2]
   return results

What I'm trying to do is run this function many, many times, appending each returned one-row, five-column result to a running data set. As I understand it, though, both the append function and a for-loop are ruinously inefficient. I'm trying to improve my code, and the number of runs will be large enough that this kind of inefficiency isn't doing me any favors.

What's the best way to do the following with the least overhead:

  1. Create a new numpy array to hold the results
  2. Insert the results of N calls of that function into the array from step 1?
Fomite
  • Could you generate all the `a`s, `b`s and `c`s at once? In your toy example, you could do `a, b = np.random.rand(2, n)` for example. If you can do something similar with `c`, then `hstack`ing those 3 arrays, possibly transposed, will beat by a lot your accepted answer. – Jaime Feb 24 '13 at 15:40
  • @Jaime No. While I could generate A and B all at once, c calls a function that takes a very, very long time, and needs a and c. – Fomite Feb 26 '13 at 00:38

1 Answer


You're correct in thinking that repeated calls to numpy.append or numpy.concatenate are going to be expensive: each call allocates a new array and copies both existing arrays into it.

The best suggestion (if you know how much space you're going to need in total) would be to allocate the array before you run your routine, and then put the results in place as they become available.

If you're going to run this nrows times, then

results = np.zeros([nrows, 5])

and then add your results

def function(x, i, results):
    <.. snip ..>
    results[i,0] = a
    results[i,1] = b
    results[i,2] = c[-1,0]
    results[i,3] = c[-1,1]
    results[i,4] = c[-1,2]
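
To show how the preallocated array and the in-place function fit together, here is a minimal sketch of the calling loop. `other_function` is a hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray, and `fill_row` and `run` are illustrative names that do not appear in the question.

```python
import random
import numpy as np

def other_function(a, b):
    # Hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray.
    return np.array([[a, b, a + b], [a * b, a - b, 1.0]])

def fill_row(i, results):
    # Write one run's output directly into row i; nothing is returned.
    a = random.random()
    b = random.random()
    c = other_function(a, b)
    results[i, 0] = a
    results[i, 1] = b
    results[i, 2:] = c[-1, :]  # last row of c fills columns 2..4

def run(nrows):
    results = np.zeros((nrows, 5))  # allocated once, up front
    for i in range(nrows):
        fill_row(i, results)
    return results
```

Calling `run(1000)` would return a 1000 by 5 array; the loop itself is cheap compared with a slow `other_function`, so the main saving is avoiding a reallocation on every iteration.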

Of course, if you don't know how many times you're going to be running function, this won't work. In that case, I'd suggest a less elegant approach:

  1. Declare a possibly large results array and add to results[i, x] as above (keeping track of i and the size of results).

  2. When you reach the size of results, use numpy.append (or concatenate) to extend it with a new block of empty rows. This is less bad than appending on every iteration and shouldn't destroy performance, but you will have to write some wrapper code.
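
The wrapper code for that two-step approach might look roughly like the sketch below. The name `collect_rows`, the doubling strategy, and the default capacity are assumptions chosen for illustration; any growth policy that extends the buffer in large blocks would serve the same purpose.

```python
import numpy as np

def collect_rows(row_source, initial_capacity=1024, ncols=5):
    # row_source: any iterable yielding length-ncols sequences (assumption).
    results = np.zeros((max(1, initial_capacity), ncols))
    n = 0  # number of rows filled so far
    for row in row_source:
        if n == results.shape[0]:
            # Buffer is full: double it with a single concatenate.
            results = np.concatenate([results, np.zeros_like(results)])
        results[n] = row
        n += 1
    return results[:n]  # trim the unused tail
```

Because the buffer doubles each time it fills, the number of concatenations grows only logarithmically with the number of rows.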

There are other ideas you could pursue. Off the top of my head you could

  1. Write the results to disk. Depending on the speed of OtherFunctionThatReturnsAThreeColumnArray and the size of your data, this may not be too daft an idea.

  2. Save your results with a list comprehension (forgetting numpy until after the run). If function returned (a, b, c) rather than results:

results = [function(x) for x in my_data]

and now do some shuffling to get results into the form you need.
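
One possible form of that shuffling is sketched here, assuming `function` returns a flat five-element tuple rather than (a, b, c). `other_function` is again a hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray, and `my_data` is placeholder input.

```python
import random
import numpy as np

def other_function(a, b):
    # Hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray.
    return np.array([[0.0, 0.0, 0.0], [a + b, a * b, a - b]])

def function(x):
    a = random.random()
    b = random.random()
    c = other_function(a, b)
    # Return plain Python values; no numpy array is built per call.
    return (a, b, c[-1, 0], c[-1, 1], c[-1, 2])

my_data = range(100)  # placeholder input
rows = [function(x) for x in my_data]
results = np.array(rows)  # one conversion at the end, shape (100, 5)
```

Appending to a Python list is cheap, so the only array allocation happens in the final `np.array` call.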

danodonovan
  • How does the function then get called? – Fomite Feb 24 '13 at 13:20
  • You call `function(x, i, results)` passing in `i` and `results` which you have already declared. `results` is something like `results = np.zeros((1000, 5))` for 1000 runs. – danodonovan Feb 24 '13 at 13:24
  • I presume writing the results to disk works best if OtherFunction is slow? So the processor isn't waiting for the disk to free up? – Fomite Feb 24 '13 at 13:24
  • Exactly - even better would be to spawn a separate process to do the writing whilst the worker process is running, then you have no waiting on either side. – danodonovan Feb 24 '13 at 13:26
  • And the function call above - within a for-loop? – Fomite Feb 24 '13 at 13:26
  • A well constructed `for` loop would be fine - you at least know how many times it will run and so can declare `results` accordingly. – danodonovan Feb 24 '13 at 13:28
  • One last bit - within your suggested function, I'm setting results[i,X] directly, so there's nothing necessarily to return, correct? – Fomite Feb 24 '13 at 13:30
  • `results` is added to in place, and your calling function will have the same `results` array so no need to return cf "[Python is Pass By Reference](http://stackoverflow.com/questions/534375/passing-values-in-python)" – danodonovan Feb 24 '13 at 13:36
  • It doesn't seem a very numpythonic approach. The least would be to do `results[i, 2:] = c[-1, :3]`. But the key should be to get rid of the python looping. – Jaime Feb 24 '13 at 15:44