
I'm looking for help understanding how to optimize some array processing that targets PostGIS-compatible data types. The input data looks like this:

{
    "items": [
        {
            "id": 10000,
            "coords": [[644, 1347, 1], [653, 1353, 1], [637, 1358, 1], [633, 1362, 1]]
        }
    ]
}

Here is what I've tried:

import json
import numpy
import ppygis
import time

start_time = time.time()
with open('example.json') as fp:
    d = json.load(fp)

print("file load time:")
print(time.time() - start_time)

"""
standard python
"""

start_time = time.time()
py_array = d['items'][0]['coords']

print("array creation:")
print(time.time() - start_time)

start_time = time.time()
a = [' '.join(map(str, c)) for c in py_array]
b = '(' + ') ('.join(a) + ')'

print("python array string processing time:")
print(time.time() - start_time)

start_time = time.time()
c = [ppygis.Point(p[0], p[1], p[2]) for p in py_array]

print("python array ppygis:")
print(time.time() - start_time)

"""
numpy
"""

start_time = time.time()
numpy_array = numpy.array(d['items'][0]['coords'])

print("numpy array creation:")
print(time.time() - start_time)

start_time = time.time()
a = [' '.join(map(str, c)) for c in numpy_array]
b = '(' + ') ('.join(a) + ')'

print("numpy array string processing time:")
print(time.time() - start_time)

start_time = time.time()
c = [ppygis.Point(p[0], p[1], p[2]) for p in numpy_array]

print("numpy array ppygis:")
print(time.time() - start_time)

This is the output:

file load time:
8.29696655273e-05
array creation:
2.86102294922e-06
python array string processing time:
1.09672546387e-05
python array ppygis:
8.10623168945e-06
numpy array creation:
1.31130218506e-05
numpy array string processing time:
0.000116109848022
numpy array ppygis:
3.60012054443e-05

Why are the operations on the numpy arrays so much slower than on the plain Python lists?

jfarr
1 Answer

As a general rule, iterative operations on numpy arrays are slower than the equivalent operations on lists. That's partly because creating an array from a list takes time, whether at the initial creation or as an intermediate step. Numpy arrays gain their speed advantage when you perform compiled operations on them - operations where the iteration takes place at compiled speed rather than in the interpreter.

In your example, the dictionary source isn't important; we can work with the nested list directly:

In [372]: arr = d['items'][0]['coords']
In [373]: arr
Out[373]: [[644, 1347, 1], [653, 1353, 1], [637, 1358, 1], [633, 1362, 1]]
In [374]: narr=np.array(arr)
In [375]: narr.shape
Out[375]: (4, 3)
In [376]: narr
Out[376]: 
array([[ 644, 1347,    1],
       [ 653, 1353,    1],
       [ 637, 1358,    1],
       [ 633, 1362,    1]])

Take the simple task of adding 1 to all values:

In [379]: timeit narr+1   # compiled operation
The slowest run took 16.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.09 µs per loop
In [380]: timeit [[i+1 for i in j] for j in arr]
100000 loops, best of 3: 5.67 µs per loop
In [381]: timeit np.array([[i+1 for i in j] for j in narr])
The slowest run took 28.05 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.3 µs per loop

Your string formatting case:

In [383]: a=[' '.join(map(str, c)) for c in arr]
In [384]: a
Out[384]: ['644 1347 1', '653 1353 1', '637 1358 1', '633 1362 1']
In [385]: b='('+') ('.join(map(str,a)) + ')'
In [386]: b
Out[386]: '(644 1347 1) (653 1353 1) (637 1358 1) (633 1362 1)'

or with list comprehensions instead of map:

In [396]: '(%s)'%') ('.join([' '.join([str(i) for i in c]) for c in arr])
Out[396]: '(644 1347 1) (653 1353 1) (637 1358 1) (633 1362 1)'

Performing this same action on narr.tolist() is nearly as fast as working with the original list, and faster than iterating over narr directly.
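As a quick sketch (using the same four coordinate rows as above): one bulk tolist() conversion, then pure-Python formatting, avoids repeated per-element array access.

```python
import numpy as np

narr = np.array([[644, 1347, 1], [653, 1353, 1], [637, 1358, 1], [633, 1362, 1]])

# one bulk conversion back to nested Python lists, then pure-Python formatting
rows = narr.tolist()
b = '(' + ') ('.join(' '.join(map(str, row)) for row in rows) + ')'
print(b)  # (644 1347 1) (653 1353 1) (637 1358 1) (633 1362 1)
```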

[c for c in narr] produces a list of four 1-D arrays; [str(i) for i in c] then has to iterate over each of those subarrays.
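A small sketch of why that's slow: iterating over the 2-D array yields 1-D arrays rather than lists, and indexing into those yields numpy scalars, each of which carries extra overhead when you call str() on it.

```python
import numpy as np

narr = np.array([[644, 1347, 1], [653, 1353, 1], [637, 1358, 1], [633, 1362, 1]])

rows = [c for c in narr]   # four 1-D arrays, not four lists
print(type(rows[0]))       # <class 'numpy.ndarray'>
# each element is a numpy integer scalar (exact type is platform-dependent),
# so every str(i) call pays extra overhead compared to a plain Python int
print(type(rows[0][0]))
```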

Making a list of Point objects from a numpy array is a poor use of the structure.

You could make a structured array and access the 'x' values for a whole set of points at once. That's the key: use arrays when you want to work with the whole structure, or at least whole rows and columns.

In [436]: parr = narr.view([('x', '<i4'), ('y', '<i4'), ('z', '<i4')]).squeeze()

In [437]: parr
Out[437]: 
array([(644, 1347, 1), (653, 1353, 1), (637, 1358, 1), (633, 1362, 1)], 
      dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4')])

In [438]: parr[0]
Out[438]: (644, 1347, 1)

In [439]: parr['x']
Out[439]: array([644, 653, 637, 633])
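A self-contained sketch of that whole-structure style (dtype='<i4' is forced here so the view works regardless of the platform's default integer size, which the session above assumed): field access and arithmetic then stay in compiled code.

```python
import numpy as np

narr = np.array([[644, 1347, 1], [653, 1353, 1], [637, 1358, 1], [633, 1362, 1]],
                dtype='<i4')
# reinterpret each 3-int row as one (x, y, z) record
parr = narr.view([('x', '<i4'), ('y', '<i4'), ('z', '<i4')]).squeeze()

xs = parr['x']              # all x coordinates at once
ys_shifted = parr['y'] + 10 # vectorized arithmetic on a single field
print(xs.tolist())          # [644, 653, 637, 633]
```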
hpaulj
  • Thanks for the detailed info, I'm still not getting the performance I'd like but feel like I'm making progress. – jfarr Sep 01 '16 at 18:21
  • What do you think of the closure? Does the linked question help? What kind of performance do you expect? One of your cases is string formatting, the other object creation. Neither makes much use of compiled numpy array functionality. – hpaulj Sep 01 '16 at 18:46
  • The requirements changed while I was working on this, my source data will be in shape files now :) I'd like to return to this for personal development sometime, but right now I don't have the bandwidth. I'll update this thread when I'm working on this again, thanks for the help! – jfarr Sep 18 '16 at 02:00