
Is there a way to vectorize an operation that takes several numpy arrays and puts them into a list of dictionaries?

Here's a simplified example. The real scenario might involve more arrays and more dictionary keys.

import numpy as np
x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

print [dict(x=x[ii], y=y[ii], z=z[ii]) for ii in xrange(10)]

I might have thousands or hundreds of thousands of iterations in the xrange call. All the manipulation to create x, y, and z is vectorized (my real code is not as simple as the example above). So there's only one for loop left to get rid of, which I expect would result in huge speed-ups.

I've tried using map with a function to create the dict, and all sorts of other workarounds. It seems the Python for loop is the slow part (as usual). I'm sort of stuck with dictionaries because of a pre-existing API requirement. However, solutions without dicts, e.g. with record arrays or something, would be interesting to see, though ultimately I don't think those will work with the existing API.
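For reference, the map attempt mentioned above might look roughly like this (my sketch, not the exact code I ran); it still builds one dict per element at the Python level, which is why it didn't remove the per-element cost:

```python
import numpy as np

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

# map walks the three arrays in lockstep; list() forces evaluation on Python 3
records = list(map(lambda a, b, c: dict(x=a, y=b, z=c), x, y, z))
```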

durden2.0
  • `z=z[ii]`, good catch! – durden2.0 Nov 03 '16 at 09:49
  • `[dict(x=x_, y=y_, z=z_) for x_, y_, z_ in zip(x, y, z)]` this is vectorised as far as pure Python goes. – Eli Korvigo Nov 03 '16 at 09:50
  • Did you try with a list and dict comprehension? Is it too slow? – MMF Nov 03 '16 at 09:50
  • Do you want to get all the possible combinations of the values? – MMF Nov 03 '16 at 09:51
  • Yes, I know it cannot be vectorized in pure python. I was hoping for some vectorization with numpy or even cython, etc. – durden2.0 Nov 03 '16 at 09:53
  • @durden2.0 list comprehensions are just a fancy way of writing `map/filter`, so this is vectorisation. – Eli Korvigo Nov 03 '16 at 10:05
  • "x=x[ii]" assuming you are doing that for every array item, you are actually duplicating the complete array in a less efficient data structure (dictionary). That will cause tons of memory allocations, since there is no way to preallocate the dictionary (is there?). If your data is large enough, there is no solution with the current API. – Balzola Nov 03 '16 at 10:08
  • @EliKorvigo You're right, but I want to try to take the loop out of python, i.e. do the vectorization at the numpy level. – durden2.0 Nov 03 '16 at 10:08
  • @Balzola You're right, there is a lot of duplication but the memory is not the issue here. I'm trading memory for readability and having a name associated with the data. I'm not worried about memory usage, focused on CPU. – durden2.0 Nov 03 '16 at 10:13
  • I mean that the problem is maybe not the presence or absence of a loop but the construction of the list that causes repeated memory allocations as it grows. – Balzola Nov 03 '16 at 10:15
  • @Balzola Good point, each `x[ii]` allocates another object, so that could be optimized in the loop to reduce the allocations. – durden2.0 Nov 03 '16 at 10:18
  • @durden2.0 FYI, the `for` in listcomps has little to do with Python's general `for`. The former actually has a lower-level implementation and is faster than the latter. – Eli Korvigo Nov 03 '16 at 11:29
  • @EliKorvigo Good point. That's why I was trying `map` and list comprehensions first to speed things up. My scenario just doesn't lend itself very well to that. – durden2.0 Nov 03 '16 at 11:34
  • There's no compiled `numpy` function to create dictionaries. Those are python objects. The fast numpy code operates on array data buffers. – hpaulj Nov 03 '16 at 12:01
  • @hpaulj Yes you're right. That's why I mentioned I would be interested in solutions with record arrays too just for comparison. Again, the bottleneck here is not a dictionary, it's object allocations and for loops. Numpy can usually get rid of for loops which is what I was exploring. – durden2.0 Nov 03 '16 at 16:12

3 Answers


With your small example, I'm having trouble getting anything faster than the combination of list and dictionary comprehensions:

In [105]: timeit [{'x':i, 'y':j, 'z':k} for i,j,k in zip(x,y,z)]
100000 loops, best of 3: 15.5 µs per loop
In [106]: timeit [{'key':{'x':i, 'y':j, 'z':k}} for i,j,k in zip(x,y,z)]
10000 loops, best of 3: 37.3 µs per loop

The alternatives that use array concatenation to join the arrays before partitioning are slower.

In [108]: timeit [{'x':x_, 'y':y_, 'z':z_} for x_, y_, z_ in np.column_stack((x,y,z))]
....
10000 loops, best of 3: 58.2 µs per loop

=======================

A structured array is easiest with recfunctions:

In [109]: from numpy.lib import recfunctions
In [112]: M=recfunctions.merge_arrays((x,y,z))
In [113]: M.dtype.names=['x','y','z']
In [114]: M
Out[114]: 
array([(0, 10, 100), (1, 11, 101), (2, 12, 102), (3, 13, 103),
       (4, 14, 104), (5, 15, 105), (6, 16, 106), (7, 17, 107),
       (8, 18, 108), (9, 19, 109)], 
      dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
In [115]: M['x']
Out[115]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Construction is much slower, but if you want to access all the `x` values at once, it's much better than fetching them from all the dictionaries.

np.rec.fromarrays((x,y,z),names=['x','y','z'])

produces a recarray with given names. About the same speed.

I could also construct an empty array of the right dtype and shape and copy the arrays to it. That's probably as fast as this merge but more complicated to describe.

I'd suggest optimizing the data structure for use/access rather than construction speed. Generally you construct it once, and use it many times.

============

In [125]: dt=np.dtype([('x',x.dtype),('y',y.dtype),('z',z.dtype)])
In [126]: xyz=np.zeros(x.shape,dtype=dt)
In [127]: xyz['x']=x; xyz['y']=y; xyz['z']=z
# or for n,d in zip(xyz.dtype.names, (x,y,z)): xyz[n] = d
In [128]: xyz
Out[128]: 
array([(0, 10, 100), (1, 11, 101), (2, 12, 102), (3, 13, 103),
       (4, 14, 104), (5, 15, 105), (6, 16, 106), (7, 17, 107),
       (8, 18, 108), (9, 19, 109)], 
      dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
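If the pre-existing API still demands dictionaries at the end, the structured array converts back cheaply; `tolist()` on a structured array yields tuples of native Python scalars. A sketch of that round trip (my addition):

```python
import numpy as np

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

# Build the structured array field by field, as above
dt = np.dtype([('x', x.dtype), ('y', y.dtype), ('z', z.dtype)])
xyz = np.zeros(x.shape, dtype=dt)
for name, data in zip(xyz.dtype.names, (x, y, z)):
    xyz[name] = data

# tolist() gives one tuple per record, with plain Python ints,
# so the resulting dicts hold native scalars rather than numpy ones
records = [dict(zip(xyz.dtype.names, row)) for row in xyz.tolist()]
```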
hpaulj

Here is one (Num)?Pythonic way:

In [18]: names = np.array(['x', 'y', 'z'])
In [38]: map(dict, np.dstack((np.repeat(names[None, :], 10, axis=0), np.column_stack((x, y, z)))))
Out[38]: 
[{'x': '0', 'y': '10', 'z': '100'},
 {'x': '1', 'y': '11', 'z': '101'},
 {'x': '2', 'y': '12', 'z': '102'},
 {'x': '3', 'y': '13', 'z': '103'},
 {'x': '4', 'y': '14', 'z': '104'},
 {'x': '5', 'y': '15', 'z': '105'},
 {'x': '6', 'y': '16', 'z': '106'},
 {'x': '7', 'y': '17', 'z': '107'},
 {'x': '8', 'y': '18', 'z': '108'},
 {'x': '9', 'y': '19', 'z': '109'}]

Also, note that if you don't need all of the dictionaries at once, you can simply create a generator and access each item on demand.

(dict(x=x[ii], y=y[ii], z=z[ii]) for ii in xrange(10))
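On Python 3, where `xrange` is gone, the same lazy approach can be written with `zip` over the arrays (a sketch):

```python
import numpy as np

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

# Generator expression: nothing is materialized until the consumer iterates
gen = (dict(x=a, y=b, z=c) for a, b, c in zip(x, y, z))

first = next(gen)  # builds only the first dict
```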

If you want a nested dictionary, I suggest a list comprehension:

In [88]: inner = np.dstack((np.repeat(names[None, :], 10, axis=0), np.column_stack((x, y))))

In [89]: [{'connection': d} for d in map(dict, inner)]
Out[89]: 
[{'connection': {'x': '0', 'y': '10'}},
 {'connection': {'x': '1', 'y': '11'}},
 {'connection': {'x': '2', 'y': '12'}},
 {'connection': {'x': '3', 'y': '13'}},
 {'connection': {'x': '4', 'y': '14'}},
 {'connection': {'x': '5', 'y': '15'}},
 {'connection': {'x': '6', 'y': '16'}},
 {'connection': {'x': '7', 'y': '17'}},
 {'connection': {'x': '8', 'y': '18'}},
 {'connection': {'x': '9', 'y': '19'}}]
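One caveat, noted by hpaulj in the comments below: stacking the names together with the values pushes everything through a common string dtype, so the dict values above are strings, not integers. A sketch (my addition) that keeps the native ints by zipping the names with the numeric rows instead:

```python
import numpy as np

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

names = ('x', 'y', 'z')
# column_stack keeps the integer dtype; tolist() converts rows to
# plain Python ints; zip pairs each name with the matching value
out = [dict(zip(names, row)) for row in np.column_stack((x, y, z)).tolist()]
```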
Mazdak
  • Nice! What about if I need to create nested dictionaries. For example, `{'connection': {'x': x[ii], 'y': y[ii]}}`. I realized I simplified my question a bit too much for my scenario. – durden2.0 Nov 03 '16 at 10:01
  • @durden2.0 You mean you want a nested dictionary for all the items? And without `z`, or is it optional? – Mazdak Nov 03 '16 at 10:07
  • Yes, a nested dictionary for all the items; `z` is required in my solution but doesn't really matter too much for the solution itself. The key part is building the nested dictionary without doing a Python for loop. – durden2.0 Nov 03 '16 at 10:09
  • @durden2.0 A list comprehension's `for` is not like Python's `for` loop, because its iteration is implemented in C. But if you want to do this in NumPy after all: since you want a Python object, your code needs to interact with the upper level (Python), which means you can't write purely numpythonic code. – Mazdak Nov 03 '16 at 10:18
  • @Kasramvd `"List comprehension's for is not like pythons for loop, cause its iteration has been implemented in C"`, really? [I wonder why they are slow then?](http://stackoverflow.com/questions/33551962/list-comprehension-with-numpy-arrays-bad-practice). One more relevant link - http://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops – Divakar Nov 03 '16 at 10:27
  • @Divakar Yes, but it doesn't mean that it's as fast as pure `C`; certainly this code needs to be translated from Python, and during this process there are a lot of pushes and pops on the stack, function calls, and/or etc. See http://stackoverflow.com/questions/30245397/why-is-list-comprehension-so-faster/30245465#30245465 – Mazdak Nov 03 '16 at 10:30
  • I prefer this solution because it doesn't require pandas and it's still concise. – durden2.0 Nov 03 '16 at 11:35
  • My guess is that dict creation times dominate here, and drown out variations in loop mechanism times. But we'd actually have to run some timings. – hpaulj Nov 03 '16 at 11:45
  • Note that the values are strings, not integers. That's due to joining names with values, producing an array of strings. – hpaulj Nov 03 '16 at 16:51
  • @hpaulj Yes, dict creation is the primary bottleneck. – Mazdak Nov 03 '16 at 18:01

Here's an approach using a mix of NumPy and Pandas -

# Stack into columns & create a pandas dataframe with appropriate col names
a = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
df = pd.DataFrame(a, columns=['x','y','z'])

# Convert to list of dicts
out = df.T.to_dict().values()

Sample run -

In [52]: x
Out[52]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]: y
Out[53]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [54]: z
Out[54]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

In [55]: out
Out[55]: 
[{'x': 0, 'y': 10, 'z': 100},
 {'x': 1, 'y': 11, 'z': 101},
 {'x': 2, 'y': 12, 'z': 102},
 {'x': 3, 'y': 13, 'z': 103},
 {'x': 4, 'y': 14, 'z': 104},
 {'x': 5, 'y': 15, 'z': 105},
 {'x': 6, 'y': 16, 'z': 106},
 {'x': 7, 'y': 17, 'z': 107},
 {'x': 8, 'y': 18, 'z': 108},
 {'x': 9, 'y': 19, 'z': 109}]
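For what it's worth, reasonably recent pandas versions can skip the transpose round trip: `DataFrame.to_dict(orient='records')` returns the list of row dicts directly (a sketch of the same idea):

```python
import numpy as np
import pandas as pd

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

df = pd.DataFrame({'x': x, 'y': y, 'z': z})
# orient='records' emits one dict per row, in row order
out = df.to_dict(orient='records')
```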
Divakar
  • Clever solution! I simplified my setup, but in reality I'm creating a nested dictionary, so I'll try to tweak this a bit. For example, the list of dicts I'm returning is actually like `[{'connection': {'xy': x[ii], 'yy': y[ii]}}]`. – durden2.0 Nov 03 '16 at 09:59
  • @durden2.0 Could you edit the loop comprehension code in the question for that requirement? – Divakar Nov 03 '16 at 10:12