
The following code does not do what I want, which is to convert each inner sequence to a NumPy array, giving me the option to retrieve the values with multiple indices.

import numpy as np
a = np.asarray([[1,2,3],[2,3,4,5]])
print a[0, 0]

The output is this error:

IndexError: too many indices 

However, what I want it to retrieve is 1, because the first value of the first inner list is 1. How can I make such a conversion happen?
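For what it's worth, the inner lists do survive as Python objects inside a 1-d object array, so chained indexing reaches the 1 even though tuple indexing does not. A sketch in current NumPy syntax, where the object dtype now has to be spelled out for ragged input:

```python
import numpy as np

# Ragged input: recent NumPy versions require an explicit object dtype here.
a = np.asarray([[1, 2, 3], [2, 3, 4, 5]], dtype=object)

print(a.shape)   # (2,) -- a 1-d array holding two list objects
print(a[0][0])   # chained indexing retrieves the 1
# a[0, 0] raises IndexError: too many indices, since the array is only 1-d
```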

Update: Interestingly, when I try something like:

a=np.asarray([np.asarray([1,2,3]),np.asarray([2,3,4,5])])
b=np.asarray([np.asarray([1,2,3]),np.asarray([2,3,4,5])])
print np.multiply(a,b)

that generates the desired output, which is element-by-element multiplication:

[array([1, 4, 9]) array([ 4,  9, 16, 25])]
Cupitor

2 Answers


You can't convert your example directly to a 2-d NumPy array because the rows have differing lengths. The result you are getting is a 1-d NumPy array which holds Python list objects. I've seen what you're trying to do referred to as a jagged (or ragged) array, but I'm not sure whether that's any kind of official term.

You could pad the elements with zeros, use a sparse matrix, or simply not convert to NumPy; it depends on your overall goal.
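A minimal zero-padding sketch (the row data is just the example from the question):

```python
import numpy as np

rows = [[1, 2, 3], [2, 3, 4, 5]]
width = max(len(r) for r in rows)

# Pad each row with zeros up to the longest length, then stack into a 2-d array.
padded = np.array([r + [0] * (width - len(r)) for r in rows])
print(padded)
# [[1 2 3 0]
#  [2 3 4 5]]
```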

To get you started, here's how you can set up a masked array from a jagged array and compute the sum along an axis. Someone who uses this module more than I do may be able to suggest something more efficient or idiomatic:

>>> a = [[[1,2,3],[2,3,4,5], [2, 2]],[[3,4,5,6,7],[1],[2,3,10]]]
>>> D = max(len(x) for y in a for x in y)
>>> padded = [[x + [0] * (D-len(x)) for x in y] for y in a]
>>> mask = [[[0] * len(x) + [1] * (D-len(x)) for x in y] for y in a]
>>> result = np.ma.masked_array(padded, np.array(mask, dtype=bool))
>>> result
masked_array(data =
 [[[1 2 3 -- --]
  [2 3 4 5 --]
  [2 2 -- -- --]]

 [[3 4 5 6 7]
  [1 -- -- -- --]
  [2 3 10 -- --]]],
             mask =
 [[[False False False  True  True]
  [False False False False  True]
  [False False  True  True  True]]

 [[False False False False False]
  [False  True  True  True  True]
  [False False False  True  True]]],
       fill_value = 999999)

>>> np.sum(result, axis=-1)
masked_array(data =
 [[6 14 4]
 [25 1 15]],
             mask =
 [[False False False]
 [False False False]],
       fill_value = 999999)

>>> 
YXD
  • Interesting, but this thing can really come in handy! At least that is the term in C# terminology. Thanks for the info. Yeah, padding is a memory-consuming option, true. However, see the update... – Cupitor Nov 03 '13 at 18:26
    If you can say a bit more about your goal (speed? simplifying some messy code?) and maybe the typical size of your data it'll help with what I and others recommend. Is it mainly that you want to supply a tuple as an index instead of doing `a[1][2][3]` etc? – YXD Nov 03 '13 at 18:40
  • Thank you, and sure. I have this array of arrays which will get updated at any time instance (by being multiplied by an array of its own shape, which as I mentioned earlier is possible). More specifically, my arrays are all `MxN` in the first two dimensions but they differ in the third dimension. Now what I need additionally is the sum over the third dimension, i.e. an `MxN` matrix whose element `i,j` is the sum of `np.array(A[i][j])`, if `A` is my array of depth three. If it needs more clarification please ask for it. – Cupitor Nov 03 '13 at 18:47
  • Obviously I don't want to do it with a for loop! – Cupitor Nov 03 '13 at 18:47
  • Unless your data is large enough to cause memory issues I would pad the last axis with zeros and use a masked array. You can [sum over an axis without looping](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.sum.html#numpy.ma.sum). – YXD Nov 03 '13 at 18:54
  • Actually I see some possibilities now that I've searched more! It seems that einsum might be able to do what I want; I don't know yet. The important point is that the summation I want is `A.B`, i.e. if `A` is the first array and `B` the second which will be multiplied with it, I want the element-wise dot product in the third dimension, not exactly the sum! (Summing `np.multiply(A,B)` over the third dimension is equivalent to a dot product in the third dimension.) – Cupitor Nov 03 '13 at 18:57
  • @jaime, do you have any possible suggestions? Thank you. – Cupitor Nov 03 '13 at 19:01
  • Mr. E, even that doesn't work. I mean np.ma.sum in this case. – Cupitor Nov 03 '13 at 19:08
  • I'm having trouble relating your original description (with update) to the last bit about 'element wise dot product in the third dimension'. Could you add some sample arrays, just the NxMx? ones. Don't try to make arrays of arrays. Lists of arrays, or separate names (`A0`,`A1`...) will be fine. – hpaulj Nov 04 '13 at 07:23

If I change your a and b so that numpy makes a 2d array instead of an array of arrays:

In [5]: am=np.asarray([np.asarray([1,2,3,0]),np.asarray([2,3,4,5])])
#array([[1, 2, 3, 0],
#       [2, 3, 4, 5]])
In [7]: bm=np.asarray([np.asarray([1,2,3,0]),np.asarray([2,3,4,5])])

and do timings:

In [10]: timeit np.multiply(a,b)
100000 loops, best of 3: 7.94 us per loop

In [11]: timeit np.multiply(am,bm)
100000 loops, best of 3: 1.89 us per loop

The pure ndarray multiplication is substantially faster. In one case it can jump directly into element-by-element multiplication (at the fast C code level); in the other it does general-purpose iteration, working with objects rather than simple numbers. It is doing something close to iterating in Python.

In fact, if I do that loop explicitly, I get something close to that longer time:

al,bl=a.tolist(), b.tolist()
In [21]: timeit np.array([np.multiply(x,y) for x,y in zip(al,bl)])
100000 loops, best of 3: 8.99 us per loop
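The dtype makes the difference visible. The same arrays as above, in current NumPy syntax (where the object dtype must now be spelled out for ragged input):

```python
import numpy as np

# Ragged -> object dtype; each element is a whole Python object.
a = np.asarray([np.array([1, 2, 3]), np.array([2, 3, 4, 5])], dtype=object)
# Rectangular -> numeric dtype; one contiguous buffer of machine numbers.
am = np.asarray([[1, 2, 3, 0], [2, 3, 4, 5]])

print(a.dtype)    # object
print(am.dtype)   # a native integer dtype such as int64
```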

Now let's look at your 'sum on the last dimension' problem. Notice first that sum (or add.reduce) has not been extended to work with this type of array.

In [37]: timeit am.sum(axis=1)
100000 loops, best of 3: 11.5 us per loop

In [38]: timeit [x.sum() for x in a]
10000 loops, best of 3: 21.5 us per loop

The speed advantage of the ndarray sum isn't as great. sum can be sped up by coding it as a dot product (with np.dot or einsum):

In [42]: timeit np.einsum('ij->i',am)
100000 loops, best of 3: 4.79 us per loop

In [50]: ones=np.array([1,1,1,1])
In [51]: timeit np.dot(am,ones)
100000 loops, best of 3: 2.37 us per loop

In [55]: timeit [np.einsum('j->',x) for x in a]
100000 loops, best of 3: 12.3 us per loop

In [64]: c=np.asarray([np.asarray([1,1,1]),np.asarray([1,1,1,1])])   
In [65]: timeit [np.dot(x,y) for x,y in zip(a,c)]
100000 loops, best of 3: 8.12 us per loop

So while it is possible to construct ragged arrays (or arrays of arrays), they don't have a substantial speed advantage over lists of arrays. The fast numpy array operations do not, in general, work with elements that are general-purpose Python objects (dtype=object).
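For the dot-product-along-the-last-axis variant raised in the comments: once the ragged axis is zero-padded, the padding zeros contribute nothing to a sum of products, so no mask is needed and einsum does the whole thing in one call. A sketch with made-up 2x2xragged data, zero-padded to a common length of 3:

```python
import numpy as np

# Hypothetical 2x2 grid of ragged vectors, already zero-padded on the last axis.
A = np.array([[[1, 2, 0], [3, 4, 5]],
              [[6, 0, 0], [7, 8, 0]]])
B = A.copy()

# Element-wise product summed over the last axis: a per-cell dot product.
# The padding zeros multiply to zero, so they do not affect the result.
dots = np.einsum('ijk,ijk->ij', A, B)
print(dots)
# [[  5  50]
#  [ 36 113]]
```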

hpaulj