
I just want to double-check if I am trying to do the impossible.

Question on nditer()

I have an np.array of lists like this:

myArray = np.array(['A','B'],['A','C'],['B','C'])

This yields

array([['A','B'],
       ['A','C'],
       ['B','C']],dtype='<U7')

I want to iterate over it with nditer(), because I will have many more lists than in the example above, so I need the speed of the compiled numpy code.

Unfortunately nditer accesses the elements in the lists and not the lists themselves. I have tried a few flags and op_dtypes, but it just does not work out. So the question is: is it possible to access the lists with nditer rather than with a for-loop? I hope I am not trying to do the impossible here, but when googling, the keywords iterate, list, nditer and numpy lead to iterating over lists, not to lists as the elements being iterated.

with nditer(myArray,flags=['tried a view'],op_dtypes=list) as comb:
    for i in comb:
        print(i)

This yields

A
B
A
C
B
C

But I need

['A','B']
['A','C']
['B','C']
user3386109
  • 2
    If you have two different questions, post them as two different questions with corresponding relevant tags. – Eugene Sh. Aug 30 '22 at 17:56
  • 1
    This is not an "array of lists", that is an array of unicode strings, *there are no lists involved here* – juanpa.arrivillaga Aug 30 '22 at 18:02
  • 1
    Note, it really doesn't make sense to use numpy here to begin with. – juanpa.arrivillaga Aug 30 '22 at 18:03
  • @EugeneSh. ok, I did that. Thought it would give context on my need to use numpy – TheArmbreaker Aug 30 '22 at 18:08
  • @juanpa.arrivillaga Doesn't myArray have the shape (3,2)? – TheArmbreaker Aug 30 '22 at 18:08
  • @juanpa.arrivillaga when I request i for i in myArray in a for-loop, I get the list. – TheArmbreaker Aug 30 '22 at 18:10
  • 1
    My first reaction on seeing a `nditer` question is - DON'T. The `nditer` docs need stronger disclaimers. It does not help with performance, so it is seldom worth the effort required to get it working. – hpaulj Aug 30 '22 at 18:11
  • 2
    Why are you using `nditer` at all? `for i in myArray:` gives exactly the output you want. – Tim Roberts Aug 30 '22 at 18:12
  • Yes, it does. And no, you do not get a list, iterating directly over a numpy array iterates over the first dimension, so if you have an array of shape (x,y,z), then it would give you x arrays of shape (y, z) – juanpa.arrivillaga Aug 30 '22 at 18:12
  • `myArray` line is wrong! – hpaulj Aug 30 '22 at 18:13
  • In any case, as noted above, why do you want to use nditer anyway? The problem you actually seem to want to solve is to get all the combinations. numpy isn't going to help here. Note, `itertools.combinations` is *already implemented in C* – juanpa.arrivillaga Aug 30 '22 at 18:13
  • @TimRoberts in my project I would have 8.361.453.672 possible combinations. And I understood that the C code behind numpy is much faster than Python's standard loops. – TheArmbreaker Aug 30 '22 at 18:18
  • @juanpa.arrivillaga Thank you, but I understood it would be slower in terms of memory. [link](https://codereview.stackexchange.com/questions/38287/fastest-way-for-working-with-itertools-combinations) – TheArmbreaker Aug 30 '22 at 18:21
  • @TheArmbreaker "slower in terms of memory" doesn't really make sense. But no, itertools.combinations will not be slow. The problem you are running in to is fundamentally one of combinatorial explosion. What exactly do you mean by "crashing"? In any case, `numpy` is not going to fix this issue – juanpa.arrivillaga Aug 30 '22 at 18:23
  • @juanpa.arrivillaga I can execute the code, but when I do not break the iteration after 20 iterations the kernel crashes after about 3 minutes. I will replicate the error message and post it any minute. – TheArmbreaker Aug 30 '22 at 18:39
  • According to [This StackOverflow](https://stackoverflow.com/questions/52570192/why-does-a-large-for-loop-with-10-billion-iterations-take-a-much-longer-time-to) it would be a valid approach to generate the Array of Lists in C, because it's much faster. – TheArmbreaker Aug 30 '22 at 18:40
  • Again, **there is no array of lists** – juanpa.arrivillaga Aug 30 '22 at 18:42
  • Yes, sorry. It's not an array of lists. It is rows in a 2d array. However, isn't the C approach in the link way more appropriate? – TheArmbreaker Aug 30 '22 at 18:51
  • Have you done the math on this? It seems highly unlikely that iteration is going to be your bottleneck. If you do even 100ms of processing on each row, it will take 20 years to run through 8 billion combinations. – Tim Roberts Aug 30 '22 at 20:10
  • @TimRoberts I am aware of the time complexity and addressed that before I was asked to open another post on the combination topic. The link I copied in the comments says that C is much faster at this. My goal is to understand the issue, calculate a few randomly picked examples to show that the code works, and recommend doing the full calculation in C, in case it is true that C is faster. – TheArmbreaker Aug 30 '22 at 20:36
  • You need to be realistic. If the iteration part takes 5% of your time. then even if you could make it run infinitely fast, your runtime would only improve by 5%. (This is "Amdahl's Law".) But if you are doing any processing at all on the individual items, then the iteration will be way less than 5%. You are guilty of premature optimization. First, find out what is slow, then make THAT faster. – Tim Roberts Aug 30 '22 at 21:57
  • @TimRoberts Thanks. But isn't knowing that it works but is "20years slow" not exactly what I am trying to do? It is working slow (20years) and I want to find out if its faster in another language than python. – TheArmbreaker Aug 30 '22 at 22:11
  • You MUST know what is taking the time. You can't just say "Python is faster than C". If the Python part of the processing isn't the time sink, then Python is not causing your trouble. If each loop is 100ms, and the numpy processing in the loop is using 95ms, then THAT'S what you need to optimize. (Of course, even if the loop is 5ms per, it's still a year.) – Tim Roberts Aug 30 '22 at 22:17
  • @TimRoberts Thank you, I will research on that. Anyway, I need to find a solution for that issue. So from this question I have learned that numpy arrays won't solve my problem - unfortunately. – TheArmbreaker Aug 30 '22 at 22:29

1 Answer


Your first line is wrong:

In [172]: myArray = np.array(['A','B'],['A','C'],['B','C'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [172], in <cell line: 1>()
----> 1 myArray = np.array(['A','B'],['A','C'],['B','C'])

TypeError: array() takes from 1 to 2 positional arguments but 3 were given

With the correct arguments:

In [173]: myArray = np.array((['A','B'],['A','C'],['B','C']))

In [174]: for row in myArray: print(row)
['A' 'B']
['A' 'C']
['B' 'C']

In [175]: myArray.shape, myArray.dtype
Out[175]: ((3, 2), dtype('<U1'))

This is NOT an array of lists. It is a 2d array of strings. Simple iteration is enough to get the rows. In the above iteration row is a 1d array of strings, not a list.

In [176]: type(row)
Out[176]: numpy.ndarray

nditer iterates on the elements of an array, regardless of the dimensions. Trying to do the equivalent of a simple iteration may be possible, but it's tricky and not worth the effort.

In [179]: with np.nditer(myArray) as comb:
     ...:     for i in comb:
     ...:         print(i, type(i), i.shape)
     ...:         
A <class 'numpy.ndarray'> ()
B <class 'numpy.ndarray'> ()
A <class 'numpy.ndarray'> ()
C <class 'numpy.ndarray'> ()
B <class 'numpy.ndarray'> ()
C <class 'numpy.ndarray'> ()

The items of the nditer iteration are 0d arrays.

The nditer docs mention the use of the multi_index flag if you want to track the multidimensional indices of the iteration.
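
As a minimal sketch of that flag (the names my_array and seen are only for illustration), the iterator still visits single elements, but it.multi_index reports the (row, column) position of each one:

```python
import numpy as np

my_array = np.array([['A', 'B'], ['A', 'C'], ['B', 'C']])

# With the multi_index flag the iterator exposes the position of the
# element it is currently visiting; the items are still 0d arrays.
seen = []
with np.nditer(my_array, flags=['multi_index']) as it:
    for x in it:
        seen.append((it.multi_index, x.item()))
        print(it.multi_index, x)
```

This tracks where each element sits, but it does not turn the iteration into a row-wise one.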

If you really want nditer to iterate on "rows" you need to make a 1d object dtype array - an array of arrays:

In [189]: arr = np.empty(3,object); arr[:] = list(myArray)
In [190]: arr
Out[190]: 
array([array(['A', 'B'], dtype='<U1'), array(['A', 'C'], dtype='<U1'),
       array(['B', 'C'], dtype='<U1')], dtype=object)

or array of lists:

In [191]: arr = np.empty(3,object); arr[:] = myArray.tolist()
In [192]: arr
Out[192]: array([list(['A', 'B']), list(['A', 'C']), list(['B', 'C'])], dtype=object)

And you still need to supply a "REFS_OK" flag. But even here, nditer has no benefit compared to a simple loop iteration (or better yet, no-loop numpy operations).

In short nditer is not a substitute for loops.
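
To check this on your own machine, here is a rough timing sketch. The task (counting rows whose first entry is 'A'), the array size, and the function names are made up for illustration, and the timings will vary, so no numbers are claimed here:

```python
import timeit
import numpy as np

# Hypothetical test data: the three example rows repeated many times.
my_array = np.array([['A', 'B'], ['A', 'C'], ['B', 'C']] * 10_000)

def rows_via_array(arr):
    # Plain Python loop; each row is a 1d ndarray.
    return sum(1 for row in arr if row[0] == 'A')

def rows_via_tolist(arr):
    # Convert once in compiled code, then loop over plain Python lists.
    return sum(1 for row in arr.tolist() if row[0] == 'A')

def rows_vectorized(arr):
    # No Python-level loop: compare the whole first column at once.
    return int(np.count_nonzero(arr[:, 0] == 'A'))

for fn in (rows_via_array, rows_via_tolist, rows_vectorized):
    seconds = timeit.timeit(lambda: fn(my_array), number=5)
    print(f"{fn.__name__}: count={fn(my_array)}, {seconds:.4f}s for 5 runs")
```

All three return the same count; only the amount of work done in the Python interpreter differs.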

edit

With an array of lists:

In [202]: arr = np.empty(3,object); arr[:] = myArray.tolist()

In [203]: arr
Out[203]: array([list(['A', 'B']), list(['A', 'C']), list(['B', 'C'])], dtype=object)

In [204]: with np.nditer(arr, ['refs_ok']) as comb:
     ...:     for i in comb:
     ...:         print(i, type(i), i.shape)
     ...:         
['A', 'B'] <class 'numpy.ndarray'> ()
['A', 'C'] <class 'numpy.ndarray'> ()
['B', 'C'] <class 'numpy.ndarray'> ()

Here each iteration i is a 0d array containing a list object.

With an array of arrays (not 2d), the iteration object is still 0d, but it contains arrays (note the missing commas).

In [207]: arr
Out[207]: 
array([array(['A', 'B'], dtype='<U1'), array(['A', 'C'], dtype='<U1'),
       array(['B', 'C'], dtype='<U1')], dtype=object)

In [208]: with np.nditer(arr, ['refs_ok']) as comb:
     ...:     for i in comb:
     ...:         print(i, type(i), i.shape,i.dtype, i.item())
     ...:         
['A' 'B'] <class 'numpy.ndarray'> () object ['A' 'B']
['A' 'C'] <class 'numpy.ndarray'> () object ['A' 'C']
['B' 'C'] <class 'numpy.ndarray'> () object ['B' 'C']

Taking a hint from https://numpy.org/doc/stable/reference/arrays.nditer.html#buffering-the-array-elements, I can do:

In [213]: with np.nditer(myArray.T.copy(), ['external_loop'], order='F') as comb:
     ...:     for i in comb:
     ...:         print(i, type(i), i.shape)
     ...:         
['A' 'B'] <class 'numpy.ndarray'> (2,)
['A' 'C'] <class 'numpy.ndarray'> (2,)
['B' 'C'] <class 'numpy.ndarray'> (2,)
hpaulj
  • Thank you very much :) The Error is a typo from breaking down my use case to an example. Actually I do np.asarray(myList). I will research the simple iteration over rows. But won't I have trouble with iterating over 8.361.453.672 rows with a for loop? – TheArmbreaker Aug 30 '22 at 18:35
  • Yes, iterating over that many rows will be slow, but `nditer` won't make it any better. Iterating on a list of lists is better than iterating on an array. To get speed in `numpy` you have to use the compiled numpy methods. They still iterate, but in compiled code. And generally they work best on numeric dtype arrays, not strings as in your case. And certainly not when doing a `print`-like IO operation for each row. I think you have a lot more basic `numpy` reading ahead of you. There isn't a quick drop-in way of using `numpy` effectively. – hpaulj Aug 30 '22 at 19:19
  • `for i in comb:` is still a Python for loop. It's iterating over a `nditer` object, rather than the original array, but that doesn't make the iteration any faster. `for row in myArray.tolist():...` would be something of an improvement. `tolist()` is a compiled method that efficiently converts an array into a pure Python list. – hpaulj Aug 30 '22 at 19:24
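
Following up on the `itertools.combinations` suggestion in the comments, here is a small sketch of how a handful of combinations could be inspected without building the whole array first. It assumes the rows are pairs drawn from a list of labels; the labels list and the pair size of 2 are hypothetical stand-ins for the real data:

```python
from itertools import combinations, islice
from math import comb

labels = ['A', 'B', 'C']  # hypothetical stand-in for the real items

# Number of pairs, computed without generating any of them.
total = comb(len(labels), 2)

# combinations() is lazy, so islice can take the first few pairs
# without materializing the rest.
first_two = list(islice(combinations(labels, 2), 2))

print(total, first_two)
```

Because the generator is lazy, memory use stays small regardless of how many combinations exist in total; the time needed to walk through all of them is a separate matter.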