1

So recently I've been working on a project where, as an optimisation, I want to use numpy arrays instead of the built-in Python list. It would be a 2d array with fixed length on both axes. I also want to maximise cache use so that the code is as fast as it can be. However, when playing with the id(var) function I got unexpected results:

code:

 import numpy

 a = numpy.ascontiguousarray([1,2,3,4,5,6,7,8,9], dtype=numpy.int32)
 for var in a:
     print(hex(id(var)))

returned:

0x1aaba10d8f0
0x1aaba1f33d0
0x1aaba10d8f0
0x1aaba1f33d0
0x1aaba10d8f0
0x1aaba1f33d0
0x1aaba10d8f0
0x1aaba1f33d0
0x1aaba10d8f0

which to me is super weird because that would mean two variables are located in the same memory block (is that even a thing?). Anyway - is it me not understanding it correctly?

As a side question - can the original task of building a 2d array be achieved with a less expensive method? Numpy arrays come with many functions I do not need. I only need 2 things:

  1. to be able to reverse it, normally done with the [::-1] syntax
  2. to check if one == other efficiently

Thanks in advance for all the help :-)

Krzychu
  • 83
  • 7
  • Unfortunately you skipped the basic reading of `numpy`, such as how it is stored. Your `a` does not store python objects, so `id(var)` is meaningless. You haven't described how you construct your 2d array (or list), so we can't help you with that. But it sure sounds like `ndarray` is not the right tool for you. It isn't a drop in replacement for lists. – hpaulj Apr 22 '21 at 19:42
  • The data buffer for `a` is a 36 byte `c` array (9*4). You don't access it directly. Speed comes from using compiled numpy methods to iterate or otherwise manipulate it. It does not make much sense to talk about just using a few numpy functions. Iterating as you do is actually slower on an array. – hpaulj Apr 22 '21 at 20:19
  • An easy way of making a 2d array is `np.arange(12).reshape(3,4)`, or `np.zeros((3,4))` – hpaulj Apr 22 '21 at 22:55

2 Answers

1

id(var) does not work the way you think it does. Indeed, id(var) returns a unique ID for the specified object var, but var is not a cell of a. var is a Python object referencing a cell of a. Note that a does not contain such objects, as that would be too inefficient (and the data would not be contiguous, as requested). The reason why you see duplicated IDs is that the previous var object has been recycled.
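A small way to see this in action (a sketch; the recycling is CPython-specific and the exact IDs vary per run):

```python
import numpy as np

a = np.ascontiguousarray([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int32)

# Indexing the same cell twice produces two distinct wrapper objects:
# each access "boxes" the raw 4-byte value into a new np.int32 scalar.
first = a[0]
second = a[0]
print(first is second)             # False: two separate Python objects
print(type(first))                 # <class 'numpy.int32'>

# Keeping every wrapper alive prevents its ID from being recycled,
# so the alternating-ID pattern from the question disappears:
kept = [var for var in a]          # all 9 wrapper objects stay alive
print(len({id(v) for v in kept}))  # 9 distinct IDs
```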

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • Thx man, I did further research and indeed it is as u said. Is there a way I can make sure that an element of the array is contiguous? Other than the documentation of course. – Krzychu Apr 22 '21 at 21:06
  • You can try the approach provided in [this post](https://stackoverflow.com/questions/51304154) or look at the information stored in `a.__array_interface__`. But note that Numpy arrays should always be contiguous. However, Numpy *views* are not always contiguous. – Jérôme Richard Apr 22 '21 at 21:18
  • OP, don't worry about whether arrays are contiguous, or about caching. At least not until you learn to use numpy arrays for basic stuff. – hpaulj Apr 22 '21 at 22:53
0

It is unclear what kinds of arrays you really want, or for what purpose. But the talk of contiguous (or continuous) storage and caching suggests that you aren't clear about how Python works.

First, Python is object oriented, all the way down. Integers, strings, lists are all objects of some class, with associated methods, and attributes. For builtin classes we have little say about the storage.

Let's make a small list:

In [89]: alist = [1,2,3,1000,1001,1000,'foobar']
In [90]: alist
Out[90]: [1, 2, 3, 1000, 1001, 1000, 'foobar']

A list has a data buffer that stores references (pointers, if you will) to objects elsewhere in memory. The id may give some idea of where, but it shouldn't be understood as a 'pointer' in the C language sense.

For this list:

In [91]: [id(i) for i in alist]
Out[91]: 
[9784896,
 9784928,
 9784960,
 140300786887792,
 140300786888080,
 140300786887792,
 140300786115632]

1, 2, 3 have small id values because Python initializes small integers (up to 256) at startup. So all uses will share that unique id.

In [92]: id(2)
Out[92]: 9784928
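This caching is easy to check directly (a sketch; the cutoff at 256 is a CPython implementation detail, not something the language guarantees):

```python
a = 256
b = int('256')   # built at runtime, bypassing compile-time constant sharing
print(a is b)    # True: both refer to the cached small-int object

c = 257
d = int('257')   # outside the cached range: a fresh object each time
print(c is d)    # False on CPython
```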

Within the list creation 1000 appears to be unique, but not so outside of that context.

In [93]: id(1001)
Out[93]: 140300786888592

Looks like the string is cached as well - but that's just the interpreter's choice, and we shouldn't count on it.

In [94]: id('foobar')
Out[94]: 140300786115632

The reverse list is a new list, with its own pointer array. But the references are same:

In [95]: rlist = alist[::-1]
In [96]: rlist
Out[96]: ['foobar', 1000, 1001, 1000, 3, 2, 1]
In [97]: rlist[5],id(rlist[5])
Out[97]: (2, 9784928)

Indexing actions like [::-1] should just depend on the number of items in the list; they don't depend on what the values actually point to. Same for other copies. Even appending to the list takes relatively constant time (it maintains growth space in the data buffer). Actually working with the objects in the list may depend on where they are stored in memory, but we have little say about that.

A "2d" list is actually a list with list elements; nested lists. The sublists are stored elsewhere in memory, just like strings and numbers. In that sense the nested lists are not contiguous.
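A small illustration of that point (IDs will vary per run):

```python
# The outer list holds references; each sublist is its own object
# living at an unrelated heap address.
grid = [[1, 2, 3], [4, 5, 6]]
print(id(grid[0]), id(grid[1]))    # two independent addresses

# Reversing copies only the row references, not the rows themselves:
rgrid = grid[::-1]
print(rgrid[0] is grid[1])         # True: the very same sublist object
```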

So what about arrays?

In [101]: x = np.arange(12)
In [102]: x
Out[102]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
In [104]: x.__array_interface__
Out[104]: 
{'data': (57148880, False),
 'strides': None,             # default (8,)
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (12,),
 'version': 3}
In [105]: x.nbytes     # 12*8 bytes
Out[105]: 96

x is an ndarray object, with attributes like shape, strides and dtype - and a data buffer. In this case that buffer is a c array 96 bytes long, at address 57148880. We can't use that number directly, but I find it useful when comparing the `__array_interface__` dict across arrays. A *view* in particular will have the same, or a related, value.

In [106]: x.reshape(3,4)
Out[106]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [107]: x.reshape(3,4).__array_interface__['data']
Out[107]: (57148880, False)
In [108]: x.reshape(3,4)[1,:].__array_interface__['data']
Out[108]: (57148912, False)     # 32 bytes later

The array data buffer holds actual values, not references. Here, with int dtype, each 8 bytes is interpreted as an 'int64' value.
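You can inspect those raw bytes yourself (a sketch; the dtype is spelled out explicitly since the default integer size varies by platform):

```python
import numpy as np

x = np.arange(12, dtype=np.int64)

# The buffer is just bytes: 12 values * 8 bytes each = 96 bytes,
# with no Python objects anywhere inside it.
raw = x.tobytes()
print(len(raw))    # 96

# Reinterpreting the same bytes recovers the values:
y = np.frombuffer(raw, dtype=np.int64)
print(y)           # [ 0  1  2 ... 11]
```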

Your id iteration effectively builds a list, [x[i] for i in range(n)]. An element of an array has to be "unboxed", and is a new object of type np.int64. While not an array, it has a lot of properties in common with a 1-element array.

In [110]: x[4].__array_interface__
Out[110]: 
{'data': (57106480, False),
 ...
 'shape': (),....}

That data value is unrelated to x's.

As long as you use numpy methods on existing arrays, speeds are good - often 10x better than equivalent list methods. But if you start with a list, it takes time to make an array. And treating the array like a list is slow.
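A rough timing sketch of that last point (absolute times depend on the machine; only the ratio matters):

```python
import timeit
import numpy as np

arr = np.arange(100_000)

# Whole-array method: runs in compiled code.
t_np = timeit.timeit(lambda: arr.sum(), number=10)

# Treating the array like a list: every element is boxed on the way out.
t_loop = timeit.timeit(lambda: sum(v for v in arr), number=10)

print(f"arr.sum(): {t_np:.5f}s, Python loop over arr: {t_loop:.5f}s")
```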

And the reverse of x?

In [111]: x[::-1].__array_interface__
Out[111]: 
{'data': (57148968, False),
 'strides': (-8,),
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (12,),
 'version': 3}

It's a new array object, but with different strides (-8,), and its data pointer is at the end of the buffer: 57148880 + 96 - 8 = 57148968.
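You can confirm that the reverse is a view, and handle the question's equality requirement, with standard numpy calls (a sketch):

```python
import numpy as np

x = np.arange(12)
r = x[::-1]

# The reverse shares x's buffer rather than copying it:
print(r.base is x)               # True: r borrows x's data
print(np.shares_memory(x, r))    # True

# For "one == other" on whole arrays, compare element-wise in C:
print(np.array_equal(x, r[::-1]))   # True: reversing twice recovers x
```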

hpaulj
  • 221,503
  • 14
  • 230
  • 353