
I wrote a tree object in Cython that has many nodes, each containing a single Unicode character. I wanted to test whether the character gets interned if I use Py_UNICODE or str as the variable type. I'm trying to test this by creating multiple instances of the node class and getting the memory address of the character for each, but somehow I end up with the same memory address, even when the instances contain different characters. Here is my code:

from libc.stdint cimport uintptr_t

cdef class Node():
    cdef:
        public str character
        public unsigned int count
        public Node lo, eq, hi

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character[0]

I am trying to compare the memory locations like so, from Python:

a = Node("a")
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())

But the memory addresses that get printed are all the same. What am I doing wrong?

1 Answer


Your code is not doing what you think it does.

self.character[0] doesn't return the address of (or a reference to) the first character, as it would for an array. It returns a Py_UCS4 value (i.e. an unsigned 32-bit integer), which is copied to a local, temporary variable on the stack.

In your function, <uintptr_t>&self.character[0] gets you the address of that local variable on the stack, which happens to be the same every time because the stack layout is the same whenever memory is called.

To make this clearer, compare it with a char * c_string, where &c_string[0] does give you the address of the first character in c_string.

Compare:

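%%cython
from libc.stdint cimport uintptr_t

# C string: indexing is a C operation, so & yields the address of each char
cdef char *c_string = "name"
def get_addresses_from_chars():
    for i in range(4):
        print(<uintptr_t>&c_string[i])

# Python str: indexing copies the code point into a temporary variable
cdef str py_string = "name"
def get_addresses_from_pystr():
    for i in range(4):
        print(<uintptr_t>&py_string[i])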

And now:

>>> get_addresses_from_chars() # works - a different address for each character
# ...7752
# ...7753
# ...7754
# ...7755
>>> get_addresses_from_pystr() # works differently - always the same address
# ...0672
# ...0672
# ...0672
# ...0672

You can see it this way: c_string[...] is a C-level (cdef) operation, whereas py_string[...] is a Python operation and thus, by construction, cannot return an address.

To influence the stack-layout, you could use a recursive function:

def memory(self, level):
    if level==0 :
        return <uintptr_t>&self.character[0]
    else:
        return self.memory(level-1)

Now calling a.memory(0), a.memory(1) and so on gives you different addresses, because depending on the recursion depth, the local variable whose address is returned sits at a different place on the stack. (This holds unless tail-call optimization kicks in. I don't believe it will, but you could disable optimization with -O0 just to be sure.)


To see whether Unicode objects are interned, it is enough to use id, which yields the address of the object (a CPython implementation detail), so you don't need Cython at all:

>>> id(a.character) == id(a2.character)
# True
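As a minimal pure-Python sketch of the same check (the helper same_object and the variables first and second are hypothetical names, not part of the question's code), you can build two equal strings at run time and compare their identities. Whether the last comparison yields True depends on the interpreter's interning behaviour, so no result is claimed for it here:

```python
def same_object(x, y):
    """Return True if x and y are the very same object (identical id)."""
    return id(x) == id(y)

# Build the strings at run time so the compiler cannot merge equal literals.
first = "".join(["n", "a", "m", "e"])
second = "".join(["n", "a", "m", "e"])

print(first == second)             # the values are equal
print(same_object(first, first))   # an object is always identical to itself
print(same_object(first, second))  # implementation detail: depends on interning
```

Comparing ids is only meaningful while both objects are alive, which is the case here because both names still reference them.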

or in Cython, doing the same thing id does (a little bit faster):

%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject
...
    def memory(self):
        # cast from object to PyObject*, so that the address can be used
        return <uintptr_t>(<PyObject*>self.character)

You need to cast the object to PyObject * so that Cython allows you to take the address of the object.

And now:

 >>> ...
 >>> print(a.memory(), a2.memory(), b.memory())
 # ...5800 ...5800 ...5000

If you want to get the address of the first code point in the Unicode object (which is not the same as the address of the string object), you can use <Py_UNICODE *>self.character, which Cython will replace with a call to PyUnicode_AsUnicode, e.g.:

%%cython
...   
def memory(self):
    return <uintptr_t>(<Py_UNICODE*>self.character), id(self.character)

and now:

>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# (...768, ...800) (...768, ...800) (...144, ...000)

i.e. "a" is interned and has a different address than "b", and the code-point buffer has a different address than the object containing it (as one would expect).

ead
  • Upvoting for the tail-call-optimization I did not know in Cython ! – BlueSheepToken May 17 '19 at 09:20
  • @BlueSheepToken the tail-call-optimization could be performed by the C compiler (e.g. gcc) while compiling the generated C code, but most of the time it doesn't happen, because the code created by Cython is not as straightforward as possible. – ead May 17 '19 at 09:27
  • Thank you for your answer! I have some follow-up questions: If I change str to Py_UNICODE, then somehow memory() returns the ordinal value of the character rather than the memory address? Does the above mean that if I use str like this that the character will be interned? Does casting not create a new python str which will be interned? Because if I do replace Py_UNICODE with str, it seems that the memory consumption actually increases. And if I use longer strings [a = Node("aaa"); a2 = Node("aaa")], they still have the same address. :| – Christian Adam May 17 '19 at 09:29
  • @ChristianAdam I have tried to make my answer clearer. But I don't understand what you are talking about: 1) str **is** unicode in Python3. 2) What is a character? strings are interned not characters. 3) Whether a string is interned or not depends on a lot of things and is an implementation detail, which changes from version to version (e.g. https://stackoverflow.com/q/55643933/5769463) - I would not depend on it. – ead May 17 '19 at 10:13
  • @ead Sorry if I wasn't clear. 1) I understand that strings are unicode objects. What I mean is that I can change the variable type declaration in the Node class from "cdef public str character" to "cdef public **Py_UNICODE** character". If I do that, a) memory() returns the ordinal value of the char instead of the memory address and b) the memory consumption decreases. 2) How can I ensure that what is stored in each Node.character is interned? I know I will only have single characters there, so how could I intern these and just store pointers - that should reduce memory considerably, no? – Christian Adam May 17 '19 at 10:36
  • @ChristianAdam When you use `cdef public Py_UNICODE * character` you will get the address of the first character via `&character[0]`, but I'm afraid Cython will not manage the memory for you - you must take care of it yourself. You can use `sys.intern(string)` https://docs.python.org/3/library/sys.html#sys.intern or its equivalent in the C-API to ensure that a string is interned for sure. However, comments are not the right place for such a discussion - please ask a new question if something is unclear. – ead May 17 '19 at 10:57
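
The sys.intern suggestion from the last comment can be sketched in plain Python as follows. This is only an illustration under the assumption that the strings are created at run time; intern_text, first and second are hypothetical names:

```python
import sys

def intern_text(text):
    """Return the canonical interned copy of text via sys.intern."""
    return sys.intern(text)

# Two equal strings built separately at run time.
first = intern_text("".join(["a", "a", "a"]))
second = intern_text("".join(["a", "a", "a"]))

# After interning, equal strings are represented by one shared object.
print(first is second)
```

Storing the value returned by sys.intern in each node, rather than the original string, means equal strings share one object instead of each node holding its own copy.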