Context: I built a tree data structure in Cython that stores a single character in each node. Now I'm wondering whether I can save memory by interning all those characters, and whether I should use Py_UNICODE or a regular str as the variable type. This is my stripped-down Node object, using Py_UNICODE:
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public Py_UNICODE character

    def __init__(self, Py_UNICODE character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character
I first tried to see whether the characters are interned automatically. If I import that class in Python and create multiple objects with the same or different characters, these are the results I get:
a = Node("a")
a_py = a.character
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())
# 140532544296704 140532548558776 140532544296488
print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504
From that I would conclude that Py_UNICODE is not interned automatically, and that calling id() in Python does not give me the address of the field itself but that of a copy (I suppose Python interns single Unicode characters automatically and simply returns the address of that object to me).
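To check that guess in isolation, here is a small pure-Python sketch. It assumes CPython, where single characters in the Latin-1 range appear to be served from a shared cache (an implementation detail, not a language guarantee). The helper `to_py_str` is hypothetical and only stands in for the C-to-Python conversion that happens when the `Py_UNICODE` attribute is read from Python.

```python
def to_py_str(code_point):
    # Hypothetical stand-in for converting a C character value
    # into a Python str object on attribute access.
    return chr(code_point)

# Two independent conversions of the same Latin-1 code point.
first = to_py_str(ord("a"))
second = to_py_str(ord("a"))

# Assumption: CPython hands back one shared object for such characters.
latin1_shared = first is second
print("Latin-1 character shared:", latin1_shared)

# For a code point outside Latin-1 the result may differ, so it is
# only printed here rather than relied upon.
x = to_py_str(0x4E2D)
y = to_py_str(0x4E2D)
print("Non-Latin-1 character shared:", x is y)
```

If this holds, the identical `id()` values for `a.character` and `a2.character` would say something about the temporary Python objects, not about the C field stored in each Node.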
Next I tried doing the same thing when using str instead. Simply replacing Py_UNICODE with str didn't work, so this is how I'm trying to do it now:
%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public str character

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>(<PyObject*>self.character)
And these are the results I get with that:
...
print(a.memory(), a2.memory(), b.memory())
# 140532923573504 140532923573504 140532923840528
print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504
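One possible reading of these matching numbers, assuming CPython: `id()` returns the address of the object itself, so casting the `str` attribute to `PyObject*` would yield the same number for any object, interned or not. The sketch below uses `ctypes` to turn an `id()` value back into the object it came from; the variable names are illustrative only.

```python
import ctypes

# Build the string at runtime so it is an ordinary heap object.
text = "".join(["long", "er", " string"])

# Assumption: CPython, where id() is the address of the object.
address = id(text)

# Reinterpret that address as a PyObject* and get the object back.
recovered = ctypes.cast(address, ctypes.py_object).value

print(recovered is text)
```

Under that assumption, equal values from `id()` and `.memory()` only show that both look at the same object, which would be true for strings of any length.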
Based on that, I first thought that single-character str objects are interned in Cython as well, and that Cython simply doesn't need to copy the characters from Python, which would explain why id() and .memory() give the same address. But then I tried longer strings and got the same results, and I probably don't want to conclude from that that longer strings are also interned automatically? My tree also uses less memory with Py_UNICODE, which doesn't make much sense if str were interned but Py_UNICODE isn't. Can someone explain this behaviour to me? And how would I go about interning?
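For reference, explicit interning on the Python side is available through `sys.intern`. The following is a minimal sketch of what it does, not a recommendation for the tree itself; the `build` helper is hypothetical and only exists to create equal strings at runtime rather than as compile-time constants.

```python
import sys

def build(parts):
    # Hypothetical helper: join at runtime so the result is a fresh object.
    return "".join(parts)

s1 = build(["tree", "_", "node"])
s2 = build(["tree", "_", "node"])

# Equal contents, but not necessarily the same object.
print("equal:", s1 == s2, "same object before interning:", s1 is s2)

# sys.intern returns the canonical object for a given string value.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
same_after = i1 is i2
print("same object after interning:", same_after)
```

In a `str`-based Node, the analogous step would presumably be to store `sys.intern(character)` in `__init__`, so that nodes with equal characters share one string object.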
(I'm testing this in Jupyter, in case that makes a difference)
Edit: Removed an unnecessary id comparison of nodes instead of characters.