6

In Python (3.7 and above) I would like to obtain a reference to a dict key. More precisely, let d be a dict where the keys are strings. In the following code, the value of k is potentially stored at two distinct locations in memory (one pointed to by the dict and one pointed to by k), whereas the value of v is stored at only one location (the one pointed to by the dict).

# d is a dict
# k is a string dynamically constructed, in particular not from iterating over d's keys
if k in d:
    v = d[k]
    # Now store k and v in other data structures

In my case, the dict is very large and the string keys are very long. To keep memory usage down I would like to replace k with a pointer to the corresponding string used by d before storing k in other data structures. Is there a straightforward way of doing this, that is using the keys of the dict as a string pool?

(Footnote: this may seem as premature optimisation, and perhaps it is, but being an old-school C programmer I sleep better at night doing "memory tricks". Joke aside, I do genuinely would like to know the answer out of curiosity, and I am indeed going to run my code on a Raspberry Pi and will probably face memory issues.)

DustByte
  • 651
  • 6
  • 16
  • https://stackoverflow.com/questions/15541404/python-string-interning or https://stackoverflow.com/questions/tagged/string-interning explains about python string memory handling & the mechanisms behind it - If I were you I would just not care until its a proofen problem. – Patrick Artner Aug 02 '20 at 11:40
  • 4
    I feel like this is a (very) premature optimization. Have you actually ran into memory issues? if not, don't worry about it and let Python manage its memory. – DeepSpace Aug 02 '20 at 11:41
  • Anyway, `v = d[k]` doesn't allocate more memory. It just creates a reference called `v` to whatever `d[k]` points to – DeepSpace Aug 02 '20 at 11:42
  • Why do you think that a copy of an existing object would be made when you decide to use it as a key in some dict? – Thierry Lathuille Aug 02 '20 at 11:48
  • 1
    How would you get into that situation? Where would the two string objects with the same value come from? – superb rain Aug 02 '20 at 12:13
  • @DeepSpace They didn't say `v = d[k]` allocates more memory and they already said that `v` is only stored once. They're asking about `k` (and its equal string in the dict). – superb rain Aug 02 '20 at 12:18
  • 3
    If you have such large strings as keys to `d`, could you also store `hash(key)`? Then when `k in d` is `True`, store `d2[hash(k)] = v`? – quamrana Aug 02 '20 at 12:32
  • @quamrana Might cause collisions, though. – superb rain Aug 02 '20 at 12:55
  • I have accepted the answer below as it provides a direct solution to my problem. However, in practice, I will abandon my "premature optimisation" and incorporate hashes and using a different approach. I do appreciate the comments above, such as "not to worry about it" by @DeepSpace – DustByte Aug 03 '20 at 08:15
  • @superbrain I got into this situation because the dict is a cache, mapping strings to certain properties. While iterating over new strings (constructed dynamically) I will create a new, fresh cache (dict). Many of the key-value pairs from the old dict will make it to the new dict, hence I found it a waste allocating new space for strings already present in memory. – DustByte Aug 03 '20 at 08:21

1 Answers1

7

Where does the key k come from? Is it dynamically constructed by something like str.join, + , slicing another string, bytes.decode etc? Is it read from a file or input()? Did you get it from iterating over d at some point? Or does it originate from a literal somewhere in your source code?

In the last two cases, you don't need to worry about it since it is going to be a single instance anyway.

If not, you could use sys.intern to intern your keys. If a == b then sys.intern(a) is sys.intern(b).

Another possible solution, in case you might want to garbage collect the strings at some point or you want to intern some non-string values, like tuples of strings, you could do the following:

# create this dictionary once after `d` has all the right keys
canonical_keys = {key: key for key in d}

k = canonical_keys.get(k, k) # use the same instance if possible

I recommend reading up on Python's data model.

Jasmijn
  • 9,370
  • 2
  • 29
  • 43