
This is a follow-up to my previous question regarding string interning in Python, though I think it is different enough to qualify as a separate question. In short, when using sys.intern, do I need to pass the string to the function on most or every use, or do I only have to intern the string once and keep track of the returned reference? To clarify, here is a pseudocode-style use case in which I do what I think is correct (see the comments):

import sys

# stores all words in sequence,
# we want duplicate words too,
# but those should refer to the same string
# (the reason we want interning)
word_sequence = []
# simple word count dictionary
word_dictionary = {}
for line in text:
    for word in line: # using magic unspecified parsing/tokenizing logic
        # returns a canonical "reference"
        word_i = sys.intern(word)
        word_sequence.append(word_i)
        try:
            # do not need to intern again for
            # specific use as dictionary key,
            # or is something undesirable done
            # by the dictionary that would require 
            # another call here?
            word_dictionary[word_i] += 1 
        except KeyError:
            word_dictionary[word_i] = 1

# ...somewhere else in a function far away...
# Let's say that we want to use the word sequence list to
# access the dictionary (even the duplicates):
for word in word_sequence:
    # Do NOT need to re-sys.intern() word
    # because it is the same string object
    # interned previously?
    count = word_dictionary[word]
    print(count)

What if I want to access words in a different dictionary? Do I need to call sys.intern() again when inserting a key:value pair, even if the key has already been interned? May I have some clarification? Thank you in advance.

synchronizer

1 Answer


You have to use sys.intern() each time you have a new string object; otherwise, you can't guarantee that you have the same object for the value represented.

However, your word_sequence list contains references to interned string objects. You don't have to use sys.intern() again on those. At no point is anything creating a copy of a string here (which would be unnecessary and wasteful).

All sys.intern() does is map the string value to a specific object that has that value. As long as you then keep a reference to the return value, you are guaranteed to still have access to that one specific object.
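As a rough illustration of the points above, here is a minimal sketch. The variable names (parts, canon_a, counts, and so on) are made up for the example, and the "two separate objects" observation is CPython behaviour rather than a language guarantee:

```python
import sys

# Build two equal strings at runtime so they start out as distinct objects.
# (String literals in source code may already be interned by CPython.)
parts = ["inter", "ning"]
a = "".join(parts)
b = "".join(parts)

print(a == b)  # same value
print(a is b)  # in CPython these are normally two separate objects

# sys.intern() maps each value to one canonical object.
canon_a = sys.intern(a)
canon_b = sys.intern(b)
print(canon_a is canon_b)  # both names now refer to the same object

# Containers store references, so the interned object is reused as-is;
# no second sys.intern() call is needed for the list or the dict.
words = [canon_a, canon_b]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

key = next(iter(counts))
print(key is canon_a)  # the dict key is the very same interned object
```

Only a and b, the freshly created strings, need to go through sys.intern(); everything that later holds the returned reference is already pointing at the canonical object.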

Martijn Pieters
  • So to clarify, the code above works. I do *not* need to change word_dictionary[word_i] += 1 into word_dictionary[sys.intern(word_i)] += 1. Is that correct? I wasn't sure how Python dictionaries worked with string keys internally--for example--whether they make separate copies of sorts. – synchronizer Jan 01 '17 at 20:24
  • @synchronizer: they don't make separate copies; all they need to do is store the reference. Most things in Python are just references. And if dictionaries did copy their keys, then there would be no point in Python using interning for all identifiers as it does now. :-) – Martijn Pieters Jan 01 '17 at 20:25
  • I see. Thank you for your help! – synchronizer Jan 01 '17 at 20:26
  • @synchronizer: as it stands, CPython uses interning *heavily*, all *because* `dict` stores references to keys, not copies. It makes dictionary lookups faster because before testing whether the values are equal, Python checks whether you have the same object, which is a simple reference equality test (do the two references point to the same memory location?). – Martijn Pieters Jan 01 '17 at 20:26