Python - Why aren't all immutable objects always cached?

Question

I am not sure what is happening under the hood, with regard to the Python object model, in the code below.

You can download the data for the ctabus.csv file from this link

import csv

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)  # consume the header row so it is not parsed as data

        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #690072

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

When I call print(len(route_ids)) it prints "690072". Why did Python end up creating this many objects?

I expected this count to be either 185 or 736461: 185 because that is the number of unique routes (the length of the unique_routes set), or 736461 because that is the total number of records in the CSV file.

What is this weird number "690072"?

I am trying to understand why the caching is only partial. Why can't Python perform full caching, something like the code below?

import csv

route_cache = {}

#some hack to cache
def cached_route(routename):
    if routename not in route_cache:
        route_cache[routename] = routename
    return route_cache[routename]

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)  # consume the header row so it is not parsed as data

        for row in rows:
            row[0] = cached_route(row[0]) #cache trick
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #185

id returns the memory location of the object. Some memory might be recycled in your loop and addresses reused, so you get a lower number. What are you trying to achieve? — Peter Wood, Nov 25 '18 at 11:32
I have too many pending edits but can I suggest changing the title of this question to something others are more likely to find? Perhaps, 'Python - Less Unique IDs than Objects', or something like that. — SuperShoot, Nov 25 '18 at 11:42
Regarding your edit: so in the end you are asking why not *all* immutable objects are always cached? — timgeb, Nov 25 '18 at 11:47
@timgeb Yes. I mean whats the thought process behind caching short bits but not the large ones. — ThinkGeek, Nov 25 '18 at 11:48
@PeterWood I have updated the question with details on what I am trying to understand. — ThinkGeek, Nov 25 '18 at 11:50
Good question. I suppose for large strings the equality check of O(len(string)) would be a performance bump if performed with each cache lookup. In addition, the memory set aside for the cache would have to be huge. Also, for user defined classes, there's no such thing as immutable in Python. — timgeb, Nov 25 '18 at 11:52
@timgeb I can buy the point of user defined classes. But in my example routenames are strings. And it is hard to understand why this behaviour. — ThinkGeek, Nov 25 '18 at 11:54
Well it was just a guess. I added the python-internals tag, maybe it attracts somebody who can answer authoritatively. — timgeb, Nov 25 '18 at 11:59

ead · Accepted Answer · 2019-04-13T20:06:58.013

A typical record from the file looks like the following:

rows[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}

That means most of your immutable objects are strings, and only the 'rides' value is an integer.

For small integers (-5 to 256), CPython 3 keeps an integer pool, so these small integers behave as if they were cached (as long as PyLong_FromLong and related functions are used to create them).
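
A minimal sketch of the small-integer pool, assuming CPython 3 (the variable names are only for illustration, and other implementations may behave differently). The integers are built at runtime with int() so that compile-time constant handling does not interfere:

```python
# CPython-specific behaviour; this is an implementation detail.
a = int("256")  # built at runtime, inside the small-integer range
b = int("256")
c = int("257")  # built at runtime, just outside the range
d = int("257")

print(a is b)  # True on CPython: both names refer to the pooled object
print(c is d)  # False on CPython: two separate int objects
print(c == d)  # True: equality is unaffected by pooling
```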

The rules are more complicated for strings: they are, as pointed out by @timgeb, interned. There is a great article about interning; it is about Python 2.7, but not much has changed since then. In a nutshell, the most important rules are:

All strings of length 0 and 1 are interned.
Strings with more than one character are interned if they consist of characters that can be used in identifiers and are created at compile time, either directly or through peephole optimization/constant folding (but in the second case only if the result is no longer than 20 characters, or 4096 since Python 3.7).

All of the above are implementation details, but taking them into account we get the following for the rows[0] above:

'route', 'date', 'daytype', 'rides' are all interned because they are created at compile time of the function read_as_dicts and don't have "strange" characters.
'3' and 'U' are interned because their length is only 1.
'01/01/2001' isn't interned because it is longer than 1 and created at runtime, and it wouldn't qualify anyway because it contains the character /.
7354 isn't from the small integer pool because it is too large. But other entries might be from this pool.
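
A small sketch of the difference between compile-time and runtime strings, assuming CPython (the names key, date1 and date2 are made up for the illustration). The date strings are assembled with str.join so that they are created at runtime, as they would be when read from a file:

```python
import sys

# An identifier-like literal is interned when the code is compiled (CPython).
key = "route"

# Built at runtime: each join produces a new str object.
date1 = "/".join(["01", "01", "2001"])
date2 = "/".join(["01", "01", "2001"])

print(date1 == date2)  # True: same characters
print(date1 is date2)  # False on CPython: two distinct objects

# sys.intern returns the single canonical copy for a given value.
print(sys.intern(date1) is sys.intern(date2))  # True

# A runtime-built "route" maps onto the already interned literal.
print(sys.intern("".join(["ro", "ute"])) is key)  # True on CPython
```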

This was an explanation for the current behavior, with only some objects being "cached".

But why doesn't Python cache all created strings/integers?

Let's start with integers. To check quickly (much faster than O(n)) whether an integer has already been created, one has to keep an additional look-up data structure, which needs additional memory. However, there are so many integers that the probability of hitting an already existing integer again is not very high, so in most cases the memory overhead of the look-up data structure will not be repaid.

Because strings need more memory, the relative (memory) cost of the look-up data structure isn't that high. But it doesn't make any sense to intern a 1000-character string, because the probability that a randomly created string has the very same characters is almost 0!

On the other hand, if for example a hash dictionary is used as the look-up structure, calculating the hash takes O(n) (n being the number of characters), which probably won't pay off for large strings.

Thus, Python makes a trade-off, which works pretty well in most scenarios, but it cannot be perfect in some special cases. For those special scenarios you can optimize by hand using sys.intern().
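
As a sketch of that manual optimization, the cached_route helper from the question could be replaced by sys.intern. The SAMPLE data and the read_routes function below are hypothetical stand-ins for ctabus.csv and read_as_dicts, chosen so that the example runs without the real file:

```python
import csv
import io
import sys

# Hypothetical sample standing in for ctabus.csv.
SAMPLE = """route,date,daytype,rides
151,01/01/2001,U,7354
151,01/02/2001,W,9288
X9,01/01/2001,U,1200
151,01/03/2001,W,9100
"""

def read_routes(text, intern_routes):
    """Return the 'route' field of every row, optionally interned."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    routes = []
    for row in reader:
        route = sys.intern(row[0]) if intern_routes else row[0]
        routes.append(route)
    return routes

plain = read_routes(SAMPLE, intern_routes=False)
interned = read_routes(SAMPLE, intern_routes=True)

# All strings stay alive in the lists, so distinct ids mean distinct objects.
print(len({id(r) for r in plain}))     # typically one per row on CPython
print(len({id(r) for r in interned}))  # 2: one object per distinct route
```

Interned strings are kept in a table managed by the interpreter, so this is mainly worthwhile when the same few values repeat many times, as the route names do here.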

Note: Two objects having the same id are not necessarily the same object if their lifetimes don't overlap, so your reasoning in the question isn't entirely watertight. This is of no consequence in this special case, however.
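
A short sketch of that caveat (the names alive and ids are only illustrative). The first count is guaranteed by the language; the second depends on the memory allocator, so no particular value should be relied on:

```python
# Objects that exist at the same time always have distinct ids.
alive = [object() for _ in range(1000)]
print(len({id(o) for o in alive}))  # 1000

# Temporaries that are freed immediately may have their address reused.
ids = {id(object()) for _ in range(1000)}
print(len(ids))  # often far fewer than 1000 on CPython, but not guaranteed
```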

Also, I'd assume hashing in order to find out if a string already has been cached would come with some performance cost. — timgeb, Nov 27 '18 at 19:14

score 4 · Answer 2 · answered Nov 25 '18 at 11:36

There are 736461 elements in rows.

Thus, you are adding id(row['route']) to the set route_ids 736461 times.

Since whatever id returns is guaranteed to be unique among simultaneously existing objects, we would expect route_ids to end up with 736461 items, minus the number of 'route' values that are small enough to be cached and are therefore the same object as the 'route' value of another row.

Turns out that in your specific case, that number is 736461 - 690072 == 46389.

Caching of small immutable objects (strings, integers) is an implementation detail you should not rely on - but here's a demo:

>>> s1 = 'test' # small string
>>> s2 = 'test'
>>> 
>>> s1 is s2 # id(s1) == id(s2)
True
>>> s1 = 'test'*100 # 'large' string
>>> s2 = 'test'*100
>>> 
>>> s1 is s2
False

In the end, there's probably a semantic error in your program. What do you want to do with the unique ids of the Python objects?

Thanks for the answer. I am just trying to understand how python caches the memory under the hood. I have updated the question accordingly. — ThinkGeek, Nov 25 '18 at 11:43
