
I'm counting occurrences of certain objects within the same backup file:

import json

with open(file_path, encoding='utf-8') as data:
    backend_data = json.load(data)
    users = {}
    sessions = {}

    for key in backend_data.keys():
        users.update(backend_data[key]['users'])

    for key, value in users.items():
        if 'session' in value:
            sessions.update(value['session'])

    print(len(users))
    print(len(sessions))

While I always get the same len result for users, the len for sessions differs almost every time I run my script.

The file is located on my hard drive and isn't altered in any way during the runs. Here are some sample results of 5 runs:

// 1.
users: 819
sessions: 2373

// 2.
users: 819
sessions: 1995

// 3.
users: 819
sessions: 2340

// 4.
users: 819
sessions: 2340

// 5.
users: 819
sessions: 2069

Some additional information about the file: it's 34,535 lines long and has a size of 959 KB.

Why do I get different values for one dictionary but not for the other, when I run my script multiple times?

ezcoding
  • could you provide a link to your file? Sorry I just don't believe it :) Have you tried to reduce file size and getting the same random results (that would do a nice [mcve]) – Jean-François Fabre Mar 09 '17 at 20:20
  • also, shot in the dark, can you try doing `for key, value in sorted(users.items()):` and tell us if that changes something (also sort the keys of the first loop) – Jean-François Fabre Mar 09 '17 at 20:29
  • @Jean-FrançoisFabre Thanks! Sorting the first keys did the trick! The *backend_data* dict consists of multiple versions and older versions, with less entries, were overriding new version with more entries... Please write an answer, so that I can accept it! ;) – ezcoding Mar 09 '17 at 20:36

1 Answer


I may have an idea of what's going on:

Since you're iterating over dictionaries in their natural order, and dictionary order is not guaranteed, you can get nasty side effects when updating.

Between runs, the order can change (see Why items order in a dictionary changed in Python?) because the hash seed is randomized by default.

backend_data[key]['users'] is a dictionary which probably shares some keys across versions. Depending on the order, some entries are overwritten by others, or it's the other way round, which doesn't change the length of the first dictionary.

BUT, when you're iterating on the values (second loop), different data may enter the second dictionary.
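Here's a minimal sketch of that effect, using hypothetical data: two backup versions contain the same user key but different session dicts, so whichever version is merged last "wins", and the session count changes with the merge order:

```python
# Hypothetical data: two versions of the backup holding the same user,
# each with a different 'session' dict.
v1 = {"alice": {"session": {"s1": 1, "s2": 2}}}
v2 = {"alice": {"session": {"s3": 3}}}

def count_sessions(versions):
    users = {}
    for v in versions:
        users.update(v)  # later versions overwrite earlier ones
    sessions = {}
    for value in users.values():
        if "session" in value:
            sessions.update(value["session"])
    return len(sessions)

print(count_sessions([v1, v2]))  # v2 merged last wins -> 1
print(count_sessions([v2, v1]))  # v1 merged last wins -> 2
```

With a randomized iteration order, either outcome can occur on any given run, which matches the fluctuating session counts in the question.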

To fix it, you have to sort your iterables:

import json

with open(file_path, encoding='utf-8') as data:
    backend_data = json.load(data)
    users = {}
    sessions = {}

    for key, bd in sorted(backend_data.items()):
        users.update(bd['users'])

    for key, value in sorted(users.items()):
        if 'session' in value:
            sessions.update(value['session'])

(Note the slight optimization of the first loop: instead of indexing by key, use items() and sort the (key, value) tuples; since keys are unique, this is equivalent to sorting on the keys.)

Note that from Python 3.6 dictionaries preserve insertion order (an implementation detail in 3.6, guaranteed by the language from 3.7), so the problem wouldn't have occurred there.
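A quick check on Python 3.7+, where json.load fills the dict in document order, so repeated runs iterate identically:

```python
import json

# Keys come back in the order they appear in the JSON text,
# not in some hash-dependent order.
data = json.loads('{"b": 1, "a": 2}')
print(list(data))  # ['b', 'a'] on every run
```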

That said, since some values overwrite others, your program has a conceptual flaw: you're not using all the data, and you don't control which parts you're keeping and which you're discarding.

Jean-François Fabre