Python iteration order on a set

Question

I am parsing two big files (Gb size order), that each contains keys and corresponding values. Some keys are shared between the two files, but with differing corresponding values. For each of the files, I want to write to a new file the keys* and corresponding values, with keys* representing keys present both in file1 and file2. I don't care on the key order in the output, but the should absolutely be in the same order in the two files.

File 1:

key1
value1-1
key2
value1-2
key3
value1-3

File2:

key1
value2-1
key5
value2-5
key2
value2-2

A valid output would be:

Parsed File 1:

key1
value1-1
key2
value1-2

Parsed File 2:

key1
value2-1
key2
value2-2

An other valid output:

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key2
value2-2
key1
value2-1

An invalid output (keys in differing order in file 1 and file 2):

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key1
value2-1
key2
value2-2

A last precision is that value sizes are by far bigger than key sizes.

What I am thinking to do is :

For each input file, parse and return a dict (let's call it file_index) with keys corresponding to the keys in the file, and values corresponding to the offset where the key was found in the input file.

Compute the intersection

good_keys = file1_index.viewkeys() & file2_index.viewkeys()

do something like (pseudo-code) :

for each file:
    for good_key in good_keys:
        offset = file_index[good_key]
        go to offset in input_file
        get corresponding value
        write (key, value) to output file

Does iterating over the same set guarantee me to have the exact same order (providing that it is the same set: I won't modify it between the two iterations), or should I convert the set to a list first, and iterate over the list?

are you actually using a dict or a set? The order won't change once you don't change the set — Padraic Cunningham, Feb 26 '15 at 09:34
That's a lot of text for a simple question! Put the question at the beginning, and delete anything else which doesn't improve the question. — Peter Wood, Feb 26 '15 at 09:37

score 7 · Accepted Answer · answered Feb 26 '15 at 09:36

7

Python's dicts and sets are stable, that is, if you iterate over them without changing them they are guaranteed to give you the same order. This is from the documentation on dicts:

Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.

answered Feb 26 '15 at 09:36

Daniel Roseman

588,541
66
880
895

Exactly what I was looking for (but didn't find it in the set documentation)! Thanks! – Braba Feb 26 '15 at 09:58
6

Pedantry: the documentation passage you've quoted only guarantees this for dicts. Is there anywhere that the same guarantee is explicitly given for sets? – Mark Amery Aug 15 '18 at 17:40

score 3 · Answer 2 · edited May 23 '17 at 12:28

Iteration over an un-modified set will always give you the same order. The order is informed by the current values and their insertion history.

See Why is the order in dictionaries and sets arbitrary? if you are interested in why that is.

Note that if you want to modify your files in place, then that'll only work if your entries have a fixed size. Files cannot be updated somewhere in the middle where that update consists of fewer or more characters than the characters you replaced.

Data in files is like a magnetic tape, you'd have to splice in longer or shorter pieces to replace data in the middle, but you can't do that with a file. You'd have to rewrite everything following the replaced key-value pair to make the rest fit.

Thanks for the precision on why it is deterministic! – Braba Feb 26 '15 at 09:59 — Braba, Feb 26 '15 at 09:59

score 0 · Answer 3 · answered Feb 26 '15 at 09:48

As already stated out dicts and sets are stable and provide the same order as long as you don't change it. If you want a specific order you can use OrderedDict

From the collections library docs:

>>> from collections import OrderedDict

>>> # regular unsorted dictionary
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}

>>> # dictionary sorted by key -- OrderedDict(sorted(d.items()) also works
>>> OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])

>>> # dictionary sorted by value
>>> OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])

>>> # dictionary sorted by length of the key string
>>> OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])

Python iteration order on a set

3 Answers3

Linked