-1

I am parsing two big files (Gb size order), that each contains keys and corresponding values. Some keys are shared between the two files, but with differing corresponding values. For each of the files, I want to write to a new file the keys* and corresponding values, with keys* representing keys present both in file1 and file2. I don't care on the key order in the output, but the should absolutely be in the same order in the two files.

File 1:

key1
value1-1
key2
value1-2
key3
value1-3

File2:

key1
value2-1
key5
value2-5
key2
value2-2

A valid output would be:

Parsed File 1:

key1
value1-1
key2
value1-2

Parsed File 2:

key1
value2-1
key2
value2-2

An other valid output:

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key2
value2-2
key1
value2-1

An invalid output (keys in differing order in file 1 and file 2):

Parsed File 1:

key2
value1-2
key1
value1-1

Parsed File 2:

key1
value2-1
key2
value2-2

A last precision is that value sizes are by far bigger than key sizes.

What I am thinking to do is :

  • For each input file, parse and return a dict (let's call it file_index) with keys corresponding to the keys in the file, and values corresponding to the offset where the key was found in the input file.

  • Compute the intersection

    good_keys = file1_index.viewkeys() & file2_index.viewkeys()
    
  • do something like (pseudo-code) :

    for each file:
        for good_key in good_keys:
            offset = file_index[good_key]
            go to offset in input_file
            get corresponding value
            write (key, value) to output file
    

Does iterating over the same set guarantee me to have the exact same order (providing that it is the same set: I won't modify it between the two iterations), or should I convert the set to a list first, and iterate over the list?

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Braba
  • 1
  • 1
  • 3

3 Answers3

7

Python's dicts and sets are stable, that is, if you iterate over them without changing them they are guaranteed to give you the same order. This is from the documentation on dicts:

Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.

Daniel Roseman
  • 588,541
  • 66
  • 880
  • 895
  • Exactly what I was looking for (but didn't find it in the set documentation)! Thanks! – Braba Feb 26 '15 at 09:58
  • 6
    Pedantry: the documentation passage you've quoted only guarantees this for dicts. Is there anywhere that the same guarantee is explicitly given for sets? – Mark Amery Aug 15 '18 at 17:40
3

Iteration over an un-modified set will always give you the same order. The order is informed by the current values and their insertion history.

See Why is the order in dictionaries and sets arbitrary? if you are interested in why that is.

Note that if you want to modify your files in place, then that'll only work if your entries have a fixed size. Files cannot be updated somewhere in the middle where that update consists of fewer or more characters than the characters you replaced.

Data in files is like a magnetic tape, you'd have to splice in longer or shorter pieces to replace data in the middle, but you can't do that with a file. You'd have to rewrite everything following the replaced key-value pair to make the rest fit.

Community
  • 1
  • 1
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
0

As already stated out dicts and sets are stable and provide the same order as long as you don't change it. If you want a specific order you can use OrderedDict

From the collections library docs:

>>> from collections import OrderedDict

>>> # regular unsorted dictionary
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}

>>> # dictionary sorted by key -- OrderedDict(sorted(d.items()) also works
>>> OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])

>>> # dictionary sorted by value
>>> OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])

>>> # dictionary sorted by length of the key string
>>> OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])
user937284
  • 2,454
  • 6
  • 25
  • 29