I am parsing two big files (on the order of gigabytes), each of which contains keys and corresponding values. Some keys are shared between the two files, but with differing corresponding values.
For each of the files, I want to write to a new file the keys* and their corresponding values, where keys* are the keys present in both file1 and file2. I don't care about the key order in the output, but it absolutely must be the same in the two output files.
File 1:
key1
value1-1
key2
value1-2
key3
value1-3
File 2:
key1
value2-1
key5
value2-5
key2
value2-2
A valid output would be:
Parsed File 1:
key1
value1-1
key2
value1-2
Parsed File 2:
key1
value2-1
key2
value2-2
Another valid output:
Parsed File 1:
key2
value1-2
key1
value1-1
Parsed File 2:
key2
value2-2
key1
value2-1
An invalid output (keys in differing order in file 1 and file 2):
Parsed File 1:
key2
value1-2
key1
value1-1
Parsed File 2:
key1
value2-1
key2
value2-2
One last detail: the values are far larger than the keys.
What I am thinking of doing is:
1. For each input file, parse it and return a dict (let's call it file_index) whose keys are the keys in the file and whose values are the offsets at which each key was found in the input file.
2. Compute the intersection:
   good_keys = file1_index.viewkeys() & file2_index.viewkeys()
3. Do something like this (pseudo-code):
   for each file:
       for good_key in good_keys:
           offset = file_index[good_key]
           go to offset in input_file
           get corresponding value
           write (key, value) to output file
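To make the plan concrete, here is a rough sketch of what I have in mind. It assumes the layout from the examples above (one key line followed by one value line), and the names build_index, write_selected and filter_common are made up for illustration. It is written so that it runs on Python 3, where dict.keys() already supports `&`; on Python 2 that line would use viewkeys() as above. Both output files are written by iterating over the same, unmodified good_keys set, which is exactly the part I am unsure about.

```python
def build_index(path):
    """Map each key to the byte offset at which its key line starts.

    Assumption: the file alternates key lines and value lines.
    """
    index = {}
    with open(path, "rb") as f:  # binary mode so offsets are plain byte positions
        while True:
            offset = f.tell()
            key = f.readline()
            if not key:
                break  # end of file
            f.readline()  # skip the (large) value; only the position is kept
            index[key.rstrip(b"\r\n")] = offset
    return index


def write_selected(in_path, out_path, index, keys):
    """Write (key, value) pairs for `keys`, in the iteration order of `keys`."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for key in keys:
            src.seek(index[key])
            src.readline()  # the key line itself
            value = src.readline().rstrip(b"\r\n")
            dst.write(key + b"\n" + value + b"\n")


def filter_common(path1, path2, out1, out2):
    """Keep only the keys present in both inputs, one output per input."""
    index1 = build_index(path1)
    index2 = build_index(path2)
    # Python 2 equivalent: index1.viewkeys() & index2.viewkeys()
    good_keys = index1.keys() & index2.keys()
    # The same set object is iterated twice and is not modified in between.
    write_selected(path1, out1, index1, good_keys)
    write_selected(path2, out2, index2, good_keys)
```

Only the keys and offsets are held in memory, so the size of the values should not matter much for the index itself.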
Does iterating over the same set twice guarantee the exact same order (provided that it is the same set: I won't modify it between the two iterations), or should I convert the set to a list first and iterate over the list?