I have a large file which has two numbers per line and is sorted by the second column. I make a dictionary of lists keyed on the first column.
My code looks like
from collections import defaultdict

d = defaultdict(list)
for line in fin:  # fin is the open input file; fin.readline() would loop over one line's characters
    vals = line.split()
    d[vals[0]].append(int(vals[1]))  # values must be numeric for the "within 10" comparisons
process(d)
However, the input file is too large, so d will not fit into memory.
To get round this I can in principle read the file in chunks, but I need an overlap between the chunks so that process(d) won't miss anything.
In pseudocode I could do the following:

- Read 100 lines, creating the dictionary d.
- Process the dictionary d.
- Delete everything from d that is not within 10 of the max value seen so far.
- Repeat, making sure we never have more than 100 lines' worth of data in d at any time (a sketch of this loop follows).
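Something along these lines is what I have in mind. It is only a sketch: the name process_in_chunks is mine, and chunk_size and overlap stand in for the 100 and 10 above.

from collections import defaultdict
from itertools import islice

def process_in_chunks(fin, process, chunk_size=100, overlap=10):
    d = defaultdict(list)
    carried = 0  # number of values carried over from the previous chunk
    while True:
        # Read only enough new lines to keep at most chunk_size values in d;
        # max(1, ...) guarantees progress even if pruning kept everything.
        lines = list(islice(fin, max(1, chunk_size - carried)))
        if not lines:
            break
        for line in lines:
            key, val = line.split()
            d[key].append(int(val))
        max_seen = int(lines[-1].split()[1])  # file is sorted by the second column
        process(d)
        # Keep only values within `overlap` of the max value seen so far.
        pruned = {k: [v for v in vs if max_seen - v <= overlap] for k, vs in d.items()}
        d = defaultdict(list, {k: vs for k, vs in pruned.items() if vs})
        carried = sum(len(vs) for vs in d.values())

One wrinkle is that process(d) sees the carried-over values again in the next chunk, so it has to tolerate duplicates.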
Is there a nice way to do this in Python?
Update: more details of the problem. I will use d while reading in a second file of pairs, where I will output a pair depending on how many of the values in the list associated with its first value in d are within 10 of its second value. The second file is also sorted by the second column.
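For each pair in the second file I imagine a check like this; window and min_matches are my guesses at the exact condition, since it depends on the count:

def emit_pair(d, key, val, window=10, min_matches=1):
    # Count the stored values for this key within `window` of the pair's
    # second value; min_matches is a placeholder for the real threshold.
    matches = sum(1 for v in d.get(key, []) if abs(val - v) <= window)
    if matches >= min_matches:
        print(key, val)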
Fake data. Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.
1 1
2 1
1 6
7 6
1 16
So now d is {1:[1,6,16],2:[1],7:[6]}.
For the next chunk we only need to keep the last value (as 16-6 > 5). So we would set
d to be {1:[16]} and continue reading the next 4 lines.
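Checking the pruning step from the sketch above against this fake data gives the same result:

from collections import defaultdict

d = defaultdict(list, {1: [1, 6, 16], 2: [1], 7: [6]})
max_seen, overlap = 16, 5

pruned = {k: [v for v in vs if max_seen - v <= overlap] for k, vs in d.items()}
d = defaultdict(list, {k: vs for k, vs in pruned.items() if vs})
print(dict(d))  # {1: [16]}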