
I have a large file which has two numbers per line and is sorted by the second column. I make a dictionary of lists keyed on the first column.

My code looks like this:

from collections import defaultdict

d = defaultdict(list)
for line in fin:  # iterate over lines (fin.readline() would loop over the characters of one line)
    vals = line.split()
    d[vals[0]].append(vals[1])  # key on the first column, collect second-column values
process(d)

However, the input file is too large, so d will not fit into memory.

To get around this I can in principle read the file in chunks, but I need an overlap between the chunks so that process(d) won't miss anything.

In pseudocode I could do the following.

  1. Read 100 lines, building the dictionary d.
  2. Process the dictionary d.
  3. Delete everything from d that is not within 10 of the max second-column value seen so far.
  4. Repeat, making sure we never hold more than 100 lines' worth of data in d at any time.

Is there a nice way to do this in Python?

Update: more details of the problem. I will use d while reading in a second file of pairs: whether I output a pair depends on how many of the values in the list associated with the pair's first value in d are within 10 of its second value. The second file is also sorted by the second column.

Fake data. Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.

1 1
2 1
1 6
7 6
1 16

So now d is {1:[1,6,16],2:[1],7:[6]}.

For the next chunk we only need to keep the last value (as 16 - 6 > 5), so we would set d to {1:[16]} and continue reading the next 4 lines.
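In rough code, the sliding window I have in mind might look something like this (the helper name process_in_chunks, the int() parsing, and the default sizes are just illustrative, and it assumes process can safely see overlapping data more than once):

from collections import defaultdict

def process_in_chunks(fin, process, chunk_lines=100, overlap=10):
    # Sliding window over a file sorted by its second column: fill d with
    # chunk_lines lines, process it, then drop every value more than
    # `overlap` below the largest second-column value seen so far.
    d = defaultdict(list)
    count = 0
    max_seen = float("-inf")
    for line in fin:
        key, val = line.split()
        val = int(val)
        d[key].append(val)
        max_seen = max(max_seen, val)
        count += 1
        if count >= chunk_lines:
            process(d)
            trimmed = defaultdict(list)
            for k, vs in d.items():
                near = [v for v in vs if max_seen - v <= overlap]
                if near:
                    trimmed[k] = near
            d = trimmed
            count = sum(len(vs) for vs in d.values())
            # If every value is still within `overlap` of max_seen, nothing
            # gets trimmed and d stays at the memory limit (an edge case
            # that would need handling).
    if count:
        process(d)  # final partial chunk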

Simd
  • It would be easier to help you if you would describe what you want to do, not how you want to do it. You describe an exact solution to a problem you do not tell us about, but want a nicer solution for that problem. ;-) – Achim Jul 25 '13 at 19:04
  • Yep Achim, this sounds like an XY-problem ;-) http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem – tamasgal Jul 25 '13 at 19:08
  • @Achim I would like to process the large file but it won't fit into memory. As the second column is sorted and the function process(d) only depends on elements with a second column value within 10 (say) this is the solution I came up with to do it in limited memory. The problem is that I am new to python so not sure how to code it. – Simd Jul 25 '13 at 19:10
  • Are the data correlated, or can you, say, process the file chunks in parallel? Why do you need the overlap? Sounds a bit too "paranoid" for me… – tamasgal Jul 25 '13 at 19:11
  • You could process them in parallel as long as you allowed an overlap. The function process(d) only depends on elements with a second column value within 10 (say). – Simd Jul 25 '13 at 19:12
  • It sounds as if a clever use of merge sort would solve your problem much faster than any hand-written code. But as you still don't tell us what you really want to do ... To optimize an algorithm we need to know the algorithm. – Achim Jul 25 '13 at 19:27
  • You might want to take a look at `pandas`: http://pandas.pydata.org/ – Victor Castillo Torres Jul 25 '13 at 20:09

3 Answers


Have you tried out the Pandas library, and in particular reading your data into a DataFrame then using groupby on the first column?

Pandas will let you do a lot of bulk operations effectively across your data, and you can read it in lazily if you want to.
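A minimal sketch of what I mean (the filename, separator, and chunksize here are assumptions about your data, and note that a per-chunk groupby does not by itself handle groups that span chunk boundaries):

import pandas as pd

# Read lazily in chunks; each chunk is an ordinary DataFrame.
chunks = pd.read_csv("pairs.txt", sep=r"\s+", header=None,
                     names=["key", "value"], chunksize=100000)
for chunk in chunks:
    for key, values in chunk.groupby("key")["value"]:
        # values is a Series of all second-column entries for this key
        print(key, values.tolist())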

Kyle Kelley

You don't need a defaultdict unless something strange is going on with the file, and you haven't mentioned what that might be. Instead, use a list, which keeps your data in line order; that way you can process it using the appropriate slices, like this:

d = []
for line in fin:  # iterate over lines, not fin.readline()
    vals = line.split()
    d.append(vals)
    if not len(d) % 100:  # every 100 lines...
        process(d)
        d = d[90:]        # keep the last 10 lines as overlap
process(d)                # process whatever remains
Tom Rose
  • I need a dictionary as I am keying on the first values. So if 3 appears twenty times I want to be able to look up all the values it appeared with. Also, d = d[90:] isn't a function of the *values* in the second column, so you might get the overlap wrong, I think. – Simd Jul 25 '13 at 19:58
  • My understanding was that you wanted to batch the file into chunks of lines, so d=d[90:] *shouldn't* be a function of the values in any way. – Tom Rose Jul 25 '13 at 20:19
  • The problem is that the overlap has to be for a range of 10 in the *second column* values. – Simd Jul 25 '13 at 20:20
  • That creates an edge case where the process will break. For example, suppose that you end up with 100 values all within 10 of the max value seen so far, then you'll get stuck because nothing can be eliminated. You'll need to create an orthogonality between nominal value and line number. I have to go now, but good luck :) – Tom Rose Jul 25 '13 at 20:39
  • Yes that is true. I think in that case the code will just have to report an error. – Simd Jul 25 '13 at 20:41

You could do something like this:

n_process = 100  # lines per chunk
n_overlap = 10   # lines carried over into the next chunk

data_chunk = []
for line in fin:  # iterate over lines, not fin.readline()
    vals = line.split()
    data_chunk.append(vals)
    if len(data_chunk) == n_process:
        process(data_chunk)
        data_chunk = data_chunk[-n_overlap:]  # keep the tail as overlap
process(data_chunk)  # don't forget the final partial chunk

When using a dictionary, data can be overwritten if multiple occurrences of numbers in the first column are present in a data sample. Also note that you would need to use an OrderedDict, since a plain dict does not have a guaranteed order in Python. However, in my opinion, OrderedDict is in most cases a sign of bad code design.
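A toy example of the overwriting concern (made-up data, not your code):

# Plain assignment loses earlier values for a repeated key...
d = {}
for key, val in [("1", "1"), ("1", "6")]:
    d[key] = val  # the second pair clobbers the first
# d is now {'1': '6'} and the value '1' is gone

# ...whereas appending to a list keeps them all.
from collections import defaultdict
d = defaultdict(list)
for key, val in [("1", "1"), ("1", "6")]:
    d[key].append(val)
# d is now {'1': ['1', '6']}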

And by the way: we still don't know why you're trying to do it this way…

tamasgal
  • Is data_chunk.append(vals) the same as data_chunk[vals[0]].append(vals[1])? I mean, does it key on vals[0] somehow? – Simd Jul 25 '13 at 19:34
  • Thank you. The problem I think is with data_chunk = data_chunk[-n_overlap:] (also your answer has got rid of the dictionary I think?). I need the overlap to be a function of the second column values. – Simd Jul 25 '13 at 19:57
  • `data_chunk[vals[0]].append(vals[1])` doesn't really make sense. It means appending `vals[1]` to a list in the dictionary `data_chunk` at the key `vals[0]`. This is definitely not the same as `data_chunk.append(vals)`, which simply adds the `vals` list to the list `data_chunk`. – tamasgal Jul 25 '13 at 19:58
  • `data_chunk = data_chunk[-n_overlap:]` simply "deletes" the old data_chunk and replaces it with the overlap of the list. If `d=['a', 'b', 'c', 'd']`, then `d[-2:]` is a list of the last two elements, namely `['c', 'd']`. – tamasgal Jul 25 '13 at 20:00
  • OK, but I really do want a dictionary of lists so I can look up which values are associated with each value in the first column. The problem with the overlap is that it might be the wrong size: if d = {1:[2,4,60], 2:[4,6,80]} then we only need to keep the value 80, that is, the values within 10 of the max we have seen so far. – Simd Jul 25 '13 at 20:00
  • If you want a dictionary, you should be absolutely sure that no number in the first column occurs more than once within your data sample... – tamasgal Jul 25 '13 at 20:01
  • I updated the pseudocode and hopefully improved it. Does it make more sense now? – Simd Jul 25 '13 at 20:05
  • Thanks for the update but it still doesn't work. The problem is just how to handle the overlap properly. Also, the append prevents any overwriting doesn't it? – Simd Jul 25 '13 at 20:16
  • I think my question has confused people. Please feel free to vote to close. – Simd Jul 25 '13 at 20:35