
Imagine we have a big file with rows as follows:

ID     value     string
1      105       abc 
1      98        edg
1      100       aoafsk
2      160       oemd
2      150       adsf 
...

Say the file is named file.txt and is tab-separated.

I want to keep the largest value for each ID. The expected output is

ID     value     string
1      105       abc 
2      160       oemd
...

How can I read it in chunks and process the data? If I read the data in chunks, how can I make sure that at the end of each chunk the records for each ID are complete?

StayLearning

2 Answers


Keep track of the data in a dictionary of this format:

data = {
    ID: [value, 'string'],
}

As you read each line from the file, check whether that ID is already in the dict. If not, add it; if it is and the value you just read is bigger, replace the stored entry.

At the end, the dict will hold the largest value (and its string) for each ID.

# start with an empty dict
data = {}

# open the input file
with open('file.txt', 'r') as fp:

    # skip the header row ("ID  value  string")
    next(fp)

    # read each line
    for line in fp:

        # grab ID, value, string
        item_id, item_value, item_string = line.split()

        # convert ID and value to integers
        item_id = int(item_id)
        item_value = int(item_value)

        # if the ID is not in the dict at all, or if the value we just
        # read is bigger, keep the current values
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

for item_id in data:
    print(item_id, data[item_id][0], data[item_id][1])

Dictionaries don't enforce any specific ordering of their contents, so at the end of your program when you get the data back out of the dict, it might not be in the same order as the original file (i.e. you might see ID 2 first, followed by ID 1).

If this matters to you, you can use an OrderedDict, which retains the original insertion order of the elements.
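
For example, a minimal sketch of the same loop with an OrderedDict swapped in (assuming the same file.txt layout as in the question):

from collections import OrderedDict

data = OrderedDict()

with open('file.txt', 'r') as fp:
    next(fp)  # skip the header row
    for line in fp:
        item_id, item_value, item_string = line.split()
        item_id = int(item_id)
        item_value = int(item_value)
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

# IDs come back out in the order they first appeared in the file
for item_id, (item_value, item_string) in data.items():
    print(item_id, item_value, item_string)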

(Did you have something specific in mind when you said "read by chunks"? If you meant a specific number of bytes, then you might run into issues if a chunk boundary happens to fall in the middle of a word...)
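
If you do want to try fixed-size chunks, one way to handle that boundary problem is to carry the trailing partial line over into the next chunk. A minimal sketch (the 64 KB chunk size is just an illustrative choice):

data = {}
leftover = ''

with open('file.txt', 'r') as fp:
    fp.readline()                    # skip the header row
    while True:
        chunk = fp.read(64 * 1024)   # read 64 KB at a time
        if not chunk:
            break
        chunk = leftover + chunk
        lines = chunk.split('\n')
        leftover = lines.pop()       # last piece may be an incomplete line
        for line in lines:
            if not line.strip():
                continue
            item_id, item_value, item_string = line.split()
            item_id = int(item_id)
            item_value = int(item_value)
            if item_id not in data or item_value > data[item_id][0]:
                data[item_id] = [item_value, item_string]

# process whatever is left after the final chunk (file may lack a trailing newline)
if leftover.strip():
    item_id, item_value, item_string = leftover.split()
    item_id = int(item_id)
    item_value = int(item_value)
    if item_id not in data or item_value > data[item_id][0]:
        data[item_id] = [item_value, item_string]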

John Gordon
  • Your answer works properly. One question: which is more efficient, processing line by line, reading in chunks and processing chunk by chunk, or reading the whole file and processing it all at once (if memory permits)? I am still curious about how the chunks would work, so I will leave the question open for a while. If there is no better answer, I will accept yours. Thanks. – StayLearning Oct 26 '16 at 22:48
  • I doubt there would be much difference, as the OS is already doing input buffering under the hood. – John Gordon Oct 26 '16 at 23:29

Code

import csv
import itertools as it
import collections as ct


with open("test.csv") as f:                                
    reader = csv.DictReader(f, delimiter=" ")              # 1
    for k, g in it.groupby(reader, lambda d: d["ID"]):     # 2
        print(max(g, key=lambda d: float(d["value"])))     # 3

# {'value': '105', 'string': 'abc', 'ID': '1'}
# {'value': '160', 'string': 'oemd', 'ID': '2'}

Details

The with block ensures the file f is safely opened and closed. The file object is iterable, which lets you loop over it line by line or, as done here, feed it to itertools functions.

  1. For each line of f, csv.DictReader splits the data and maintains the header-row information as key-value pairs of a dictionary, e.g. [{'value': '105', 'string': 'abc', 'ID': '1'}, ...

  2. This data is iterable and is passed to groupby, which chunks the data into groups by ID. See this post for more details on how groupby works; there is also a short sketch after this list.

  3. The max() builtin, combined with a key function, returns the dict with the largest "value" in each group. See this tutorial for more details on the max() function.
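
As a quick illustration of step 2, here is a minimal sketch with a few in-memory rows (made up to mirror the sample data, not read from the file). Note that groupby only groups consecutive items, so this approach assumes the rows are already ordered by ID, as they are in the question's sample:

import itertools as it

rows = [
    {"ID": "1", "value": "105", "string": "abc"},
    {"ID": "1", "value": "98",  "string": "edg"},
    {"ID": "2", "value": "160", "string": "oemd"},
]

for k, g in it.groupby(rows, lambda d: d["ID"]):
    best = max(g, key=lambda d: float(d["value"]))     # largest "value" within this ID's chunk
    print(k, best)

# 1 {'ID': '1', 'value': '105', 'string': 'abc'}
# 2 {'ID': '2', 'value': '160', 'string': 'oemd'}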

pylang