I have a large dataset in CSV format that has the following structure:

time,value,id,value2,value3
2002141150250586,23.034,101,35.93,34.39
...
2002141150250586,24.349,2,24.45,67.99

Upon investigating the file, I found that the rows come in batches of 100 datapoints sharing the same timestamp, in descending id order (from 101 to 2).

I was initially able to read the first 100 datapoints using the following code:

import csv
import datetime
import itertools

def main():
    with open('myfile.csv','r',encoding='utf-8-sig') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        # Read only the first batch of 100 rows
        for row in itertools.islice(csv_reader, 0, 100):
            ID = row['id']
            timestamp = datetime.datetime.strptime(row['time'], '%y%m%d%H%M%S%f')
            print(f'{ID}: {timestamp}')

which printed out the first 100 lines, verifiable via the id (descending from 101 to 2).

How do I keep grabbing the subsequent batches of 100 datapoints until EOF, given that the file is very large?

Shan-Desai
  • https://stackoverflow.com/questions/4956984/how-do-you-split-reading-a-large-csv-file-into-evenly-sized-chunks-in-python – Harsha Biyani Apr 16 '20 at 10:26
  • Are the batches always exactly 100 lines? – log0 Apr 16 '20 at 10:36
  • @log0 yes, I was able to go through the file and the last set still had `id` ranging from 101 to 2. I assume that the batches are always exactly 100 lines. – Shan-Desai Apr 16 '20 at 10:40

1 Answer


You can use the following grouping function to group rows in batches of 100:

from itertools import chain, islice

def grouper(iterable, n):
    "grouper('ABCDE', 2) => AB CD E"
    iterable = iter(iterable)
    while True:
        try:
            # Grab the first row of the next group; stop cleanly at EOF
            # (letting StopIteration escape a generator is an error since PEP 479)
            first = next(iterable)
        except StopIteration:
            return
        yield chain([first], islice(iterable, n - 1))

i.e.

import csv

with open('myfile.csv', 'r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for group in grouper(csv_reader, 100):
        for row in group:
            ...

The grouper function keeps at most one row in memory at a time; each group is a lazy iterator over the underlying reader. Note that all groups share that underlying iterator, so consume each group fully before asking for the next one.
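
As a quick sanity check, here is a minimal demonstration of grouper on a plain string (toy data, not the CSV from the question):

for group in grouper('ABCDE', 2):
    print(''.join(group))
# prints:
# AB
# CD
# E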

Slight update in case the batches are not always exactly 100 lines. In that case we group together only consecutive rows that share the same timestamp:

import csv
import itertools

with open('myfile.csv', 'r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for key, group in itertools.groupby(csv_reader, lambda row: row['time']):
        for row in group:
            ...
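
For instance, a minimal sketch of per-batch processing (averaging the value column is just an illustration under the column layout from the question, not something the question asked for):

import csv
import itertools

with open('myfile.csv', 'r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for timestamp, group in itertools.groupby(csv_reader, lambda row: row['time']):
        # Each group holds the consecutive rows sharing one timestamp
        values = [float(row['value']) for row in group]
        print(f'{timestamp}: {len(values)} rows, mean value {sum(values) / len(values):.3f}')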
log0
  • This works well with one exception: each `row` first yields the string `row['time']` and then the complete OrderedDict iterator. I added a `time.sleep` and found the output to be `2002141150250586` – Shan-Desai Apr 16 '20 at 10:49
  • I added an edit whereby in the for loop it might be easier to use `for key_time, group in itertools.groupby()` – Shan-Desai Apr 16 '20 at 10:52
  • 1
    Yes my bad, groupby returns the key and the group. fixing my answer – log0 Apr 16 '20 at 10:53