Process Large (10gb) Time Series CSV file into daily files

Question

I am new to Python 3, coming over from R.

I have a very large time series file (10 GB) which spans 6 months. It is a csv file where each row contains 6 fields: Date, Time, Data1, Data2, Data3, Data4. The "Data" fields are numeric. I would like to iterate through the file and create and write individual files, each containing only one day of data. The individual dates are known only by the fact that the date field suddenly changes. That is, the data excludes weekends, certain holidays, and random closures due to unforeseen events, so the set of unique dates cannot be generated in advance. The number of lines per day is also variable and unknown.

I envision reading each line into a buffer and comparing the date to the previous date.

If the next date = previous date, I append that line to the buffer. I repeat this until next date != previous date, at which point I write the buffer to a new csv file which contains only that day's data (00:00:00 to 23:59:59).

I had trouble appending the new lines with pandas dataframes, and using readline into a list just got too mangled for me. Looking for Pythonic advice.

Possible duplicate of [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file) — ivan_pozdeev, Mar 13 '18 at 01:57
You might want to consider using something like csvkit https://csvkit.readthedocs.io/en/1.0.3/ — OneCricketeer, Mar 13 '18 at 02:00
Possible duplicate of ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) — OneCricketeer, Mar 13 '18 at 02:06
Honestly, though. `grep 'date' file.csv > small-file.csv` will work out better than writing any code if all you're doing is filtering values in the rows — OneCricketeer, Mar 13 '18 at 02:07

score 2 · Answer 1 · answered Mar 13 '18 at 02:35

pandas isn't a good option here because, by default, it reads the entire CSV into memory. The standard csv module iterates row by row and will work better for you. It's pretty straightforward to write nested for loops to read each row and write it out, but you get extra points if you leverage iterators for shorter code.
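For comparison, the explicit-loop version mentioned above might look roughly like this. It is only a sketch: the function name `split_by_day_loop` and its parameters are hypothetical, and it assumes the date is in the first column, the rows are already in chronological order, there is no header row, and dates use `/` as a separator.

```python
import csv
import os

def split_by_day_loop(in_path, out_dir):
    """Write one CSV per run of consecutive rows sharing the same date (column 0).

    Hypothetical helper: assumes chronologically sorted rows and no header row.
    Returns the list of output paths in the order they were created.
    """
    current_date = None
    out_fp = None
    writer = None
    written = []
    with open(in_path, newline='') as in_fp:
        try:
            for row in csv.reader(in_fp):
                if not row:
                    continue  # skip blank lines
                if row[0] != current_date:
                    # The date changed: close the previous day's file, open a new one.
                    if out_fp is not None:
                        out_fp.close()
                    current_date = row[0]
                    out_path = os.path.join(
                        out_dir, current_date.replace('/', '-') + '.csv')
                    out_fp = open(out_path, 'w', newline='')
                    writer = csv.writer(out_fp)
                    written.append(out_path)
                writer.writerow(row)
        finally:
            # Make sure the last day's file is closed even if an error occurs.
            if out_fp is not None:
                out_fp.close()
    return written
```

Only one row is held in memory at a time, so no per-day buffer is needed. If a date could reappear later in the file, opening with `'w'` would overwrite the earlier output, which is why the sorted-input assumption matters.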

itertools.groupby is interesting because it implements the check for a new date for you. After being handed an iterator, it returns iterators that stop whenever a key like the date changes. Those iterators can be consumed by a csv writer.

import csv
import itertools

# newline='' lets the csv module handle line endings itself
with open('test_in.csv', newline='') as in_fp:
    reader = csv.reader(in_fp)
    # groupby starts a new (date, rows) group each time column 0 changes
    for date, row_iter in itertools.groupby(reader, key=lambda row: row[0]):
        out_filename = date.replace('/','-') + '.csv' # todo: name your output file
        with open(out_filename, 'w', newline='') as out_fp:
            csv.writer(out_fp).writerows(row_iter)
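To see what groupby is doing without touching the disk, here is a small illustration on in-memory data. The three sample rows and the date format are made up for the example; the real file's format may differ.

```python
import csv
import io
import itertools

# Made-up sample standing in for the real file (an assumption about its format).
sample = io.StringIO(
    "01/02/2018,09:00:00,1,2,3,4\n"
    "01/02/2018,09:00:01,5,6,7,8\n"
    "01/03/2018,09:00:00,9,10,11,12\n"
)

rows_per_day = {}
for date, row_iter in itertools.groupby(csv.reader(sample), key=lambda row: row[0]):
    # Each group must be consumed before advancing to the next one,
    # because all groups share the same underlying reader.
    rows_per_day[date] = sum(1 for _ in row_iter)

print(rows_per_day)  # {'01/02/2018': 2, '01/03/2018': 1}
```

Note that groupby only merges consecutive rows with the same key, so it relies on the file being sorted by date, which a time series normally is.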

score 0 · Answer 2 · answered Mar 13 '18 at 22:17

What threw me off was that the file object returned by open(...) is itself an iterator over lines. I was calling readline(...) separately after open(...), which unwittingly advanced the iterator and gave bad results.

There is a small problem with the csv writer, which I'll post as a new question.
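The readline pitfall described in this answer can be reproduced in a few lines. This sketch uses io.StringIO as a stand-in for a real file, which is an assumption made only to keep the example self-contained; a file opened with open(...) in text mode behaves the same way here.

```python
import io

# Stand-in for a file opened with open(...).
fp = io.StringIO("line1\nline2\nline3\n")

first = fp.readline()           # consumes the first line
rest = [line for line in fp]    # iteration resumes after it, so line1 is skipped

print(repr(first))  # 'line1\n'
print(rest)         # ['line2\n', 'line3\n']
```

So if the file object is later handed to csv.reader or a for loop, any line already taken with readline() will be missing from the results.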
