I have a large CSV file with some string-type columns (dtype object) and other columns of type int64.
Both the string columns and the integer columns can be empty in the CSV. An empty value in an integer column represents zero, and an empty string should stay an empty string.

I want to aggregate all the integer columns over all the other columns. What is the best approach for this?

The final output should be a single aggregated CSV file. The input might not fit into memory as a whole, which is why I tried to use Dask, but I am having a hard time understanding how to do this.

This is working pandas code for a small CSV that fits into memory:

import pandas as pd

# metrics_to_aggregate and dimensions_to_aggregate are listed below.
df = pd.read_csv("input_file.csv", na_filter=False)  # keep "" as "" in the string columns
# With na_filter=False, empty metric cells come in as "" (not NaN),
# so map them to zero before the numeric conversion.
df[metrics_to_aggregate] = df[metrics_to_aggregate].replace("", 0).apply(pd.to_numeric)
df = df.groupby(dimensions_to_aggregate, as_index=False).sum()
df.to_csv("output_file.csv", index=False)

For example, the input df can be something like:

  dimension1 dimension2 dimension3 metric1 metric2 metric3
0        foo        bar                  1       1
1        foo        bar                  2               2
2                   foo        bar       1       4       2
3        foo        bar     foobar               2       1

Where the output df should be:

  dimension1 dimension2 dimension3  metric1  metric2  metric3
0                   foo        bar        1        4        2
1        foo        bar                   3        1        2
2        foo        bar     foobar      NaN        2        1

Where:

metrics_to_aggregate=['metric1', 'metric2', 'metric3']
dimensions_to_aggregate=['dimension1', 'dimension2', 'dimension3']
  • Add some example data that represents your problem; usually 5-10 rows is enough. That way we can see visually what you are trying to achieve, and we can copy it to reproduce an answer for you. – Erfan May 29 '19 at 13:42
  • Thanks @Erfan - Added a very simple example to show the case. – A. Sarid May 29 '19 at 13:53
  • Look at [this](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) question/answers. Read your csv in chunks and process them with your code (don't use `apply(pd.to_numeric)`; use `astype(float)`). – Erfan May 29 '19 at 15:05
  • I cannot read in chunks, as I need to aggregate over all rows. – A. Sarid May 29 '19 at 15:06
  • If you can sort first (e.g. the unix sort command will do this for arbitrarily large files), then you can aggregate in Python with minimal memory requirements. – strubbly May 29 '19 at 15:32
  • @strubbly Thanks! Sounds like a great idea. Can you please show me how to do that? – A. Sarid May 29 '19 at 15:40

1 Answer

If the file is unsorted, it is hard to process without using a lot of memory: you need to keep a running aggregation for every key (combination of dimension values) that appears in the file. There might be a good way to do that, but it depends on details such as how many distinct keys there are. You could process the file in chunks and then combine the per-chunk results, but you would still need enough memory to hold the running totals for every distinct key.
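
For illustration, here is a minimal sketch of that chunked approach, assuming the set of distinct keys and their running totals fits in memory; the column lists follow the question, and the chunksize is arbitrary:

import pandas as pd

metrics = ['metric1', 'metric2', 'metric3']
dimensions = ['dimension1', 'dimension2', 'dimension3']

running = None
# Stream the CSV in fixed-size chunks; only the running per-key
# totals are kept in memory between chunks.
for chunk in pd.read_csv("input_file.csv", na_filter=False, chunksize=100_000):
    # Empty metric cells are "" because of na_filter=False; count them as zero.
    chunk[metrics] = chunk[metrics].replace("", 0).astype(float)
    partial = chunk.groupby(dimensions, as_index=False)[metrics].sum()
    running = partial if running is None else (
        pd.concat([running, partial])
          .groupby(dimensions, as_index=False)[metrics].sum())

running.to_csv("output_file.csv", index=False)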

A simple and fairly generic solution is to sort first. The unix sort command will happily sort files that are too large to fit into memory. The sorted file can then easily be processed in chunks. Here's a sequence that shows the principle; you may need to adapt some details:

First, I extended your file a little to show what's going on and removed the header line (which sort will treat as data):

input_file.csv:
foo bar     1   1   
a       a   9   9   9
z   z       8   8   8
a       a   9   9   9
foo bar     2       2
    foo bar 1   4   2
foo bar foobar      2   1
z   z       7   7   7
a       a   9   9   9

Then I used the command (--key=1,3 makes sort use the first three whitespace-separated fields, i.e. the dimension columns, as the sort key):

sort input_file.csv -o input_file_sorted.csv --key=1,3

This gave me:

input_file_sorted.csv:
a       a   9   9   9
a       a   9   9   9
a       a   9   9   9
    foo bar 1   4   2
foo bar     1   1   
foo bar     2       2
foo bar foobar      2   1
z   z       7   7   7
z   z       8   8   8

Then I ran this Python program:

import csv

number_of_dims = 3   # leading columns that form the key
number_of_aggs = 3   # trailing columns to sum

def aggregate(agg, data):
    # Add each non-empty field to the running totals;
    # empty fields count as zero.
    for i, d in enumerate(data):
        if d != "":
            agg[i] += int(d)

with open("input_file_sorted.csv", newline="") as f1:
    with open("output_file.csv", "w", newline="") as f2:
        csv_reader = csv.reader(f1, delimiter='\t')
        csv_writer = csv.writer(f2, delimiter='\t')
        # No header row is written: the header line was stripped
        # before sorting, so the input has none either.
        key = None
        agg = [0] * number_of_aggs
        for l in csv_reader:
            new_key = l[:number_of_dims]
            if key is None:
                key = new_key
            if key != new_key:
                # The sort guarantees all rows with the same key are
                # adjacent, so once the key changes we can emit the group.
                csv_writer.writerow(key + agg)
                agg = [0] * number_of_aggs
                key = new_key
            aggregate(agg, l[number_of_dims:])
        if key is not None:   # guard against an empty input file
            csv_writer.writerow(key + agg)

and that got me:

output_file.csv:
a       a   27  27  27
    foo bar 1   4   2
foo bar     3   1   2
foo bar foobar  0   2   1
z   z       15  15  15
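
As an aside, since the question mentions Dask: the same aggregation can also be expressed with dask.dataframe, which handles the chunking internally. This is only a sketch under the question's empty-value rules, untested on a genuinely large file; you may need explicit dtype= hints in read_csv if the type inference trips over the empty cells:

import dask.dataframe as dd

dimensions = ['dimension1', 'dimension2', 'dimension3']
metrics = ['metric1', 'metric2', 'metric3']

df = dd.read_csv("input_file.csv")
# Empty dimension cells are read as NaN; turn them back into empty
# strings, otherwise groupby would silently drop those rows.
for c in dimensions:
    df[c] = df[c].fillna("")
# Empty metric cells represent zero.
for c in metrics:
    df[c] = df[c].fillna(0)

result = df.groupby(dimensions).sum().reset_index()
result.to_csv("output_file.csv", single_file=True, index=False)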

Hope that helps!
