I have a large CSV file with some string-type columns (dtype object) and other columns that are int64.
Both the string columns and the integer columns can be empty in the CSV. An empty value in an integer column represents zero, while an empty value in a string column should stay an empty string.
I want to sum all the integer columns, grouped by all the other columns. What is the best approach for this?
The final output should be a single aggregated CSV file. The file might not fit into memory as a whole, which is why I tried to use Dask, but I am having a hard time understanding how to do this.
This is working pandas code for a small CSV that fits into memory:
import pandas as pd

# na_filter=False keeps empty cells as empty strings instead of NaN
df = pd.read_csv("input_file.csv", na_filter=False)

# Empty metric cells are therefore "", not NaN: map them to 0 before converting
df[metrics_to_aggregate] = df[metrics_to_aggregate].replace("", 0).apply(pd.to_numeric)

df = df.groupby(dimensions_to_aggregate, as_index=False).sum()
df.to_csv("output_file.csv", index=False)
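When the file does not fit into memory, the closest plain-pandas fallback I can think of is chunked reading with incremental re-aggregation. A sketch on the example data above (the tiny chunksize is only for illustration; it assumes the set of unique dimension combinations fits in memory):

```python
import pandas as pd

# Example input in the same shape as below (empty cells are empty strings)
with open("input_file.csv", "w") as f:
    f.write(
        "dimension1,dimension2,dimension3,metric1,metric2,metric3\n"
        "foo,bar,,1,1,\n"
        "foo,bar,,2,,2\n"
        "foo,bar,1,,4,2\n"
        "foo,bar,foobar,,2,1\n"
    )

metrics_to_aggregate = ["metric1", "metric2", "metric3"]
dimensions_to_aggregate = ["dimension1", "dimension2", "dimension3"]

partials = []
# Stream the file in fixed-size chunks so only one chunk is in memory at a time
for chunk in pd.read_csv("input_file.csv", na_filter=False, chunksize=2):
    # Empty metric cells are "" because of na_filter=False: map them to 0
    chunk[metrics_to_aggregate] = (
        chunk[metrics_to_aggregate].replace("", 0).apply(pd.to_numeric)
    )
    partials.append(chunk.groupby(dimensions_to_aggregate, as_index=False).sum())

# Re-aggregate the per-chunk results, because the same dimension
# combination can appear in several chunks
result = (
    pd.concat(partials)
    .groupby(dimensions_to_aggregate, as_index=False)
    .sum()
)
result.to_csv("output_file.csv", index=False)
```

This works, but it is the kind of bookkeeping I was hoping Dask would do for me.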
For example, the input df can be something like:
  dimension1 dimension2 dimension3  metric1  metric2  metric3
0        foo        bar                   1        1
1        foo        bar                   2                 2
2        foo        bar          1                 4        2
3        foo        bar     foobar                 2        1
Where the output df should be:
  dimension1 dimension2 dimension3  metric1  metric2  metric3
0        foo        bar          1                 4        2
1        foo        bar                   3        1        2
2        foo        bar     foobar      NaN        2        1
Where:
metrics_to_aggregate=['metric1', 'metric2', 'metric3']
dimensions_to_aggregate=['dimension1', 'dimension2', 'dimension3']