I have a 3-column CSV file on which I perform a simple calculation with Python and pandas.
The file is very large, just under 4 GB; after the calculation it is about 1.9 GB.
The CSV file looks like this:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum: wherever the values in data1 are identical, sum the data2 values and rewrite the CSV. Example result:
data1,data2
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd

# Read the CSV
df = pd.read_csv('data.csv', sep=',', engine='python')

# Group by data1 and sum data2
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()

# Save to a new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution? It currently takes about 7 hours on a VPS to complete the calculation.
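For example, would dropping engine='python' alone help? A minimal sketch of the same calculation using the default C parser with explicit dtypes (untested on the full file; the dtypes are my guesses from the sample rows):

import pandas as pd

# Same calculation, but with the default C parser (no engine='python')
# and explicit dtypes so pandas can skip type inference
df = pd.read_csv('data.csv', dtype={'data1': str, 'data2': 'int64', 'data3': 'int64'})
df_new = df.groupby('data1', as_index=False)['data2'].sum()
df_new.to_csv('data2.csv', index=False)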
Additional info:
RAM usage is almost always at 100% (8 GB). The choice of engine='python' comes from code I found on https://stackoverflow.com/; honestly, I don't know what that option actually does, but I have seen that the calculation works correctly with it.
data3 is actually useless to me right now (it will probably be useful in the future).
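Since data3 can be dropped, would a chunked read keep RAM below 100%? A sketch of what I have in mind (the chunk size is a guess and would need tuning; usecols skips data3 entirely):

import pandas as pd

# Read the CSV in chunks so the whole 4 GB file is never in RAM at once;
# usecols means data3 is never parsed or stored
chunks = pd.read_csv(
    'data.csv',
    usecols=['data1', 'data2'],
    dtype={'data1': str, 'data2': 'int64'},
    chunksize=1_000_000,
)

# Sum data2 per data1 within each chunk, then combine the partial sums
partial = [chunk.groupby('data1')['data2'].sum() for chunk in chunks]
result = pd.concat(partial).groupby(level=0).sum().reset_index()

result.to_csv('data2.csv', index=False)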