I have a 3-column CSV file on which I perform a simple calculation with Python and pandas.
The file is very large, just under 4 GB; after the calculation it is about 1.9 GB.
The CSV file looks like this:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum: wherever the values in data1 are identical, sum the data2 values and rewrite the CSV. Example result:
data1,data2
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd

# Read the CSV
df = pd.read_csv('data.csv', sep=',', engine='python')

# Group by data1 and sum data2
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()

# Save to a new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution? It currently takes about 7 hours on a VPS to complete the calculation.
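For example, would dropping engine='python' alone help? A minimal sketch of the same calculation using the default C parser with explicit dtypes (untested on the full file; the dtypes are my guesses from the sample rows):

import pandas as pd

# Same calculation, but with the default C parser (no engine='python')
# and explicit dtypes so pandas can skip type inference
df = pd.read_csv('data.csv', dtype={'data1': str, 'data2': 'int64', 'data3': 'int64'})
df_new = df.groupby('data1', as_index=False)['data2'].sum()
df_new.to_csv('data2.csv', index=False)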
Additional info:
RAM usage is almost always at 100% (8 GB). The choice of engine='python' comes from code I found on https://stackoverflow.com/; honestly, I don't know what that option actually does, but I have seen that the calculation works correctly with it.
data3 is actually useless to me right now (it will probably be useful in the future).
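Since data3 can be dropped, would a chunked read keep RAM below 100%? A sketch of what I have in mind (the chunk size is a guess and would need tuning; usecols skips data3 entirely):

import pandas as pd

# Read the CSV in chunks so the whole 4 GB file is never in RAM at once;
# usecols means data3 is never parsed or stored
chunks = pd.read_csv(
    'data.csv',
    usecols=['data1', 'data2'],
    dtype={'data1': str, 'data2': 'int64'},
    chunksize=1_000_000,
)

# Sum data2 per data1 within each chunk, then combine the partial sums
partial = [chunk.groupby('data1')['data2'].sum() for chunk in chunks]
result = pd.concat(partial).groupby(level=0).sum().reset_index()

result.to_csv('data2.csv', index=False)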