I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.
I have a large DataFrame (~7GB in size) with the following columns - "A", "B", "C", "D". I want to groupby the frame by "A", then for each group: groupby by "B", average the "C" and sum the "D" and then combine all the results to one dataframe. I've tried the following approaches -
1) Creating an empty final DataFrame, Iterating the groupby of "A" doing the processing I need and than pd.concat each group the the final DataFrame. The problem is that pd.concat is extremely slow.
2) Iterating through the groupby of "A", doing the processing I needed and than saving the result to a csv file. That's working ok but I want to find out if there is a more efficient way that doesn't involve all the I/O of writing to disk.
Code examples
First approach - Final DataFrame with pd.concat:
def pivot_frame(in_df_path):
in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
res_cols = in_df.columns.tolist()
res = pd.DataFrame(columns=res_cols)
g = in_df.groupby(by=["A"])
for title, group in g:
temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
temp = temp.reset_index()
temp.insert(0, "A", title)
res = pd.concat([res, temp], ignore_index=True)
temp.to_csv(f, mode='a', header=False, sep=DELIMITER)
return res
Second approach - Writing to disk:
def pivot_frame(in_df_path, ouput_path):
in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
with open(ouput_path, 'w') as f:
csv_writer = csv.writer(f, delimiter=DELIMITER)
csv_writer.writerow(["A", "B", "C", "D"])
g = in_df.groupby(by=["A"])
for title, group in g:
temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
temp = temp.reset_index()
temp.insert(0, JOB_TITLE_COL, title)
temp.to_csv(f, mode='a', header=False, sep=DELIMITER)
The second approach works way faster than the first one but I'm looking for something that would spare me the access to disk all the time. I read about split-apply-combine (e.g. - https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) but I haven't found it helpful.
Thanks a lot! :)