
I'm looking for a more efficient way than repeated pd.concat to combine pandas DataFrames.

I have a large DataFrame (~7 GB) with the following columns - "A", "B", "C", "D". I want to group the frame by "A", then for each group: group by "B", average "C" and sum "D", and then combine all the results into one DataFrame. I've tried the following approaches -

1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then pd.concat-ing each group onto the final DataFrame. The problem is that pd.concat is extremely slow.

2) Iterating over the groupby of "A", doing the processing I need, and then saving each result to a csv file. That works OK, but I want to find out if there is a more efficient way that doesn't involve all the I/O of writing to disk.

Code examples

First approach - Final DataFrame with pd.concat:

import pandas as pd
import numpy as np

def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res_cols = in_df.columns.tolist()
    res = pd.DataFrame(columns=res_cols)
    g = in_df.groupby(by=["A"])
    for title, group in g:
        # aggregate the current "A" group by "B"
        temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # this copies the whole accumulated frame on every iteration
        res = pd.concat([res, temp], ignore_index=True)
    return res
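A variant of this approach that avoids the repeated copying is to collect the per-group results in a list and concatenate once at the end. A sketch along the same lines (the function name is made up, and I haven't timed it on the real data):

def pivot_frame_list(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    pieces = []
    for title, group in in_df.groupby(by=["A"]):
        temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum}).reset_index()
        temp.insert(0, "A", title)
        pieces.append(temp)
    # a single concat copies the data only once, instead of once per group
    return pd.concat(pieces, ignore_index=True)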

Second approach - Writing to disk:

import csv

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    with open(output_path, 'w') as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        g = in_df.groupby(by=["A"])
        for title, group in g:
            temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # index=False keeps the columns aligned with the header row
            temp.to_csv(f, header=False, index=False, sep=DELIMITER)

The second approach works much faster than the first one, but I'm looking for something that would spare me the constant disk access. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) but I haven't found it helpful.

Thanks a lot! :)

  • Why not groupby with a MultiIndex so you can group by ‘A’ and ‘B’ in one go, instead of iterating through the groups of ‘A’? This will also save you the concatenation effort. Do you have some sample data? – Niels Henkens Apr 27 '19 at 07:16
  • Unfortunately I can't supply sample data. Can you elaborate on the groupby multi-index? – Guy Apr 27 '19 at 07:23
  • Have you considered Dask? https://docs.dask.org/en/latest/ – run-out Apr 27 '19 at 07:31

1 Answer


Solved

So Niels Henkens' comment really helped, and the solution is to just -

result = in_df.groupby(by=["A","B"]).agg({"C": np.mean, "D": np.sum})
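To get "A" and "B" back as regular columns (like in the loop versions), add a reset_index(). A tiny made-up example of what this produces:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [10.0, 20.0, 30.0, 40.0],
    "D": [1, 2, 3, 4],
})

result = df.groupby(by=["A", "B"]).agg({"C": np.mean, "D": np.sum}).reset_index()
#    A  B     C  D
# 0  x  1  15.0  3
# 1  x  2  30.0  3
# 2  y  1  40.0  4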

Another improvement in performance is to use Dask -

import dask.dataframe as dd
import numpy as np

df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER)
g = df.groupby(by=["A", "B"]).agg({"C": np.mean, "D": np.sum}).compute().reset_index()
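The reason this helps with a ~7 GB file is that dd.read_csv splits the input into partitions and the aggregation runs partition by partition, so the whole file never has to sit in memory at once; only .compute() materializes the (much smaller) aggregated result as a pandas DataFrame. If the default partitioning doesn't suit your machine, the partition size can be tuned via the blocksize parameter of dd.read_csv (the value below is just an illustration):

import dask.dataframe as dd

# read the CSV in ~64 MB partitions instead of the default block size
df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER, blocksize="64MB")
g = df.groupby(by=["A", "B"]).agg({"C": "mean", "D": "sum"}).compute().reset_index()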