14

Update:

The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()
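
For reference, a minimal way to watch such a computation with the local scheduler's progress bar (an assumed setup using dask.diagnostics; not necessarily the exact monitoring code used here):

from dask.diagnostics import ProgressBar

with ProgressBar():
    result = df.groupby('journal_entry').max().reset_index().compute()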

Original:

I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:

result = df.groupby('id').max().reset_index()

Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
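
For what it's worth, the row-wise half of that chunking idea can be written as a two-pass aggregation, since max is associative and per-chunk maxima can be combined afterwards (a sketch only; the chunk size is arbitrary):

import pandas as pd

partials = []
chunk_size = 500_000  # arbitrary; roughly 2 GB of uint8 data per chunk at 4,000 columns
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    partials.append(chunk.groupby('id').max())

# ids that span chunk boundaries are merged by a second groupby-max over the partial results
result = pd.concat(partials).groupby(level=0).max().reset_index()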

What strategy could mitigate the memory issue?

OverflowingTheGlass
  • 2,324
  • 1
  • 27
  • 75

3 Answers

42

If you have any categorical columns in your data (rather than categories stored as object columns or strings), make sure you use the observed=True option in your groupby command. This ensures that groups are only created for combinations that actually appear in the data, e.g. one row per observed customer_id, order_id pair, rather than n_custs * n_orders rows!
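
A small illustration of the difference, using made-up toy data and column names:

import pandas as pd

toy = pd.DataFrame({
    'customer_id': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
    'order_id': pd.Categorical([1, 2, 1], categories=[1, 2, 3]),
    'amount': [10, 20, 30],
})

# observed=False (the long-standing default) builds the full cartesian product
# of the categories: 3 * 3 = 9 rows, most of them empty
dense = toy.groupby(['customer_id', 'order_id'], observed=False).sum()

# observed=True keeps only the combinations that actually occur: 3 rows
sparse = toy.groupby(['customer_id', 'order_id'], observed=True).sum()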

I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.

Heather Walker
  • 541
  • 4
  • 3
  • 2
    You saved my bacon! It literally went from hours to milliseconds! Do you know the reason for them not setting this as default? – Daniel O Aug 01 '19 at 05:23
  • 1
    I experienced a large increase in memory consumption when updating the pandas package from version 0.23.3 to 0.25.1. I assume this flag was enabled by default in earlier versions. – username Aug 28 '20 at 15:19
  • You can go in and upvote the issue on changing the default https://github.com/pandas-dev/pandas/issues/43999 – Peter Mølgaard Pallesen Nov 25 '22 at 09:32
8

You could use dask.dataframe for this task:

import dask.dataframe as dd

# from_pandas needs an explicit npartitions (or chunksize) to know how to split the frame
ddf = dd.from_pandas(df, npartitions=50)
result = ddf.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers several parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed over data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, which is why we have to call .compute() at the end for any computation to actually take place. The library is easy to install because it is mostly written in pure Python.

tobsecret
  • 2,442
  • 15
  • 26
  • I read about dask a long time ago, but promptly forgot it...seems to be working so far. No memory error and my usage isn't climbing. I did have to add an npartitions or chunksize argument...I just put `dd.from_pandas(df, 1000)` - wasn't sure what was appropriate. Any estimate for how long that amount of data might take? Should I change 1000 to a different value? – OverflowingTheGlass Apr 26 '18 at 20:44
  • 1
    [dask.diagnostics](http://dask.pydata.org/en/latest/diagnostics-local.html?highlight=progressbar#) has a progress bar which should be the perfect tool for figuring out how long the computation should take. Groupby operations are notoriously slow in dask, so you may be able to speed the whole operation up by setting 'id' as the index before doing the groupby (a sketch of this is included after this comment thread). – tobsecret Apr 26 '18 at 21:44
  • 1
    I'm also not sure whether you can perform the groupby in pandas and then only have dask handle the rest of the computation. Groupby and index operations are slow in dask since it has to set the index for each partition and then communicate with the other partitions (same for groupby). So generally, if you have fewer partitions, these operations should be faster, but don't quote me on that. – tobsecret Apr 26 '18 at 21:46
  • Thank you very much - this is all fantastic information. When you say "communicate with the other partitions", the `.max()` function would be calculated across partitions, correct? I couldn't tell from the docs. Thanks again - planning on making another run at it this morning. – OverflowingTheGlass Apr 27 '18 at 12:16
  • Also, isn't the `.max()` required for the groupby? Otherwise the groupby wouldn't have anything to group. – OverflowingTheGlass Apr 27 '18 at 12:27
  • Yes, the max and, for that matter, any normal pandas functions work across all partitions; you don't have to worry about it. As for groupby, it does not really require any computation to happen after it (you could, for example, manually loop through the groups in a groupby object in pandas), and groupby is much faster if your dataset fits in memory. But I would just try it the way I had it in my answer and add a progress bar to keep track of how long you need to wait. – tobsecret Apr 27 '18 at 22:30
  • so I tried that, and left it to run for 2 hours...with 0% completed according to the progress bar – OverflowingTheGlass Apr 28 '18 at 14:19
  • Can you try with 50 partitions? – tobsecret Apr 28 '18 at 18:28
  • I tried that - same result! Hours of no progress. Also tried converting the pandas df to parquet, and then reading in the parquet file with dask. No luck with that either. – OverflowingTheGlass Apr 30 '18 at 14:04
  • I added the code used to create the initial df to my question as well. – OverflowingTheGlass Apr 30 '18 at 14:18
  • I am not sure what the problem might be. Probably a good idea to front this to the dask devs: https://stackoverflow.com/users/616616/mrocklin https://stackoverflow.com/users/3821154/mdurant – tobsecret Apr 30 '18 at 17:51
  • I appreciate the help - I will reach out to them. Right now, I have split my data into smaller chunks and run the dask computation on each one alone. – OverflowingTheGlass Apr 30 '18 at 18:14
  • I've been using dask for years and this doesn't solve the memory issue for me. – Matt Elgazar Jun 07 '21 at 01:30
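
Following up on the comment above about setting the grouping column as the index first, a rough sketch of that suggestion, applied to the data from the question (whether it actually helps depends on the dask version and the data layout):

import dask.dataframe as dd

ddf = dd.from_pandas(encoded, npartitions=50)

# set_index shuffles the data so each journal_entry ends up in a single partition
ddf = ddf.set_index('journal_entry')

# grouping by the now-sorted index lets dask aggregate largely within partitions
result = ddf.groupby(ddf.index).max().compute().reset_index()
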
1

As an idea, I would suggest splitting the data column-wise into, say, four subsets, keeping the id in each subset, performing the operation on each one, and then re-merging the results.
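
A rough sketch of that idea, assuming the frame from the original question with its id column:

import numpy as np
import pandas as pd

value_cols = [c for c in df.columns if c != 'id']

parts = []
for cols in np.array_split(value_cols, 4):
    subset = df[['id'] + list(cols)]          # each subset keeps the id column
    parts.append(subset.groupby('id').max())  # aggregate the subset on its own

# the aggregated pieces share the same id index, so they concatenate column-wise
result = pd.concat(parts, axis=1).reset_index()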

klido
  • 98
  • 1
  • 7