
I want to improve the execution time of a groupby in Python pandas. I have this code:

df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

The objective is to count how many contracts a client has in a given month and add this information to a new column (Nbcontrats), as in the small example below.

  • Client: client code
  • Month: month of data extraction
  • Contrat: contract number
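
For illustration, here is a tiny made-up example (the client codes and contract numbers below are invented; my real data is much larger). A client with two contracts in the same month should get Nbcontrats = 2 on both of its rows:

import pandas as pd

df = pd.DataFrame({'Client':  [1, 1, 2],
                   'Month':   [3, 3, 3],
                   'Contrat': [10, 11, 12]})

df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
# Client 1 has two contracts in month 3 -> Nbcontrats = 2 on both rows;
# client 2 has one contract -> Nbcontrats = 1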

I want to improve the time. Below I am only working with a subset of my real data:

%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop

df.shape
Out[309]: (7464, 61)

How can I improve the execution time?

Náthali

  • I suggest adding the numpy tag. I remember @Divakar coming up with faster solutions than groupby with np.einsum. – ayhan Jul 01 '16 at 11:05
  • @ayhan, do you mean [this](http://stackoverflow.com/a/36810721/5741205) solution? – MaxU - stand with Ukraine Jul 01 '16 at 11:54
  • @MaxU Yes, it was a very specific case though (easily reshaped). A general solution may not be that straightforward. – ayhan Jul 01 '16 at 11:57
  • @MaxU Not an expert on pandas and I have a query for you pandas gurus. So, I can imagine what `groupby` does here with `df.groupby(['Client', 'Month'])`. But, then what does selecting/indexing the 'Contrat' column achieve with `['Contrat']`? Or is that indexing at all? From my test results that indexing isn't affecting the final result. Any idea on what's going on there? – Divakar Jul 01 '16 at 12:43
  • @Divakar, thank you for such a comment, but I'm far, far away from being a pandas guru ;) About your question - the result is the same, but the execution speed will be much faster when using `df.groupby(['Client', 'Month'])['Contrat'].transform(len)` - when I tested it, it was almost 6 times faster compared to `df.groupby(['Client', 'Month']).transform(len)`. And the more columns a DF has, the slower it will be. – MaxU - stand with Ukraine Jul 01 '16 at 12:54
  • @MaxU But functionally without considering the performance, what exactly is `df.groupby(['Client', 'Month'])['Contrat']` doing? Is it selecting elements from the `'Contrat'` column after grouping? If so, then won't the selection of those 'Contrat' elements affect the final result? But from your comment I sensed that the final result won't be affected if we leave out `'Contrat'` from the equation? So, I am still confused on the use of `'Contrat'`. Sorry if that's a stupid question, `groupby` is a very alienish topic for me :) – Divakar Jul 01 '16 at 13:00
  • @Divakar Normally, `df.groupby(['Col1', 'Col2'])['Col3']` groups the dataframe by `Col1` and `Col2`, and selects `Col3` (without aggregation, just the key (Col1, Col2) and value (Col3) pairs). If you do aggregation, say take the mean, it gives you the mean of Col3 for each group. If you don't specify any columns, just `df.groupby(['Col1', 'Col2'])`, it applies the function to all columns (whenever possible). In this example, the OP is using the function len. Since the length of the group will not change from column to column, it is just an auxiliary column there (see the small sketch after this comment thread). – ayhan Jul 01 '16 at 13:07
  • And if you don't specify a column and you have 10 numeric columns, pandas will return a 10-column dataframe where all columns are the same (all will be the length of the group). – ayhan Jul 01 '16 at 13:12
  • @ayhan Thanks! That makes a lot of sense indeed. Final question if I may : `df.groupby(['Client', 'Month'])['Contrat'].transform(len)` is faster than the version without `['Contrat']` as @ MaxU found out. Is it because the selection of `['Contrat']` does some kinda NumPy like slicing, so it has to process less data and therefore would be faster? Appreciate the patience there :) – Divakar Jul 01 '16 at 13:19
  • @Divakar, I was about to answer your question, but @ayhan was faster... :) As ayhan said, if we use the column selection `['Contrat']`, `transform(len)` will be applied only to that column, otherwise it'll be applied to all columns (available after the groupby operation) – MaxU - stand with Ukraine Jul 01 '16 at 13:24
  • @MaxU and @ayhan Perfecto! Makes total sense now. Thanks really for helping out :) – Divakar Jul 01 '16 at 13:28
  • @Divakar When there are only three columns, the performance of `df.groupby(['Col1', 'Col2'])['Col3'].transform(len)` and `df.groupby(['Col1', 'Col2']).transform(len)` should be very similar. If you test it on small datasets, the first will be faster, and I believe it is because the second one does some type checking and the time it takes to do that is close to the actual operation's time. But if you increase the length of the dataframe, the ratio will approach 1. Of course, when you have more columns, since you are applying the function to more columns, it will take more time. – ayhan Jul 01 '16 at 13:29
  • @Divakar Don't mention it, and thank you for your answer. :) – ayhan Jul 01 '16 at 13:32
  • @ayhan Yup, that's what I was missing out on the performance issue. Thanks again for clearing that up for me, very useful info! – Divakar Jul 01 '16 at 13:32
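
A minimal sketch of the behaviour discussed in this thread (toy column names, not the OP's data): selecting a single column after the groupby makes transform(len) run on that column only, while omitting the selection makes it run on every non-key column.

import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 2],
                   'Col2': [1, 1, 1],
                   'Col3': [5, 6, 7],
                   'Col4': [8, 9, 0]})

# With column selection: a single Series of group lengths
df.groupby(['Col1', 'Col2'])['Col3'].transform(len)

# Without column selection: one identical column of group lengths
# per non-key column (Col3 and Col4 here), hence more work
df.groupby(['Col1', 'Col2']).transform(len)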

2 Answers


Here's one way to proceed:

  • Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea as we would be using NumPy functions later on, which are optimized to work with NumPy arrays.

  • Convert the data from the two columns ['Client', 'Month'] into a single 1D array that is its linear-index equivalent, treating the elements from the two columns as pairs. We can think of the 'Client' elements as row indices and the 'Month' elements as column indices, so this is like going from 2D to 1D. The only issue is deciding the shape of the 2D grid for such a mapping; to cover all pairs, a safe choice is a grid whose dimensions are one more than the max along each column (because of 0-based indexing in Python). Thus, we get linear indices (a tiny worked example follows this list).

  • Next up, we tag each linear index based on its uniqueness among the others. I think this corresponds to the keys obtained with groupby. We also need the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps the respective counts onto each element.
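
As a tiny worked illustration of that 2D-to-1D mapping (toy numbers added here for clarity, not part of the benchmark below):

import numpy as np

# Toy (Client, Month) pairs as rows
pairs = np.array([[0, 2],
                  [1, 0],
                  [1, 2],
                  [1, 2]])

# Grid shape: one more than the max along each column (0-based indexing)
dims = pairs.max(0) + 1               # array([2, 3])

# Each (row, col) pair collapses to a single linear index on that 2x3 grid
np.ravel_multi_index(pairs.T, dims)   # array([2, 3, 5, 5])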

That's the whole idea about it! Here's the implementation -

import numpy as np

# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values

# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)

# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)

# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]

Runtime test

1) Define functions:

def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
    unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
    df["Nbcontrats"] = counts[unqtags]

2) Verify results:

In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
     ...: arr = np.random.randint(0,100,(10000,3))
     ...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
     ...: df1 = df.copy()
     ...: 
     ...: # Run the function on the inputs
     ...: original_app(df)
     ...: vectorized_app(df1)
     ...: 

In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True

3) Finally, time them:

In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
     ...: arr = np.random.randint(0,100,(10000,3))
     ...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
     ...: df1 = df.copy()
     ...: 

In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop

In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop
Divakar
  • That's amazing - it's 246 times faster! Could you please add a short explanation for the numpy solution? – MaxU - stand with Ukraine Jul 01 '16 at 12:58
  • @MaxU Just added some explanation. I did my best there, I generally suck at it though :) – Divakar Jul 01 '16 at 13:55
  • It's perfect - thank you for teaching me numpy! Unfortunately I can't upvote it more than once ;) – MaxU - stand with Ukraine Jul 01 '16 at 14:03
  • `np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)` returns `TypeError: must be str, not int` – Amin Oct 08 '19 at 18:58
  • Is there a way to get the grouped dataframe back? As in `df_grouped = df.groupby ...` – Dustin Aug 07 '20 at 16:33
  • Hi Divakar, I have a related problem [here](https://stackoverflow.com/questions/70314813/how-to-speed-up-groupby-sum-on-a-dask-dataframe-with-5-millions-of-rows-and). I also read another of your answers [here](https://stackoverflow.com/questions/39053734/improving-the-performance-of-repetitive-groupby-operations), but I'm still stuck at applying your ideas of using numpy. Could you have a look at my question? – Akira Dec 11 '21 at 11:48

With the DataFrameGroupBy.size method:

df.set_index(['Client', 'Month'], inplace=True)
df['Nbcontrats'] = df.groupby(level=(0,1)).size()
df.reset_index(inplace=True)

Most of the work goes into assigning the result back into a column of the source DataFrame; setting the (Client, Month) MultiIndex first lets that assignment align the group sizes to the rows by index.
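
If you prefer to avoid the set_index/reset_index round trip, a closely related one-liner (a sketch of an alternative, not benchmarked here) uses the built-in 'size' aggregation through transform:

df['Nbcontrats'] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')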

  • I don't see how it improves the performance, could you compare the run time with the normal groupby? – Amin Oct 08 '19 at 17:54
  • I don't have a concrete benchmark, but in my case a classic groupby ran for maybe more than an hour before finally crashing with an out-of-memory error, whereas the solution provided here with the index ran successfully in about 6 seconds. – Alka Feb 04 '21 at 13:13
  • I got a ~10x speedup on a presorted 500k-row dataframe. – Lefty Jan 03 '23 at 16:01