
I have a pandas DataFrame that looks like the following (with ~1 million rows):

Column_1    Column_2    Column_3    Column_4    Column_5    Column_6    Column_7    Column_8    Column_9    Column_10
…           …           …           …           …           …           …           …           …           …
…           …           …           …           …           …           …           …           …           …
…           …           …           …           …           …           …           …           …           …
…           …           …           …           …           …           …           …           …           …

I want to do:

import pandas as pd

grouping = ["Column_1", "Column_2", "Column_3", "Column_4"]
df.groupby(grouping).apply(lambda x: pd.Series({
    'new_column_1': func_1(x),
    'new_column_2': func_2(x),
    'new_column_3': func_3(x),
})).reset_index()

This works, but it is incredibly slow. func_1, func_2 and func_3 are custom functions I want to apply to each group.

I read other Stack Overflow discussions on why this is so slow; the reason seems to be that pandas `groupby` + `apply` loops over the groups in Python rather than vectorizing. How can I speed this up?

Let's say, for example, that:

def func_1(x):
    return sum(x["Column_5"] >= x["Column_6"]) / sum(x["Column_5"] <= x["Column_6"])

def func_2(x):
    return max(x["Column_8"]) + min(x["Column_9"])

def func_3(x):
    return len(x)

How could we do the same operation without pandas `groupby` + `apply`, for example with numpy?

  • `groupby` + `apply` defaults to a loop over the groups. Instead you want to rewrite your aggregations in terms of the built-in groupby aggregations, which are all optimized in `cython`. For instance, `func_3` is just `GroupBy.size()`, and `func_2` can be written as two separate aggregations, `GroupBy.min()` + `GroupBy.max()` (see the sketch after these comments). Two separate groupbys will be faster, as `.groupby()` itself doesn't really _do_ anything: https://stackoverflow.com/questions/63306888/general-groupby-in-python-pandas-fast-way/63307380#63307380 – ALollz Jun 29 '21 at 14:31
  • "vectorization" means performing the loops in compiled code. That requires knowing enough of either `pandas` or `numpy` to know which operations (usually `methods`) are these fast(er) building blocks. I don't know much about `pandas` internals, but for `numpy` the faster stuff works with the whole numeric array in a "parallel" fashion. It does iterate at the low level, but you, as user, doesn't care about the order. – hpaulj Jun 29 '21 at 15:10
  • @ALollz I actually made up func_2 and func_3, but what I really need is func_1 in this example. Is there a way to do that with pandas groupby aggregations? Otherwise, I am OK with using numpy, but how do I replicate the groupby logic? – Antonio Paladini Jun 29 '21 at 15:23
  • `numpy` doesn't have `groupby` tools. – hpaulj Jun 29 '21 at 22:01
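
To make the first comment concrete, here is a minimal sketch (reusing the column names from the question) of `func_3` and `func_2` rewritten as built-in, cythonized aggregations with no `apply` involved:

import pandas as pd

grouping = ["Column_1", "Column_2", "Column_3", "Column_4"]
g = df.groupby(grouping)

# func_3 is just the group size
new_column_3 = g.size()

# func_2 is two separate built-in aggregations added together
new_column_2 = g["Column_8"].max() + g["Column_9"].min()

result = pd.concat({"new_column_2": new_column_2,
                    "new_column_3": new_column_3}, axis=1).reset_index()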

1 Answer


Looks like you want to compare the values of 2 different columns in each row, tally the results of the row-by-row comparisons, and then do math on the tallies. If so, make 2 new columns that hold the results of the comparisons, then sum those new columns and compare the totals: vectorization rather than iteration. See this toy example:

import pandas as pd

row1list = [1, 2]
row2list = [5, 3]
row3list = [5, 4]
row4list = [5, 5]
df = pd.DataFrame([row1list, row2list, row3list, row4list],
                  columns=['Column_5', 'Column_6'])

df[['col5 >= col6', 'col6 <= col5']] = 0, 0  # start with 0, else you get NaN or 1 in the next comparison

df.loc[df['Column_5'] >= df['Column_6'], 'col5 >= col6'] = 1
df.loc[df['Column_5'] <= df['Column_6'], 'col6 <= col5'] = 1
print(df)
#    Column_5  Column_6  col5 >= col6  col6 <= col5
# 0         1         2             0             1
# 1         5         3             1             0
# 2         5         4             1             0
# 3         5         5             1             1

answer_of_func1 = sum(df['col5 >= col6']) / sum(df['col6 <= col5'])
print(answer_of_func1)
# 1.5
– Hammurabi
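
The same idea extends to `func_1` from the question: compute the two boolean comparison columns once over the whole frame, then let the fast built-in `GroupBy.sum()` do the per-group tallies. A hedged sketch (the `ge`/`le` helper column names are invented for illustration, and it assumes no group sums to a zero denominator):

import pandas as pd

grouping = ["Column_1", "Column_2", "Column_3", "Column_4"]

# vectorized comparisons over the whole frame; booleans sum as 0/1
df["ge"] = df["Column_5"] >= df["Column_6"]
df["le"] = df["Column_5"] <= df["Column_6"]

sums = df.groupby(grouping)[["ge", "le"]].sum()
new_column_1 = (sums["ge"] / sums["le"]).rename("new_column_1").reset_index()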