I have a pandas DataFrame that looks like the following (with ~1 million rows):
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10
… … … … … … … … … …
… … … … … … … … … …
… … … … … … … … … …
… … … … … … … … … …
I want to do:
grouping = ["Column_1", "Column_2", "Column_3", "Column_4"]
df.groupby(grouping).apply(lambda x: pd.Series({
    'new_column_1': func_1(x),
    'new_column_2': func_2(x),
    'new_column_3': func_3(x),
})).reset_index()
This works, but it is incredibly slow. The functions func_1, func_2, and func_3 are custom functions I want to apply to each group.
I have read other Stack Overflow discussions on why this is slow: pandas groupby + apply runs a Python-level loop over the groups instead of using vectorized operations. But then how can I speed it up?
Let's say, for example, that:
def func_1(x):
    return sum(x["Column_5"] >= x["Column_6"]) / sum(x["Column_5"] <= x["Column_6"])

def func_2(x):
    return max(x["Column_8"]) + min(x["Column_9"])

def func_3(x):
    return len(x)
How could we do the same operation without pandas groupby + apply, for example with numpy?