Efficient python pandas equivalent/implementation of R sweep with multiple arguments

Question

Other questions attempting to provide the Python equivalent of R's sweep function (like here) do not really address the case of multiple arguments, where it is most useful.

Say I wish to apply a two-argument function to each row of a DataFrame, pairing it with the matching element from a column of another DataFrame:

df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
sweep(df,1, FUN="*",df2$X)

In Python I got the equivalent using apply on what is basically a loop over the row positions.

import numpy as np
import pandas as pd

df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))

I highly doubt this is efficient in pandas. What is a better way of doing this?

Both bits of code should result in a DataFrame/matrix of 6 numbers when applying * (10, 22, 36 in column A and 110, 132, 156 in column B).

I should state clearly that the aim is to plug one's own function into this sweep-like behavior, say:

df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
myFunc = function(a,b) { floor((a + b)^min(a/2,b/3))  }
sweep(df,1, FUN=myFunc,df2$X)

resulting in:

     A B
[1,] 3 4
[2,] 3 4
[3,] 3 5

What is a good way of doing that in python pandas?

I don't get why you need `sweep` here, because you are just doing `df*df2$X`. Could you give an example where you actually need `sweep`? – denis, Feb 10 '19 at 21:39
I amended the question. Thanks for the comment, the multiplication was not the focus of the question. – crogg01, Feb 10 '19 at 21:57
What is actually your question? Sure, you can use `apply`, [but I wouldn't recommend its use unless you're sure there's no better option](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code). Now, what is "better" really depends on the function you're implementing. Without more context, your problem cannot be satisfactorily addressed as an answer, sorry. – cs95, Feb 13 '19 at 02:20
If you are sure an iterative solution is the only one, [there are better alternatives using loops and comprehensions](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care), not to mention numba and cython for JIT compilation and better performance. So, _what is actually your question_? – cs95, Feb 13 '19 at 02:21
It would instead be better to _update_ your question to add more information about your actual problem. – cs95, Feb 14 '19 at 06:07

Prodipta Ghosh · Accepted Answer · 2019-02-14T15:07:44.523

If I understand this correctly, you are looking to apply a binary function f(x, y) to a DataFrame (supplying x) row-wise, with y taken from a Series. One way to do this is to borrow the implementation from pandas internals. If you want to extend this function (e.g. to apply along columns), it can be done in a similar manner, as long as f is binary. If you need more arguments, you can use functools.partial on f to make it binary.
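To illustrate the last point, here is a minimal sketch of reducing a three-argument function to a binary one with `functools.partial`. The function `weighted_sum` and its `weight` parameter are hypothetical and exist only for this example. The column-wise `df.apply` call is one possible way to pair each column with the series, not necessarily the fastest.

```python
from functools import partial

import pandas as pd


def weighted_sum(x, y, weight):
    # Hypothetical three-argument function, used only for illustration.
    return x + weight * y


df = pd.DataFrame({"A": range(1, 4), "B": range(11, 14)})
df2 = pd.DataFrame({"X": range(10, 13), "Y": range(10000, 10003)})

# Bind the extra argument so that only (x, y) remain free.
g = partial(weighted_sum, weight=2)

# Each column of df is paired element-wise with df2.X,
# which amounts to a row-wise sweep.
result = df.apply(lambda col: g(col, df2.X))
print(result)
```

Since `g` now takes exactly two arguments, it can be passed anywhere a binary function is expected.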

import pandas as pd
from pandas.core.dtypes.generic import ABCSeries

def sweep(df, series, FUN):
    assert isinstance(series, ABCSeries)

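    # row-wise application
    assert len(df) == len(series)
    # NOTE: _combine_match_index is a private pandas method and may not
    # exist in every pandas version.
    return df._combine_match_index(series, FUN)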


# define your binary operator
def f(x, y):
    return x * y

# the input data frames
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )

# apply
test1 = sweep(df, df2.X, f)

# performance
# %timeit sweep(df, df2.X, f)
# 155 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# another method
import numpy as np
test2 = pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))

# %timeit performance
# 1.54 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# all() on a DataFrame iterates over its column labels, so compare element-wise instead
assert (test1 == test2).all().all()

Hope this helps.
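The same row-wise sweep can also be sketched with public pandas and NumPy calls only, which avoids depending on a private method. This is a minimal sketch, not the pandas implementation: `sweep_rows` and `my_func` are hypothetical names, values are paired by position rather than by index label, and `func` is assumed to accept NumPy arrays. `my_func` ports the R `myFunc` from the question, where `min` is taken over all elements, so the exponent is a single scalar.

```python
import numpy as np
import pandas as pd


def sweep_rows(df, series, func):
    """Apply func(x, y) with y broadcast across each row of df.

    Hypothetical helper: pairs values by position, not by index label.
    """
    if len(df) != len(series):
        raise ValueError("df and series must have the same length")
    # Reshape to a column vector so it broadcasts across the columns of df.
    y = series.to_numpy()[:, None]
    out = func(df.to_numpy(), y)
    return pd.DataFrame(out, index=df.index, columns=df.columns)


def my_func(a, b):
    # Port of the R myFunc from the question; as in R, min() spans
    # all elements of a/2 and b/3 and yields one scalar exponent.
    exponent = min((a / 2).min(), (b / 3).min())
    return np.floor((a + b) ** exponent)


df = pd.DataFrame({"A": range(1, 4), "B": range(11, 14)})
df2 = pd.DataFrame({"X": range(10, 13), "Y": range(10000, 10003)})

result = sweep_rows(df, df2["X"], my_func)
print(result)
```

With the question's data this gives the same six values as the R output (3, 3, 3 in column A and 4, 4, 5 in column B), returned as floats. Any function that works on broadcastable arrays can be passed in place of `my_func`.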

score 1 · Answer 2 · answered Feb 10 '19 at 21:26

In pandas

df.mul(df2.X,axis=0)
    A    B
0  10  110
1  22  132
2  36  156

BENY


Very good, +1, but what if it is not a multiplication (i.e. not a built-in on pandas.DataFrame)? – crogg01 Feb 10 '19 at 21:28
@HansRoggeman For example? BTW pandas is index-sensitive, which means that when you combine values from two DataFrames, rows with matching index labels get a value and rows without a match return NaN. – BENY Feb 10 '19 at 21:28
OK, say I wanted to multiply both numbers by a different uniform random number and then get the closest prime to the sum of the two products? But really, anything not built-in is good... I am just looking for the ability to plug in my own multi-argument function in this `sweep`-like operation. – crogg01 Feb 10 '19 at 21:30
@HansRoggeman for example `n=range(df.shape[0]) `;`np.multiply(df.iloc[n,:],df2.iloc[n,:]['X'])` – BENY Feb 10 '19 at 21:39
That is pretty close to what I had. It might not really be possible to do this in pandas, I will accept the answer if nothing comes up after setting a bounty. Thank you! – crogg01 Feb 10 '19 at 22:01
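The index-alignment behavior mentioned in the comments above can be seen in a small sketch. The data and the shifted index here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [11, 12, 13]}, index=[0, 1, 2])
# A series whose index only partly overlaps with the index of df.
s = pd.Series([10, 11, 12], index=[1, 2, 3])

# Label-based: rows are matched on index labels; labels present on
# only one side (0 and 3 here) produce NaN.
aligned = df.mul(s, axis=0)
print(aligned)

# Position-based: dropping the labels pairs values purely by position.
positional = df.mul(s.to_numpy(), axis=0)
print(positional)
```

Whether label-based or position-based pairing is wanted depends on the data; R's `sweep` pairs by position.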
