I have a Pandas dataframe and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.
import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df
a b c
---------------
0 4 5 19
1 1 2 0
2 2 5 9
3 8 2 5
x = df.sample(n=2)
x
a b c
---------------
3 8 2 5
1 1 2 0
def get_new(row):
    a, b, c = row
    # Pick a random 'c' from rows with the same 'b' but a different 'a' and 'c'.
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)
y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x
a b c new
--------------------
3 8 2 5 0
1 1 2 0 5
Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.
The above works, except that it is quite slow (it takes about 15 seconds for me). I also tried iterating with x.itertuples()
instead of apply
, but there was not much of an improvement in this case.
It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some way I could achieve this faster?
Can the filtering (in the
get_new
function) be modified or made more efficient compared to the chained boolean conditions I currently use? Can I use NumPy here in some way for a speedup?
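One possible direction (a sketch, not a measured fix, and the grouping step and helper names are my own illustration): since the filter always requires df['b'] == b, the candidate rows for each value of 'b' can be precomputed once with groupby. Each sampled row then only has to exclude its own 'a' and 'c' from a small NumPy array instead of re-filtering the full ~4-million-row dataframe:

```python
import random
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])

# Precompute, per value of 'b', the arrays of 'a' and 'c' values.
# This walks the big dataframe once instead of once per sampled row.
groups = {b: (g['a'].to_numpy(), g['c'].to_numpy())
          for b, g in df.groupby('b')}

def get_new_fast(row):
    a_vals, c_vals = groups[row['b']]
    # Keep only candidates whose 'a' and 'c' both differ from this row's.
    mask = (a_vals != row['a']) & (c_vals != row['c'])
    return random.choice(c_vals[mask])

x = df.sample(n=2).copy()
x['new'] = x.apply(get_new_fast, axis=1)
```

The masking inside get_new_fast operates on plain NumPy arrays, so the per-row cost depends on the group size rather than the size of the whole dataframe.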
Edit: df.sample()
is also quite slow, and I cannot use .iloc
or .loc
directly, since I further modify the sample and do not want these modifications to affect the original dataframe.
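For what it's worth, a plain .iloc selection can be decoupled from the original dataframe by taking an explicit copy; this is a sketch of that idea (it sidesteps the view-versus-copy issue rather than speeding up sampling itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])

# Pick row positions ourselves, then take an explicit copy so that
# later modifications cannot write back into df.
idx = np.random.choice(len(df), size=2, replace=False)
x = df.iloc[idx].copy()
x['new'] = -1  # modifies only the copy, not df
```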