
I have a Pandas dataframe and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.

import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df

    a   b   c
---------------
0   4   5   19
1   1   2   0
2   2   5   9
3   8   2   5

x = df.sample(n=2)
x

    a   b   c
---------------
3   8   2   5
1   1   2   0

def get_new(row):
    a, b, c = row
    # pick a random 'c' value from rows that share 'b' but differ in both 'a' and 'c'
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x

    a   b   c   new
--------------------
3   8   2   5   0
1   1   2   0   5

Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.

The above works, except that it is quite slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, but there was not much of an improvement in this case.

  1. It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some faster way I could achieve this?

  2. Can the filtering (in the get_new function) be modified or made more efficient compared to the chained boolean conditions I currently use?

  3. Can I use numpy here in some way for a speedup?

Edit: df.sample() is also quite slow, and I cannot use .iloc or .loc directly since I am further modifying the sample and do not want this to affect the original dataframe.
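For completeness, a minimal sketch of how the sample could be drawn without df.sample() while still keeping it independent of the original dataframe (the explicit .copy() is what prevents later modifications from propagating back to df; whether this is actually faster than df.sample() would need to be measured, and the variable names are only illustrative):

import numpy as np

# pick 2 distinct row positions at random, then take an independent copy
idx = np.random.choice(len(df), size=2, replace=False)
x = df.iloc[idx].copy()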

swathis
  • what about [dask](https://dask.pydata.org/en/latest/) ? – Pierluigi Mar 01 '18 at 15:40
  • @Pierluigi - I am hopeful that there is already a better (and more efficient) way to do this in Pandas or Numpy before looking at other libraries. But thanks for pointing this out. Will certainly have a look. – swathis Mar 01 '18 at 15:50
  • https://stackoverflow.com/a/47149876/6361531 – Scott Boston Mar 01 '18 at 17:26
  • I created *tabel* as a simplified version of pandas. It's entirely based on numpy but abstracts away the details of dealing with multiple datatypes, and numpy is usually faster in get/set scenarios like yours. https://pypi.org/project/tabel/ – Bastiaan Sep 13 '18 at 05:07

1 Answer


I see a reasonable performance improvement by using .loc rather than chained indexing:

import random, pandas as pd, numpy as np

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

df = pd.concat([df]*1000000)

x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)


%timeit x.apply(lambda row: get_new(row), axis=1)   # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1)  # 119ms
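
A further sketch, going beyond the timing comparison above (not benchmarked, and assuming the number of sampled rows is small relative to df): since every call filters on df['b'] == b, df can be grouped on 'b' once up front so each lookup only scans the matching group. The names groups and get_new3 are illustrative.

groups = {b: grp for b, grp in df.groupby('b')}   # built once, in a single pass over df

def get_new3(row):
    a, b, c = row
    grp = groups[b]                                # only the rows whose 'b' matches
    candidates = grp.loc[(grp['a'] != a) & (grp['c'] != c), 'c'].values
    return random.choice(candidates)

x.apply(lambda row: get_new3(row), axis=1)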
jpp
  • Tried this, but unfortunately no improvement in my case. It still takes the same amount of time. – swathis Mar 01 '18 at 21:41
  • @swathis. That's interesting. Can you copy-paste my code and confirm whether or not this has to do with your data being different to mine? – jpp Mar 01 '18 at 21:43
  • Your code block works. Using .loc gives me a speedup of ~50 ms too. Could my issue be because I am doing the indexing multiple times within the function, based on some other conditions? Also, df.sample is really slow for me. – swathis Mar 01 '18 at 22:45
  • @swathis, yes maybe this isn't even the bottleneck. What I would recommend is split your code into many small functions and run your code through a profiler to pinpoint the cause of the performance drop. You can use `cProfile.run` to do the profiling and it gives output by function. – jpp Mar 02 '18 at 02:11
  • 1
    Using cProfile was useful. There was even an addon for PyCharm which made it easier. Thanks :-) – swathis Mar 06 '18 at 09:06
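
For reference, a minimal sketch of the cProfile.run suggestion from the comments above (profile_me is a placeholder for whichever small function you want to measure):

import cProfile

def profile_me():
    x.apply(lambda row: get_new2(row), axis=1)

# prints a per-function breakdown, sorted by cumulative time
cProfile.run('profile_me()', sort='cumulative')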