
My goal is to perform some basic calculations on the first occurring rows and assign the result to a new column in the dataframe.

For a simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})

# drop duplicates 
first = df.drop_duplicates(subset='A', keep='first').copy()
%timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()

This gives:

532 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If I reset the index it becomes almost 2x faster (in case the difference was due to some caching, I reran the two versions in different orders multiple times; the results were the same):

# drop duplicates but reset index
first = df.drop_duplicates(subset='A', keep='first').reset_index(drop=True).copy()
%timeit  first['H'] = first['A']*first['B'] + first['C']

342 µs ± 7.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Although it's not that big a difference, I wonder what causes it. Thanks.

UPDATE:

I redid this simple test; the issue was not index related. It seems to have something to do with copying the dataframe:

In [1]: import pandas as pd
In [2]: import numpy as np

In [3]: df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})

In [4]: # drop duplicates
   ...: first = df.drop_duplicates(subset='A', keep='first').copy()
   ...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
558 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: # drop duplicates
   ...: first = df.drop_duplicates(subset='A', keep='first')
   ...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
/Users/sam/anaconda3/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/Users/sam_dessa/anaconda3/bin/python
20.7 ms ± 826 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Making a copy and assigning a new column took ~532 µs, but operating directly on the dataframe itself (for which pandas also raised a warning) took 20.7 ms. Same original question: what is causing this? Is it simply the time spent on emitting the warning?
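One way to isolate whether the warning machinery (rather than the arithmetic) accounts for the overhead: pandas lets you disable the chained-assignment check with `pd.set_option('mode.chained_assignment', None)`. The sketch below is a diagnostic only, not something I'd keep in real code, and it assumes the check itself (not just printing the warning) is what's being timed:

```python
import numpy as np
import pandas as pd

# Diagnostic sketch: disable pandas' chained-assignment check and redo the
# assignment. If the ~20 ms overhead disappears under %timeit, the cost is
# the SettingWithCopy bookkeeping, not the column arithmetic itself.
pd.set_option('mode.chained_assignment', None)

df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})
first = df.drop_duplicates(subset='A', keep='first')  # no .copy()

# Same assignment as above; with the check disabled, no warning is raised.
first['H'] = first['A'] * first['B'] + first['C'] - first['D'].max()
print('H' in first.columns)
```

Re-timing the assignment with `%timeit` after setting the option would show whether the slow path is the check or the assignment itself.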

  • Why doesn't your second `%timeit` have `- first['D'].max()`? – Terry Mar 01 '19 at 17:27
  • Ahhh... my apologies, I made a silly mistake; it was a typo. After adding that back it turns out to be the same performance. I'm not sure why though, since I was expecting them to be different: I'm using this to simulate a real data setting where I found reset_index gives faster speed. Let me see if I can replicate that. – Sam Mar 01 '19 at 17:32
  • @Terry Thanks for spotting the mistake, please see the updates. – Sam Mar 01 '19 at 22:10
  • I cannot answer your question, but I discovered this [explanation](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing) in the pandas docs. This [thread](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986) looks promising too. I hope I have helped you find a direction for your question. – Terry Mar 01 '19 at 22:50
