
I've recently adopted a new programming style in Pandas, where I have single-responsibility functions that return a Series.

The benefit I have found is that with large dataframes (100+ columns), I can take a slice of only the data I need to perform the calculation. It feels more efficient, but I am not sure how Pandas works under the hood.

Does anyone know (a) if this is a recommended style to adopt, and (b) what the memory/time ramifications of it are?

Example:

import pandas as pd

df = pd.DataFrame({'a': range(0, 10000),
                   'b': range(0, 10000),
                   # etc
                   'z': range(0, 10000)})

def f(df):
    df = df[['a', 'b']].copy()
    df['New Column'] = df['a'] * df['b']
    return df['New Column'].astype(int)

df['New Column'] = f(df)

1 Answer


Doing it your way is slower.

I used the following dataframe:

import pandas as pd
import string

df = pd.DataFrame({key: range(0, 10000) for key in string.ascii_lowercase})

Then, in a Jupyter notebook, I used the %%timeit cell magic to measure how long the following pieces of code take to run:

%%timeit

def f(df):
    df = df[['a', 'b']].copy()
    df['New Column'] = df['a'] * df['b']
    return df['New Column'].astype(int)

df['New Column'] = f(df)

This took 1.05 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).

Whereas the following piece of code

%%timeit

df['New Column'] = df['a'] * df['b']
df['New Column'].astype(int)

only took 340 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).

About 3 times faster!

These two pieces of code are not completely functionally equivalent: the second one does not make the copy that the first one does. Adding a copy to the second piece of code results in this:

%%timeit
df2 = df[['a','b']].copy()
df['New Column'] = df2['a'] * df2['b']
df['New Column'].astype(int) 

This runs in 820 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each). A less dramatic difference, but still faster than your method; adding the copy more than doubled the compute time!

So what happens when we move the definition of your function outside of the timed cell?

We run this piece of code first:

import pandas as pd
import string

df = pd.DataFrame({key: range(0, 10000) for key in string.ascii_lowercase})

def f(df):
    df = df[['a', 'b']].copy()
    df['New Column'] = df['a'] * df['b']
    return df['New Column'].astype(int) 

and then this piece of code:

%%timeit

df['New Column'] = f(df)

This runs in 1.03 ms ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each), still slower than not using your function.
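For reference, if you are not in a Jupyter notebook, a rough equivalent using the standard-library timeit module might look like this (a sketch, assuming the df and f defined above; it reports a single mean rather than %%timeit's mean ± std. dev. of 7 runs):

import timeit

# Time 1000 calls of f(df) and report the mean time per call.
elapsed = timeit.timeit(lambda: f(df), number=1000)
print(f"{elapsed / 1000 * 1e6:.1f} µs per loop")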

Note: runtimes depend on your computer's specs and on other background tasks running. Mileage may vary.

That was the objective, facts-based part of my answer.

As to the question of whether your method is the preferred one: I would say no.

1) The way you do it in your example function f is very hard-coded.

Your function f only works for columns a and b. What if you want to do it on columns n and m? Currently you would need to write a completely new function to process those columns. If you want to stick with this general style, it would be better to make your code more general, as sketched below. TL;DR: do not hardcode.
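A more general version might look like this (a sketch; the name multiply_columns and its column parameters are my own illustration, not from your question, and it assumes the df built above):

def multiply_columns(df, col_a, col_b):
    # Return the elementwise product of two columns as an int Series,
    # without slicing or copying the dataframe.
    return (df[col_a] * df[col_b]).astype(int)

df['New Column'] = multiply_columns(df, 'a', 'b')
df['Other Column'] = multiply_columns(df, 'n', 'm')

Note that this version also drops the copy entirely, which leads into the next point.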

2) You make a copy of the dataframe, which does not speed up the final calculation (it does require extra resources, though, resulting in extra overhead).
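If you want to see what the copy alone costs, you can time just the slice-and-copy step in its own cell (a sketch in the same %%timeit style as above; I have not run this, and absolute numbers will vary by machine):

%%timeit
df[['a', 'b']].copy()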

  • I'm aware that copy() will incur a performance hit, since I am duplicating the dataframe in memory. I do this, however, to avoid the `SettingWithCopyWarning` issue. I don't want to alter a view of the dataframe, but doing a copy() might be doing unnecessary heavy lifting. – Tom Cusack-Huang Jan 10 '18 at 14:18
  • Personally I prefer the accepted answer from this [question](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) to solve the SettingWithCopyWarning. Always using a copy() because you might sometimes get the SettingWithCopyWarning seems very inefficient to me, especially when you don't need to worry about that warning, like in your example case (see the sketch below). – error Jan 10 '18 at 14:27
  • Agreed. I guess I'm trying to understand more about how Pandas manages memory behind the scenes. One of my main areas of interest is taking a column subset of only the required columns, rather than pulling in the whole 100-column-wide dataframe. But I guess maybe it doesn't matter that non-needed data is going into a function. – Tom Cusack-Huang Jan 10 '18 at 14:58
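To make the comment thread concrete: SettingWithCopyWarning is triggered by assigning into a slice of a dataframe (chained indexing). A version of f that never creates a sliced sub-DataFrame sidesteps the warning without copy() (a minimal sketch of the idea, not code from the linked answer; it assumes the df from above):

def f(df):
    # Work on the Series directly: no sliced sub-DataFrame is created,
    # so there is no view to assign into, no SettingWithCopyWarning,
    # and no need for copy().
    return (df['a'] * df['b']).astype(int)

df['New Column'] = f(df)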