
There are many methods for creating new columns in Pandas (I may have missed some in my examples, so please let me know if there are others and I will include them here), and I wanted to figure out when each method is best to use. Obviously some methods are better in certain situations than others, but I want to evaluate them holistically, looking at efficiency, readability, and usefulness.

I'm primarily concerned with the first three, but I've included the other approaches simply to show that they're possible. Here's the sample dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

The most commonly known way is to assign to a new column such as df['c'] and use apply:

df['c'] = df['a'].apply(lambda x: x * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6

Using assign can accomplish the same thing:

df = df.assign(c = lambda x: x['a'] * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6

Updated via @roganjosh:

df['c'] = df['a'] * 2
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6

Using map (for an elementwise callable like this, performance is roughly comparable to apply):

df['c'] = df['a'].map(lambda x: x * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6

Creating a new pd.Series and then using concat to bring it into the dataframe:

c = (df['a'] * 2).rename("c")  # df['a'] * 2 is already a Series
df = pd.concat([df, c], axis=1)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6

Using join (note: run this against the original two-column df; after the concat above, df already has a c column, and joining it again would raise an overlapping-columns error):

df.join(c)
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
W Stokvis
  • I'd argue that these are _not_ the most common ways. `df['c'] = df['a'] * 2`. Much more efficient than `lambda` because it will be vectorized. – roganjosh Jul 26 '18 at 14:26
  • Don't use `apply` for vectorized operations, such as `* 2`. Just multiply by the series. `df['c'] = 2*df.a` is what you want. No need to complicate – rafaelc Jul 26 '18 at 14:27
  • @RafaelC I've updated the question with that method. Obviously this is a very simple example and there's an optimal way to do add a column here but I'm more interested in other cases where it might not be so obvious. – W Stokvis Jul 26 '18 at 14:31
  • 2
    Sometimes I like to use `assign` when doing on the fly analysis so, I can create a new copy of the dataframe and revert back to the previous copy for trouble shooting. If you just do df['B'] = df['B'] *2 you've modified the data frame inplace. Where if you used df1 = df.assign(b=df['b']*2), you now have a copy. – Scott Boston Jul 26 '18 at 14:31
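The copy-vs-in-place distinction in that last comment is easy to check directly (a minimal sketch using the same toy frame):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .assign returns a new DataFrame; the original is left untouched
df1 = df.assign(c=df['a'] * 2)
print('c' in df.columns)   # False
print('c' in df1.columns)  # True

# Item assignment, by contrast, modifies df in place
df['c'] = df['a'] * 2
print('c' in df.columns)   # True
```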

3 Answers


A succinct way would be:

df['c'] = 2 * df['a']

No need to compute the new column elementwise.

xyzjayne

Short answer: vectorized calls (df['c'] = 2 * df['a']) almost always win on both speed and readability. See this answer regarding what you can use as a "hierarchy" of options when it comes to performance.


In general, if you have a `for i in ...` loop or a `lambda` present somewhere in a Pandas operation, this (sometimes) means that the resulting calculation calls Python code rather than the optimized C code that Pandas relies on for vectorized operations. (Vectorized operations, by contrast, dispatch to NumPy ufuncs over the underlying .values.)
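A quick sketch of that distinction: both lines below produce the same Series, but the first is a single call over the whole underlying array, while the second invokes the lambda once per element in Python.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Vectorized: one call that operates on the whole underlying array
vec = df['a'] * 2

# Python-level: the lambda is called once for every element
looped = df['a'].apply(lambda x: x * 2)

print(vec.equals(looped))  # True
```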

As for .assign(), it is correctly pointed out in the comments that this creates a copy, whereas you can view df['c'] = 2 * df['a'] as the equivalent of setting a dictionary key/value. The former also takes roughly twice as long, although this comparison is a bit apples-to-oranges, because one operation returns a whole DataFrame while the other just assigns a column.

>>> %timeit df.assign(c=df['a'] * 2)
498 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit -r 7 -n 1000 df['c'] = df['a'] * 2
239 µs ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As for .map(): you generally see this when, as the name implies, you want to provide a mapping for a Series (though it can also be passed a function, as in your question). That doesn't mean it isn't performant; it just tends to be used as a specialized method in the cases I've seen:

>>> df['a'].map(dict(enumerate('xyz', 1)))
0    x
1    y
2    z
Name: a, dtype: object

And as for .apply(): to inject a bit of opinion into the answer, I would argue it's more idiomatic to use vectorization where possible. You can see in the module's source where .apply() is defined: because you are passing a lambda rather than a NumPy ufunc, what ultimately gets called is technically a Cython function, map_infer, but it still calls whatever function you passed on each individual member of the Series df['a'], one at a time.

Brad Solomon

Why are you using a lambda function? You can achieve the above task with

df['c'] = 2 * df['a']

This avoids the per-element function-call overhead entirely.

Anidh Singh