Update
Revisiting this, DataFrame.apply
is slow AF. Let's look at some other options and then compare.
Other options to DataFrame.apply
numpy.where
This method can be applied when we only have two options. In your case, this is true as we return df.a
when df.a == df.b
or df.a == 'fruit' and df.b == 'both'
. The syntax is np.where(condition, value_if_true, value_if_false)
.
In [42]: df['np_where'] = np.where(
...: ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
...: df.a,
...: 'Unknown'
...: )
numpy.select
You would use this option if you had multiple conditions. The syntax for this is np.select(condition, values, default)
where default
is an optional parameter.
In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')
In [44]: choices = df['a'], df['a']
In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')
Note that for the purposes of demonstration I have created two conditions even if the outcomes yield the same result.
Comparing the options
As you can see, all three methods have the same outcome.
In [47]: df
Out[47]:
a b np_where np_select df_apply
0 fruit fruit fruit fruit fruit
1 vegetable vegetable vegetable vegetable vegetable
2 vegetable both Unknown Unknown Unknown
3 fruit both fruit fruit fruit
But how do they compare in terms of speed? To check this, let's create a newer, larger DataFrame
. We're doing this to see how our options perform with larger amounts of data.
In [48]: df_large = pd.DataFrame({
...: 'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
...: 'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
...: })
In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
...: & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wow! As you can see DataFrame.apply
is much slower than our other two options and np.where
edges out np.select
.
Conclusions
- Use
np.where
if you have only two choices
- Use
np.select
if you have multiple choices
- Don't use
DataFrame.apply
(especially for large data sets)!
Resources
Create your own function and use DataFrame.apply
In [104]: def total(r):
...: if r.a == r.b:
...: return r.a
...: elif (r.a == 'fruit') and (r.b == 'both'):
...: return r.a
...: return 'Unknown'
...:
In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})
In [106]: df
Out[106]:
a b
0 fruit fruit
1 vegetable vegetable
2 vegetable both
3 fruit both
In [107]: df['total'] = df.apply(total, axis=1)
In [108]: df
Out[108]:
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit