I know there is more than one way to approach this and get the job done. Are there any considerations other than performance when choosing whether to use apply with a lambda? I have a particularly large dataframe with a column of emails, and I need to strip the '@domain' from all of them. There is the simple:

DF['PRINCIPAL'] = DF['PRINCIPAL'].str.split("@", expand=True)[0]

and then the Apply Lambda:

DF['PRINCIPAL'] = DF.apply(lambda x: x['PRINCIPAL'].split("@")[0], axis=1)

I assume they are roughly equivalent, but their method of execution will mean they are each more efficient in certain situations. Is there anything I should know?

  • Avoid both. A simple list comprehension will most likely outperform using the .str accessor. – Scott Boston Aug 12 '19 at 20:48
  • AFAIK lambda is better than a `for` loop but worse than vectorized operations. Avoid it if possible; only use it when there's no other option. Having said that, I'm not sure how `lambda` fares against `.str` methods. – Juan C Aug 12 '19 at 20:50
  • @ScottBoston I'm not sure how to strip the last part of the email using list comprehension. Can you point to any documentation I can look at or provide an example? – Drew Aschenbrener Aug 12 '19 at 20:54
  • `DF['PRINCIPAL'] = [x.strip('@')[0] for x in DF['PRINCIPAL']]` – Quang Hoang Aug 12 '19 at 20:56
  • And you can see [this question](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care) for a test where list comprehension is faster than the `.str` accessor. – Quang Hoang Aug 12 '19 at 20:58
  • @QuangHoang Thank you for the link! So while vectorized approaches are normally preferable, when acting on strings, list comprehension seems to be preferable. The only thing that isn't clear to me is why the strip() function is better than str.split(). Aren't they essentially doing the same thing? – Drew Aschenbrener Aug 12 '19 at 21:05
  • The .str accessor is very loopy inside the source code, plus there is the overhead of pandas. In most situations, list comprehension outperforms using the .str accessor. – Scott Boston Aug 12 '19 at 21:10
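
On the strip() versus split() question raised in the comments: the two are not interchangeable. A minimal sketch in plain Python, using a made-up address, suggests that the `strip('@')[0]` form shown above returns only the first character, so `split('@')[0]` is probably what was intended:

```python
# Hypothetical address, used only for illustration.
email = "drew@example.com"

# str.strip removes the given characters from both ENDS of the string.
# There is no '@' at either end, so the string is unchanged and [0]
# picks out its first character.
stripped_first = email.strip("@")[0]

# str.split cuts the string at every '@' and returns the pieces as a list,
# so [0] is everything before the first '@'.
local_part = email.split("@")[0]

print(stripped_first)  # d
print(local_part)      # drew
```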

2 Answers

Use:

import pandas as pd

df = pd.DataFrame({'email':['abc@ABC.com']*1000})

s1 = df['email'].str.split('@').str[0]

s2 = pd.Series([i.split('@')[0] for i in df['email']], name='email')

s1.eq(s2).all()

Output

True
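
The equality check holds for clean data. One consideration other than performance is missing values: as a rough sketch, assuming a column that contains a `None` (the sample series here is made up), the `.str` accessor passes the missing value through, while a bare list comprehension raises unless it is guarded:

```python
import pandas as pd

# Hypothetical data: one valid address and one missing value.
s = pd.Series(["abc@ABC.com", None], name="email")

# The .str accessor propagates missing values instead of raising.
via_str = s.str.split("@").str[0]

# A bare list comprehension calls .split on every element, so it fails
# as soon as it reaches the missing value.
try:
    [i.split("@")[0] for i in s]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# Guarding with isinstance keeps the comprehension and passes
# non-string values through untouched.
guarded = pd.Series(
    [i.split("@")[0] if isinstance(i, str) else i for i in s],
    name="email",
)

print(via_str.isna().tolist())
print(comprehension_failed)
print(guarded.isna().tolist())
```

The guard costs a little speed, so whether the comprehension still comes out ahead on a given dataset is worth measuring rather than assuming.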

Timings:

%timeit s1 = df['email'].str.split('@').str[0]

1.77 ms ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit s2 = pd.Series([i.split('@')[0] for i in df['email']], name='email')

737 µs ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
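
The timings above use IPython's `%timeit` and do not cover the apply-with-lambda variant from the question. A sketch of how all three could be compared in a plain script with the standard `timeit` module follows; the function names are made up for illustration, and the numbers will vary with machine, pandas version, and data:

```python
import timeit

import pandas as pd

df = pd.DataFrame({"email": ["abc@ABC.com"] * 1000})


def with_str_accessor():
    # Split via the pandas .str accessor.
    return df["email"].str.split("@").str[0]


def with_list_comprehension():
    # Plain Python loop over the column values.
    return pd.Series([i.split("@")[0] for i in df["email"]], name="email")


def with_apply_lambda():
    # Series.apply calls the lambda once per element.
    return df["email"].apply(lambda x: x.split("@")[0])


runs = 100
for func in (with_str_accessor, with_list_comprehension, with_apply_lambda):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```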

Scott Boston

You can use assign, which is the method recommended by Marc Garcia in his talk toward pandas 1.0, because it lets you chain operations on the same dataframe (see the example between 6:17 and 7:30):

DF = DF.assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split("@", expand=True)[0])
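
To illustrate the chaining point, here is a minimal sketch. The sample addresses and the `PRINCIPAL_UPPER` column are hypothetical, added only to show a second chained step:

```python
import pandas as pd

# Hypothetical sample data.
DF = pd.DataFrame({"PRINCIPAL": ["alice@example.com", "bob@example.org"]})

result = (
    DF
    # Each lambda receives the dataframe as it stands at that point in the chain.
    .assign(PRINCIPAL=lambda x: x["PRINCIPAL"].str.split("@", expand=True)[0])
    # This step therefore sees the already-stripped PRINCIPAL column.
    .assign(PRINCIPAL_UPPER=lambda x: x["PRINCIPAL"].str.upper())
)

print(result)
```

`assign` returns a new dataframe, so `DF` itself is unchanged until the result is bound back to a name.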
ndclt