
I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When using %timeit, I'm finding that df.apply() performs faster than the vectorized version.

Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:


     id                  facility
0  3466                 abc~24353
1  4853         facility1~3.4.5.6
2  4582  53434_Facility~34432~cde
3  9972   facility2~FACILITY2~343
4  2356             Test~23 ~FAC1

The above dataframe has about 500,000 rows and I have also tested at around 1 million with similar results. Here is some example input and output:

Vectorization

In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Lambda Apply

In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
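
For reference, the column can be rebuilt at a comparable size from the sample values above; a minimal sketch (synthetic stand-in data, not the real 500,000-row set):

import numpy as np
import pandas as pd

# Synthetic stand-in: the five sample facility strings repeated out to
# ~500,000 rows, just to reproduce the scale of the benchmark.
samples = ['abc~24353', 'facility1~3.4.5.6', '53434_Facility~34432~cde',
           'facility2~FACILITY2~343', 'Test~23 ~FAC1']
df = pd.DataFrame({
    'id': np.random.randint(1000, 10000, size=500_000),
    'facility': np.random.choice(samples, size=500_000),
})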

Does anyone know why I am getting this behavior?
Thanks!

  • Why do you think `str.split` is vectorised? Vectorised for Pandas / Numpy usually means contiguous memory blocks. `df['facility']` is of type `object`, which is just a bunch of pointers. – jpp Jun 07 '18 at 14:59
  • I thought it was because of this website: https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html – ddx Jun 07 '18 at 15:05
  • I think the website is being very generous with the term "vectorised". – jpp Jun 07 '18 at 15:07
  • And the reason I was trying to improve this lambda function is because I ran into a previous lambda that used groupby and filter to filter out rows on a condition. When I changed this to remove these rows using Boolean Indexing I saw a major performance improvement. – ddx Jun 07 '18 at 15:07

1 Answer


Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string operations (especially regex operations) are inherently difficult (impossible?) to parallelize. If you really want speed, you actually should fall back to pure Python here.

%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]

411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
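
If you take the list-comprehension route in practice, the result assigns straight back to the column; a minimal sketch (assuming the column holds only strings, since a plain str.split will fail on NaN where the .str accessor quietly skips it):

# No missing values: a straight list comprehension is fine.
df['facility'] = [x.split('~', 1)[0] for x in df['facility'].tolist()]

# If NaN can appear, guard each element yourself (the .str accessor does this for you).
df['facility'] = [x.split('~', 1)[0] if isinstance(x, str) else x
                  for x in df['facility'].tolist()]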

For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?.

As for why apply is faster, I'm of the belief that the function that apply is applying (i.e., str.split) is a lot more lightweight than the string-splitting machinery in the bowels of Series.str.split.
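
One concrete piece of that extra machinery (an illustration, not a full accounting) is missing-value handling: Series.str.split has to check for and skip NaN on every element, while the bare Python split never does:

import pandas as pd

s = pd.Series(['abc~24353', None])

s.str.split('~').str[0]                  # 'abc', then NaN -- the missing value is passed through
[x.split('~')[0] for x in s.tolist()]    # AttributeError: 'NoneType' object has no attribute 'split'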
