3

I am practicing vectorization with Pandas, and I discovered a counter-intuitive case when using a chain of built-in vectorized methods is slower than applying a naive-Python function (to extract the first digit of all numbers in a Series):

import sys
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000))

def first_digit(x):
    return int(str(x)[0])

s.astype(np.str).str[0].astype(np.int) # 218ms "built-in"
s.apply(first_digit)                   # 104ms "apply"
s.map(first_digit)                     # 104ms "map"
np.vectorize(first_digit)(s)           #  78ms "vectorized"

All 4 implementations produce the same Pandas Series and I completely understand that the vectorized function call might be faster than the per-element apply/map.

However, I am puzzled why using the buil-in methods is slower... While I would be interested in an actual answer too, I am more interested in what is the smallest set of tools I have to learn to be able evaluate my hypothesis about the performance.

My hypothesis is that the chain of method calls is creating 2 extra inter-mediate Pandas Series, and the values of those Series are evaluated greedily, causing CPU cache misses (having to load the inter-mediate Series from RAM).

Following steps in that hypothesis I have no idea how to confirm or falsify:

  1. are the inter-mediate Series / numpy arrays evaluated greedily or lazily?
  2. would it cause CPU cache misses?
  3. what other explanations do I need to consider?

Screenshot of my measurements:

jupyter screenshot

Aprillion
  • 21,510
  • 5
  • 55
  • 89
  • 2
    All of the last three are just fancy wraps of `for` loops. The first line is not a single operations, and also likely non-vectorized (due to `str` acessor ). – Quang Hoang Aug 26 '20 at 19:30
  • 1
    `[first_digit(x) for x in s]` should be very similar in speed. – Quang Hoang Aug 26 '20 at 19:32
  • 1
    In addition to what Quang Hoang said, note that the "built-in" line is creating an additional intermediate array of strings with all the converted numbers, while the other methods only need to create the final array and fill it one element at a time. In any case, there are many cases where NumPy performance with string arrays is not too different from regular Python code. Also, not the point of the question I guess, but extracting the first digit would be faster using numerical operations instead of casting to string and back. – jdehesa Aug 26 '20 at 19:35
  • Pretty much everything is evaluated greedily in numpy/pandas – juanpa.arrivillaga Aug 26 '20 at 19:48
  • 2
    https://stackoverflow.com/a/54028200/4333359. I think that has a very good, albeit long, explanation. IMO, if you are doing some sort of string manipulation things get slow and `vectorization` doesn't take the same meaning it would for an array of numbers. So if possible you should try to use math. In this example. you can shave off a lot of time if you do some math like: `(s//(10**(np.log10(s).clip(lower=0)//1))).astype(int)`. Ofc that assumes you have only positive integers in the range [0, inf). – ALollz Aug 26 '20 at 19:51
  • @QuangHoang I didn't realize that not all methods are actually vectorized (sounds obvious in hindsight, but not before I thought about it) - so the answer linked by ALollz was very helpful – Aprillion Aug 26 '20 at 21:26
  • 1
    @ALollz thanks for the 10x faster suggestion – Aprillion Aug 26 '20 at 21:51
  • 2
    `s.to_numpy().astype('S1').astype(np.int)` improves time quite a bit (though the `np.vectorize` is still better). In `pandas`, strings are stored in object dtype of python strings. So both the `astype(str)` and `str[0]` operations are slow. My suggestion moves the action numpy and takes advantage it defined length string dtypes. Neither `pandas` nor `numpy` implement fast string code; both are heavily dependent on Python for that. – hpaulj Aug 26 '20 at 22:21

1 Answers1

4

To put it short, your question is whether

s.astype(np.str).str[0].astype(np.int)

fuses your operations together, then iterates over the series, or creates a temporary series for each operation, and how to verify this?

My hypothesis (and I guess yours) is that it is the latter. You have the right explanation there but how to test?

My suggestion is:

s1=s.astype(np.str)
s2=s1.str[0]
s3=s2.astype(np.int)

See how long each operation takes and how long the 3 operations take together. Most likely each operation will take about the same amount of time (the complexity of each operation is about the same) which would strongly indicate that our hypothesis is right. If the first two operations take no time, but last, pretty much all of the time, probably our hypothesis is wrong.

Tom
  • 749
  • 4
  • 16
  • thanks, a very nice suggestion. in fact, the `.astype(np.int)` part is an order of magnitude faster – Aprillion Aug 26 '20 at 21:10
  • however, the https://stackoverflow.com/a/54028200/1176601 linked in a comment provides a bigger improvement in my understanding (for point 3), so I marked my question as a duplicate – Aprillion Aug 26 '20 at 21:22