I have this pandas Series:
import pandas as pd
df = pd.Series(['01/January/2021', '09/Sept/2021', 'November/11/2022', '02/02/2016', '2021/01/August', '2022-Feb-02'])
To extract the three-letter groups I applied
df.str.upper().str.findall(r'[a-zA-Z]{3}')
This gives an output of
['JAN', 'UAR'], ['SEP'], ['NOV', 'EMB'], [], ['AUG', 'UST'], ['FEB']
Running %timeit:
In [22]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}')
447 µs ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To get the first group from each match list I had to do
df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0]
resulting in
['JAN', 'SEP', 'NOV', nan, 'AUG', 'FEB']
Again, running %timeit results in
In [23]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0]
710 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
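As a cross-check on this step, I believe str.extract with a single capture group returns only the first three-letter match per row (NaN where there is none), without building intermediate lists. A minimal sketch under that assumption, using the same df as above:
import pandas as pd
df = pd.Series(['01/January/2021', '09/Sept/2021', 'November/11/2022', '02/02/2016', '2021/01/August', '2022-Feb-02'])
# expand=False keeps the result as a Series instead of a one-column DataFrame
first_token = df.str.upper().str.extract(r'([A-Z]{3})', expand=False)
# first_token -> 'JAN', 'SEP', 'NOV', NaN, 'AUG', 'FEB'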
I also have a list of months
_calc = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
and my final one-liner is now
df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()
This gives False for this series, which is the expected result.
Running %timeit finally gives
In [25]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()
932 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Why am I doing this: I'm reading the dates from a CSV file, and over time I have observed that the date format is a big issue because it depends on the user's regional date settings, so determining which field is the month can be a problem. If the month is written in either %b or %B format, I can be sure that my code picks the right month and can then convert the column to datetime.
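To make that intent concrete, this is roughly the follow-up step I have in mind. It is only a sketch: month_num is a helper dictionary built from _calc here for illustration, not something shown above.
import pandas as pd
_calc = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
month_num = {abbr: i + 1 for i, abbr in enumerate(_calc)}  # {'JAN': 1, ..., 'DEC': 12}
df = pd.Series(['01/January/2021', '09/Sept/2021', 'November/11/2022', '02/02/2016', '2021/01/August', '2022-Feb-02'])
# map the validated month token to its month number; rows without a month name stay NaN
months = df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].map(month_num)
# months -> 1.0, 9.0, 11.0, NaN, 8.0, 2.0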
So where is the issue? If I run my one-liner against 1 million entries with %timeit, this is what I get
In [28]: %timeit qf.date.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()
1.6 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
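(qf is the DataFrame I build from the CSV and isn't shown above. For anyone who wants to reproduce a timing on a comparably sized column, a stand-in could be built like this; it's just a shape-alike, not my real data.)
import pandas as pd
dates = ['01/January/2021', '09/Sept/2021', 'November/11/2022', '02/02/2016', '2021/01/August', '2022-Feb-02']
qf = pd.DataFrame({'date': dates * 170_000})  # ~1.02 million rows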
How can I speed this up, and where do the slow-downs occur?