Speed up Pandas' findall date syntax analysis

Question

I have this series

df = pd.Series(['01/January/2021', '09/Sept/2021', 'November/11/2022', '02/02/2016', '2021/01/August', '2022-Feb-02'])

To extract 3 letters I applied

df.str.upper().str.findall(r'[a-zA-Z]{3}')

this gives a output of

['JAN', 'UAR'], ['SEP'], ['NOV', 'EMB'], [], ['AUG', 'UST'], ['FEB']

running %timeit

In [22]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}')                                                                                 
447 µs ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

to get the first grouping I had to do

df.str.upper().str.findall(r'[a-zA-Z]{3}.str[0]')

resulting into

['JAN', 'SEP', 'NOV', nan, 'AUG', 'FEB']

again running %timeit results into

In [23]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0]                                                                          
710 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also have a list of months

_calc = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

and my final one line code is now

df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()

this gives a False for this series, which is the expected result.

running %timeit finally gives

In [25]: %timeit df.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()                                                       
932 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why am I doing this: I'm reading the date from a csv file, over time I have observed that date format is a very big issue as this is dependent on the users region date format so determining which is month can be an issue, so if the month is either of format %b or %B that way I'm sure that my code picks the right month and can then convert this to datetime.

So where is the issue? If I run my one liner against 1 million entries and %timeit this is what I get

In [28]: %timeit qf.date.str.upper().str.findall(r'[a-zA-Z]{3}').str[0].isin(_calc).all()                                           
1.6 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I speed this up, and where do the slow-downs occur?

Why not convert to date time first and then use date time methods to extract month? — BallpointBen, Apr 09 '20 at 19:15
@BallpointBen The issue with that is this 09/02/2021 to a man in the US is 02/Sept/2021 and to a person in UK that is 09/Feb/2021, this is a major issue with converting to datetime I will not know what format the person used. it is after I have validated that you actually used Jan or January then I can parse the file properly — zadstofun, Apr 09 '20 at 19:29

score 0 · Answer 1 · edited Jul 22 '21 at 21:01

0

I tried your process in vanilla Python, using the "re" module and it was significantly faster; you probably should give it a go. I don't think Pandas string methods are optimized for speed, compared to numbers.

Also, this article by @cs95 might shed more light on string operations in Pandas.

import re
pattern = re.compile(r'[a-zA-Z]{3}')
pat = [pattern.search(entry) for entry in df.tolist() ]
pat = [entry.group().upper() if entry  else None for entry in pat]
res = set(pat)<=set(_calc)

If you are interested in just finding out if there is None in the search results, you could use the built in all function for that:

 pat = all(pattern.search(entry) for entry in df.tolist() )

You can reshape this with an if statement.

edited Jul 22 '21 at 21:01

halfer

19,824
17
99
186

answered Apr 09 '20 at 09:14

sammywemmy

27,093
4
17
31

my upvote is not made public but this was wooping faster. Just wanted to ask if I want an early exit how is this possible. in the pattern search I saw a None in the response if I want to exit at that point without having to go to the group().upper() section how can I do this? As the only time I will want to go to the is if there is no None in the first instance – zadstofun Apr 09 '20 at 15:54
added a line to the answer to cover filtering for None. wrapping the search with ```all``` should return False if there is None present – sammywemmy Apr 09 '20 at 23:29
there is a caveat to all and that is it will also return true if list is empty, that is not to say your suggestion is not an added knowledge for me. Unfortunately my ratings is not high enough to make my vote public. Your suggestion without the all() give me 50% improvement %timeit date_analysis(qf.expiry_date) 827 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) – zadstofun Apr 10 '20 at 10:32
Oh don’t worry about the upvotes. U r right about the list angle. As long as u got an improvement on ur result, all is good. Stack overflow exists for us to help each other. Cheers mate! – sammywemmy Apr 10 '20 at 10:42

Speed up Pandas' findall date syntax analysis

1 Answers1