I was going through the pandas documentation for `read_csv`, which says that when `parse_dates` is enabled, passing `infer_datetime_format=True` makes pandas attempt to infer the format of the datetime strings and, if it can be inferred, switch to a faster method of parsing them, which in some cases can increase parsing speed by ~5-10x.

I have a sample CSV data file:

Date
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943

Next I tried:

In [174]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"])
1.5 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [175]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"], infer_datetime_format=True)
1.73 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, according to the documentation, the second call should take less time, but it is actually slower. Is my understanding correct? Or for what kind of data does the statement hold?

Update: pandas version is '1.0.5'.

  • @ALollz Updated question with pandas version – bigbounty Jul 26 '20 at 15:47
  • 3
    Yeah, my feeling is that the "faster parsing" is now not really an issue: This is an old answer of mine: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime. You can see that one of the date formats used to take 14 second to parse, but using infer freq it would take only like ~300 ms (probably like pd.__version__ 0.23 or something). Now with 1.0.5 that format is parsed in a blazing 9ms. So there *might* be some weird format for which you'd still see that slow parsing automatically (in which case infer_datetieme_format` will save a HUGE amount of time) but not in this case – ALollz Jul 26 '20 at 15:50
  • 2
    For reference, the performance improvement was implemented in v 0.25.0: https://github.com/pandas-dev/pandas/pull/25922. Seems like basically everything that can be inferred now has a fast cython parsing that will be used instead of `dateutil.parser.parser`. But maybe there's some straggler – ALollz Jul 26 '20 at 16:13
  • 1
    This seems like a pretty small CSV, where the one-time-cost of inferring a format may take longer than the accumulated per-item saving. Have you tried parsing files of significant length? – MisterMiyagi Jul 28 '20 at 14:51
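Following up on MisterMiyagi's comment, here is a minimal sketch for benchmarking at scale. The file name big.csv and the row count are arbitrary assumptions; the dates are just the sample values repeated.

import pandas as pd

# Repeat the sample dates to build a file large enough that the
# per-row parsing cost dominates the one-time inference cost.
# (File name and row count are arbitrary choices.)
dates = ["22-01-1943", "15-10-1932", "23-11-1910", "04-05-2000"] * 250_000
pd.DataFrame({"Date": dates}).to_csv("big.csv", index=False)

Then time both calls in IPython:

%timeit pd.read_csv("big.csv", parse_dates=["Date"])
%timeit pd.read_csv("big.csv", parse_dates=["Date"], infer_datetime_format=True)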

1 Answer


What you actually want to do is add `dayfirst=True`:

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"],dayfirst = True, infer_datetime_format=True)
1.96 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Compared to

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"])
2.38 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"], infer_datetime_format=True)
3.02 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The solution is to reduce the number of guesses `read_csv` has to make about the date format.
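
If you know the format up front, you can remove the guessing entirely by parsing the column yourself with an explicit format string. A minimal sketch, reusing the same file path as above:

import pandas as pd

df = pd.read_csv("C:/Users/k_sego/Dates.csv")
# An explicit format string skips inference and the dateutil fallback:
# every value is parsed directly as day-month-year.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

With an explicit format there is nothing left to infer, so neither dayfirst nor infer_datetime_format matters.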