I was going through the pandas documentation for `read_csv`, which says that when `parse_dates` is enabled, passing `infer_datetime_format=True` makes pandas attempt to infer the format of the datetime strings and, if it can be inferred, switch to a faster method of parsing them, which in some cases can increase parsing speed by ~5-10x.

I have a sample CSV data file:

Date
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943

Next I tried:

In [174]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"])
1.5 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [175]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"], infer_datetime_format=True)
1.73 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, according to the documentation, the second call should take less time, but it is actually slower. Is my understanding correct? Or for what kind of data does the statement hold?

Update: pandas version is '1.0.5'.

  • @ALollz Updated question with pandas version – bigbounty Jul 26 '20 at 15:47
  • 3
    Yeah, my feeling is that the "faster parsing" is now not really an issue: This is an old answer of mine: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime. You can see that one of the date formats used to take 14 second to parse, but using infer freq it would take only like ~300 ms (probably like pd.__version__ 0.23 or something). Now with 1.0.5 that format is parsed in a blazing 9ms. So there *might* be some weird format for which you'd still see that slow parsing automatically (in which case infer_datetieme_format` will save a HUGE amount of time) but not in this case – ALollz Jul 26 '20 at 15:50
  • 2
    For reference, the performance improvement was implemented in v 0.25.0: https://github.com/pandas-dev/pandas/pull/25922. Seems like basically everything that can be inferred now has a fast cython parsing that will be used instead of `dateutil.parser.parser`. But maybe there's some straggler – ALollz Jul 26 '20 at 16:13
  • 1
    This seems like a pretty small CSV, where the one-time-cost of inferring a format may take longer than the accumulated per-item saving. Have you tried parsing files of significant length? – MisterMiyagi Jul 28 '20 at 14:51
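Following up on MisterMiyagi's comment, here is a minimal sketch for benchmarking at scale. The file name big.csv and the row count are arbitrary assumptions; the dates are just the sample values repeated.

import pandas as pd

# Repeat the sample dates to build a file large enough that the
# per-row parsing cost dominates the one-time inference cost.
# (File name and row count are arbitrary choices.)
dates = ["22-01-1943", "15-10-1932", "23-11-1910", "04-05-2000"] * 250_000
pd.DataFrame({"Date": dates}).to_csv("big.csv", index=False)

Then time both calls in IPython:

%timeit pd.read_csv("big.csv", parse_dates=["Date"])
%timeit pd.read_csv("big.csv", parse_dates=["Date"], infer_datetime_format=True)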

1 Answer


What you actually want to do is add `dayfirst=True`:

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"],dayfirst = True, infer_datetime_format=True)
1.96 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Compared to

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"])
2.38 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"], infer_datetime_format=True)
3.02 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The solution is to reduce the number of guesses `read_csv` has to make about the date format.
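
If you know the format up front, you can remove the guessing entirely by parsing the column yourself with an explicit format string. A minimal sketch, reusing the same file path as above:

import pandas as pd

df = pd.read_csv("C:/Users/k_sego/Dates.csv")
# An explicit format string skips inference and the dateutil fallback:
# every value is parsed directly as day-month-year.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

With an explicit format there is nothing left to infer, so neither dayfirst nor infer_datetime_format matters.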