
I am using a Jupyter notebook and reading a .csv file with pandas `read_csv`. With the following code it takes a really long time (more than 10 minutes). The dataset has 70821 entries.

    [Input]
    df = pd.read_csv("file.csv", parse_dates=[0])
    df.head()
    
    [Output]
    created_at  field1  field2
    2022-09-16T22:53:19+07:00   100.0   NaN
    2022-09-16T22:54:46+07:00   100.0   NaN
    2022-09-16T22:56:14+07:00   100.0   NaN
    2022-09-16T23:02:01+07:00   100.0   NaN
    2022-09-16T23:03:28+07:00   100.0   NaN

If I just use `parse_dates=True`, it does not detect the date and keeps the column as object.

If I read the dataset without `parse_dates`, it is much faster, as I would expect (~1 second). When I then use a separate line of code to parse the date, like

df["created_at"]=pd.to_datetime(df['created_at'])

it is faster than using `parse_dates` in `read_csv`, but it still takes a couple of minutes (around 3 minutes).

Using the following

df["created_at"]=pd.to_datetime(df['created_at'], format="%Y-%m-%d %H:%M:%S%z")

or with the T in the format string

df["created_at"]=pd.to_datetime(df['created_at'], format="%Y-%m-%dT%H:%M:%S%z")

or

df["created_at"]=pd.to_datetime(df['created_at'], infer_datetime_format=True)

does not increase the speed (still around 3 minutes).
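
For reference, a minimal sketch of how these variants can be timed in the notebook, using the standard IPython `%time` magic:

    import pandas as pd

    df = pd.read_csv("file.csv")  # fast (~1 second)

    # time each conversion variant separately
    %time pd.to_datetime(df["created_at"])
    %time pd.to_datetime(df["created_at"], format="%Y-%m-%dT%H:%M:%S%z")
    %time pd.to_datetime(df["created_at"], infer_datetime_format=True)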

So my questions are the following:

  1. Why is parsing the date directly in `read_csv` the slowest way?
  2. Why is the separate `pd.to_datetime` call faster?
  3. Why does something like `format="%Y-%m-%d %H:%M:%S%z"` or `infer_datetime_format=True` not speed up the process?
  4. And finally, how do I do this in a better way? If I read the file once and parse it to datetime, what would be the best way to write it back to a csv file so I don't have to go through this process all over again? I assume I have to write a function that manually changes every entry to a better-formatted date?

Can someone help me figure out why these different approaches take such different amounts of time and how to speed things up, e.g. with one of the things I tried in 3.? Thanks a lot.
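
For question 4, something like the following is what I have in mind, though I am not sure it is the right approach (file names are just placeholders; the parquet option would need pyarrow or fastparquet installed):

    import pandas as pd

    # parse once (slow), normalizing to UTC so every entry gets the same offset
    df = pd.read_csv("file.csv")
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

    # write back to csv; tz-aware datetimes are written in ISO format,
    # e.g. 2022-09-16 15:53:19+00:00
    df.to_csv("file_utc.csv", index=False)

    # alternatively, a binary format keeps the datetime dtype,
    # so nothing needs to be re-parsed on the next read
    df.to_parquet("file.parquet")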


EDIT:

I have now tried manually adjusting the date format to see where it causes trouble. It turns out that when I delete the +07:00 part of the date string, parsing is fast again (~500 ms).

Under the following link I uploaded two csv files. example1 is the file with the problematic datetime format. In example2_no_timezone I deleted the +07:00 part from every entry, which makes the parsing fast again (expected behaviour).

Folder with two example datasets
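
To reproduce the difference, something like this should be enough (file names are my best guess from the folder; I assume the column is called created_at in both files):

    import pandas as pd

    # original file, every timestamp ends in +07:00 -> slow (~3 minutes here)
    %time pd.to_datetime(pd.read_csv("example1.csv")["created_at"])

    # same data with the +07:00 part removed -> fast (~500 ms)
    %time pd.to_datetime(pd.read_csv("example2_no_timezone.csv")["created_at"])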

Sadly, the questions above remain:

  1. Why is pandas not able to parse the original date string in a reasonable time?
  2. Why is to_datetime faster (but still too slow with the original dataset)?
  3. How do I fix this without changing the format in the original dataset (e.g., by means of to_datetime with format=...)?
  • What's the datetime format in your input .csv? – FObersteiner Jan 19 '23 at 11:39
  • I read a file with 217,200 records (34 columns, 126Mb) in 0.69s with the same code (my first col is a datetime). – Corralien Jan 19 '23 at 11:43
  • @FObersteiner the format is in the first output `2022-09-16T22:53:19+07:00`. I tried to speed this up by matching the format string `format="%Y-%m-%d %H:%M:%S%z"`, but without success. – Mrob Jan 19 '23 at 12:05
  • @Corralien Yes I also think my times are really odd/high. My laptop is not the fastest but definitely not THAT slow – Mrob Jan 19 '23 at 12:06
  • your format is missing a T; `format="%Y-%m-%dT%H:%M:%S%z"`. Also, it might be beneficial to convert to UTC by setting keyword `utc=True`. If you need the datetime to be in a certain time zone, set it afterwards using `.dt.tz_convert("desired-tz-name")` – FObersteiner Jan 19 '23 at 12:44
  • @FObersteiner thanks for the suggestion. I just tried `df["created_at"]=pd.to_datetime(df['created_at'], format="%Y-%m-%dT%H:%M:%S%z")` but CPU time was again 3:14 minutes – Mrob Jan 19 '23 at 15:22
  • That's definitely too long for a < 100k lines csv. Anything else between you and the file maybe, loading it from a server? Did you try with a simple Python script, I mean without the Jupyter NB, just a plain script which you call from a terminal? – FObersteiner Jan 19 '23 at 15:33
  • @FObersteiner The file is located in the same folder. And I also just tried it as a plain script, but it has the same outcome (maybe a few seconds faster, but still around 3 minutes). I will try tomorrow to change the date format manually and see if that helps. So far I have never experienced anything similar on the same machine with different dates. – Mrob Jan 19 '23 at 16:01
  • Please try to make a [mre], or as close as possible. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). Maybe try moving the code and data to a different machine and seeing how it runs. – wjandrea Jan 20 '23 at 01:18
  • @wjandrea Thank you. I edited the post and now provide two datasets that can be tested with the code snippets provided. – Mrob Jan 20 '23 at 05:06
