I am using a Jupyter notebook and reading a .csv file with pandas read_csv. With the following code it takes a really long time (more than 10 minutes). The dataset has 70821 entries.
[Input]
df = pd.read_csv("file.csv", parse_dates=[0])
df.head()
[Output]
created_at                 field1  field2
2022-09-16T22:53:19+07:00   100.0     NaN
2022-09-16T22:54:46+07:00   100.0     NaN
2022-09-16T22:56:14+07:00   100.0     NaN
2022-09-16T23:02:01+07:00   100.0     NaN
2022-09-16T23:03:28+07:00   100.0     NaN
If I instead use parse_dates=True, it does not detect the date at all and keeps the column as object.
If I read the dataset without parse_dates, it goes much faster, as I would expect (~1 second).
When I then parse the date in a separate line of code, like
df["created_at"] = pd.to_datetime(df['created_at'])
it is faster than parse_dates in read_csv, but still takes a couple of minutes (around 3 minutes).
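For reference, here is the comparison I am making, sketched on synthetic data so it is self-contained (the row count and the generated timestamps are illustrative stand-ins for my real file, and timings will differ per machine):

```python
import time
from io import StringIO

import pandas as pd

# Synthetic stand-in for my CSV: same +07:00 offset format in the first column
n = 20_000
rows = "\n".join(
    f"2022-09-16T22:{i % 60:02d}:{i % 60:02d}+07:00,100.0," for i in range(n)
)
csv_text = "created_at,field1,field2\n" + rows

# Approach 1: parse dates inside read_csv
t0 = time.perf_counter()
df = pd.read_csv(StringIO(csv_text), parse_dates=[0])
t1 = time.perf_counter()

# Approach 2: read as strings first, parse in a separate step
df2 = pd.read_csv(StringIO(csv_text))
t2 = time.perf_counter()
df2["created_at"] = pd.to_datetime(df2["created_at"])
t3 = time.perf_counter()

print(f"parse_dates in read_csv: {t1 - t0:.3f}s")
print(f"separate to_datetime:    {t3 - t2:.3f}s")
```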
Using the following
df["created_at"] = pd.to_datetime(df['created_at'], format="%Y-%m-%d %H:%M:%S%z")
or, with the T in the format string,
df["created_at"] = pd.to_datetime(df['created_at'], format="%Y-%m-%dT%H:%M:%S%z")
or
df["created_at"] = pd.to_datetime(df['created_at'], infer_datetime_format=True)
does not increase the speed (still around 3 minutes).
So my questions are the following:
1. Why is parsing the date directly with read_csv the slowest way?
2. Why is the way with pd.to_datetime faster?
3. Why does something like format="%Y-%m-%d %H:%M:%S%z" or infer_datetime_format=True not speed up the process?
4. And finally, how do I do this in a better way? If I read the file once and parsed it to datetime, what would be the best way to write it back to a csv file so I don't have to go through this process all over again? I assume I have to write a function and manually change every entry to a better formatted date?
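For the last part, this is the kind of round-trip I have in mind (just a sketch of what I would try; normalising to UTC and dropping the offset before writing is my own assumption, not something I know to be the right approach):

```python
from io import StringIO

import pandas as pd

# A tiny parsed frame standing in for my data (illustrative values)
df = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2022-09-16T22:53:19+07:00", "2022-09-16T22:54:46+07:00"]
    ),
    "field1": [100.0, 100.0],
})

# Idea: convert to UTC and drop the offset before writing, so the
# stored strings no longer contain "+07:00"
df["created_at"] = df["created_at"].dt.tz_convert("UTC").dt.tz_localize(None)

buf = StringIO()
df.to_csv(buf, index=False, date_format="%Y-%m-%d %H:%M:%S")
buf.seek(0)

# Re-reading should now hit the fast, offset-free parsing path
df_back = pd.read_csv(buf, parse_dates=["created_at"])
print(df_back.dtypes)
```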
Can someone help me figure out why these different approaches take such different amounts of time, and how to speed it up, e.g. with something like what I tried in 3.? Thanks a lot.
EDIT:
I have now tried manually adjusting the date format to see where it causes trouble. It turns out that when I delete the +07:00 in the date string, parsing is fast (~500 ms).
Under the following link I uploaded two csv files. example1 is the file with the problematic datetime format. In example2_no_timezone I deleted the +07:00 part from every entry, which makes parsing fast again (expected behaviour).
Folder with two example datasets
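A minimal version of this experiment on synthetic strings (so it is reproducible without the files; timings are only indicative):

```python
import time

import pandas as pd

# Same timestamp strings with and without the +07:00 offset
n = 20_000
with_tz = pd.Series(
    [f"2022-09-16T22:{i % 60:02d}:{i % 60:02d}+07:00" for i in range(n)]
)
without_tz = with_tz.str.replace("+07:00", "", regex=False)

t0 = time.perf_counter()
parsed_tz = pd.to_datetime(with_tz)
t1 = time.perf_counter()
parsed_naive = pd.to_datetime(without_tz)
t2 = time.perf_counter()

print(f"with +07:00:    {t1 - t0:.3f}s")
print(f"without offset: {t2 - t1:.3f}s")
```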
Sadly, the questions above remain:
- Why is pandas not able to read the original date string in a reasonable amount of time?
- Why is to_datetime faster (but still too slow with the original dataset)?
- How do I fix this without changing the format in the original dataset (e.g., by means of to_datetime and providing format=...)?