
I have a CSV file with 58K rows of time-series data. The first column holds the timestamps, and its values are all unique.

When I do pd.read_csv("data.csv"), it takes less than a second. But when I do pd.read_csv("data.csv", parse_dates=[0]), it takes more than 30 seconds. Is this the expected performance when parsing timestamps in a CSV?

I tried all the solutions in Pandas: slow date conversion, and none of them improved it.

Is there a way to improve the performance? What if I tell pandas the datetime format?

Here's a working repl to play with: https://repl.it/@eparizzi/Pandas-Testing-1

empz
  • What's the dtype when it loads, string or int? And how long does it take to convert the 58k rows into datetime? My guess is that pandas is trying to guess the format; specify it explicitly with `format='%Y-%m-%d'` or whatever your format is. – Umar.H Mar 12 '20 at 17:25
  • Yes. `Note: A fast-path exists for iso8601-formatted dates.` Otherwise it uses a very slow path that attempts to infer the format. You're better off reading the column in as strings and then supplying the specific `format=` as an argument to `pd.to_datetime`. This shows just how slow it can be when it tries to infer the format: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime/52481193#52481193 – ALollz Mar 12 '20 at 17:27
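
To make the suggestion in these comments concrete, here is a minimal sketch of reading the column as plain strings and then converting it in one vectorized call with an explicit format (the `%Y-%m-%d %H:%M:%S` format is a placeholder; substitute your actual timestamp layout):

import pandas as pd

# Read without parse_dates; the timestamp column comes in as plain strings.
df = pd.read_csv("data.csv")

# Convert with an explicit format so pandas never has to infer it per row.
# The format string is a placeholder -- replace it with your own.
col = df.columns[0]
df[col] = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M:%S")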

1 Answer


The pyarrow parser engine did the trick for me:

import pandas as pd

# Let pyarrow parse the file, timestamps included; Arrow's CSV reader handles
# ISO 8601 dates natively instead of falling back to pandas' slow format inference.
data = pd.read_csv("data.csv", parse_dates=[0], engine="pyarrow")

My dates are in ISO 8601 format, such as:

2023-02-22T15:41:59.0478497+01:00

You will need to install pyarrow first.
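
If you cannot use the pyarrow engine (it requires pandas 1.4 or newer, as far as I know), a fallback sketch with the default engine is to hand read_csv the exact format via `date_format=` (added in pandas 2.0; on older versions, read the column as strings and pass the same `format=` to `pd.to_datetime`). The format below is written to match the example timestamp above:

import pandas as pd

# Supplying the exact layout lets pandas skip its slow format inference.
# pandas' %f accepts the seven fractional digits and %z the "+01:00" offset.
data = pd.read_csv(
    "data.csv",
    parse_dates=[0],
    date_format="%Y-%m-%dT%H:%M:%S.%f%z",
)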

angelo-peronio