30

I'm reading a huge CSV with a date field in the format YYYYMMDD and I'm using the following lambda to convert it when reading:

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t), format='%Y%m%d',
                                                      errors='coerce'))

This function is very slow, though.

Any suggestions to improve it?

asked by ppaulojr, edited by EdChum

8 Answers

90

Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25, per the new argument cache_dates, which defaults to True.

Try using this function for parsing dates:

import pandas as pd

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    # Parse each unique date string exactly once...
    dates = {date: pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    # ...then map every row through the resulting dict.
    return date_pd_series.map(dates)

Use it like:

df['date-column'] = lookup(df['date-column'], format='%Y%m%d')

Benchmarks:

$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms

Source: https://github.com/sanand0/benchmarks/tree/master/date-parse

answered by fixxxer, edited by mathetes
  • There's a problem with NaN: `return s.apply(lambda v: dates[v]) KeyError: nan` – ppaulojr Apr 29 '15 at 18:51
  • Weird. I couldn't reproduce it. Can we see the series that you are passing to the function? Also, what is your pandas version? – fixxxer Apr 30 '15 at 02:58
  • 3
    Wow: this is super! I have a 1M-row test file. It takes 1 second to read it (no date parsing) but 1m20s if I turn on parse_dates. Your lookup() solution adds only 0.4 seconds to the initial 1-sec read_csv(). – jdmarino Aug 10 '15 at 19:20
  • 1
    @fixxxer this is incredible... clever use of apply! What a great principle... it can be applied as a function for any kind of dict lookup. – JDE876 Oct 09 '15 at 20:16
  • Actually, the lookup2 function that uses .map instead of .apply is way faster. Here's the code - https://github.com/sanand0/benchmarks/blob/master/date-parse/date-parse.py – fixxxer Oct 10 '15 at 18:52
  • @fixxxer this is incredibly fast, but I am also getting KeyError: nan. Here is a series with a np.nan value. – JDE876 Oct 12 '15 at 18:13
  • 2
    The benchmark results above are based on parsing 100,000 identical dates. I haven't run into data like this before. I've re-run the same test using 100,000 random dates and now find that "manual" is much faster than "lookup". Unsurprisingly as I introduce more repeated dates, "lookup" performs better. On my system it surpasses "manual" at roughly 125 repetitions per date. – Ben Graham Oct 19 '18 at 02:29
  • I have a csv with ~2.5 M rows. There is one date column among those, which I choose to read. Reading the CSV without any specific option takes ~9s. If I read with parse_dates on that column, it takes 2m20s. With this lookup function (exactly as above), it takes (much) longer, about 3m35s. It seems this answer doesn't work for all data, at least not in my case. – physiker Jan 09 '19 at 11:56
  • 1
    I also supplied the optimal `format=` argument in your `pd.to_datetime` line. Anyway, this piece of code sped it up like crazy. I had a 13 GB csv and each date was repeated about 340,000 times. I was previously including `cache=True` in `pd.to_datetime`, thinking that its performance would be similar, but this is wildly faster. – Corey Levinson Feb 26 '19 at 22:48
  • I am somewhat surprised this is the accepted answer because without specifying a format for the pandas datetime converter there is *no guarantee this function will convert the dates consistently*. I have personally run into this issue when testing this snippet. The edit to make this correct is simple but a crucial one. I would NOT trust this to run without specifying a datetime format. – sfortney Mar 27 '20 at 20:57
  • Extremely useful; saved me time. – Bharath Ram May 21 '20 at 05:57
14

Great suggestion from @EdChum! As he suggests, using infer_datetime_format=True can be significantly faster. Below is my example.

I have a file of temperature data from a sensor log, which looks like this:

RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...

My code reads the csv and parses the date (parse_dates=['Date']). With infer_datetime_format=False, it takes 8min 8sec:

Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.

With infer_datetime_format=True, it takes 13sec:

Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
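A minimal sketch of the call being timed (the file name here is hypothetical; note that infer_datetime_format was later deprecated in pandas 2.0, which infers a consistent format by default):

import pandas as pd

# 'temperature_log.csv' stands in for the sensor log file above.
df = pd.read_csv('temperature_log.csv',
                 parse_dates=['Date'],
                 infer_datetime_format=True)  # set False to reproduce the slow run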
answered by Sam Davey
7

Unless you're stuck with a very old version of pandas (pre 0.25), this answer is not for you.

The functionality described here has been merged into pandas in version 0.25:

Streamlined date parsing with caching

Reading all the data and then converting it will always be slower than converting it while reading the CSV: if you do it right away, you won't need to iterate over all the data twice. You also don't have to store it as strings in memory.

We can define our own date parser that utilizes a cache for the dates it has already seen.

import pandas as pd

cache = {}

def cached_date_parser(s):
    # Return the cached result if this date string has been seen before.
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)

This has the same advantage as @fixxxer's answer of parsing each string only once, with the added bonus of not having to read all the data and THEN parse it, saving you both memory and processing time.

answered by firelynx
  • I got error `to_datetime() got an unexpected keyword argument 'coerce'`. When I tried to pass the argument `errors='coerce'`, I got `NaT` instead. – Anastasiya-Romanova 秀 Aug 23 '19 at 03:12
  • Thank you for sharing this, but as @ritchie46 indicated, since pandas 0.25 the caching is automatically done by default through `cache_dates=boolean`: https://stackoverflow.com/a/59682653/2906290 – Raúl Moreno Oct 10 '22 at 16:54
5

Since pandas version 0.25, the function pandas.read_csv accepts a cache_dates keyword argument (a boolean, which defaults to True). So there is no need to write your own function for caching, as done in the accepted answer.
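A minimal sketch, reusing the call from the question (passing cache_dates=True is redundant since it is the default; it is spelled out here only for clarity):

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 cache_dates=True)  # the default since pandas 0.25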

answered by ritchie46
4

There's no need to specify a date_parser; pandas is able to parse this without any trouble, and it will be much faster:

In [21]:

import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date    3 non-null datetime64[ns]
val     3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:

df
Out[22]:
        date    val
0 2012-06-08  12321
1 2013-06-08  12321
2 2014-03-08  12321
answered by EdChum
3

Try the standard library:

import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")

However, I don't really know if this is much faster than pandas.

Since your format is so simple, what about

def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))

EDIT: you say you need to take care of invalid data.

def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except (ValueError, IndexError):
        return default_datetime  # you should define that somewhere else
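A sketch of wiring this into the original read_csv call (filen and the positional date column are taken from the question):

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=parse)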

All in all, I'm a bit conflicted about the validity of your problem:

  • you need to be fast, but still you get your data from a CSV
  • you need to be fast, but still need to deal with invalid data

That's kind of contradictory; my personal approach here would be to assume that your "huge" CSV just needs to be brought into a better-performing format once, and you either shouldn't care about the speed of that conversion process (because it only happens once) or you should get whatever produces the CSV to give you better data -- there are so many formats that don't rely on string parsing.
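A sketch of that one-time conversion, assuming Parquet as the target format (feather, HDF5 or pickle would also work; to_parquet requires pyarrow or fastparquet to be installed):

import pandas as pd

# One-time, slow conversion: parse the dates once and store real datetimes.
df = pd.read_csv(filen, index_col=None, header=None, parse_dates=[0])
df.columns = df.columns.astype(str)  # parquet requires string column names
df.to_parquet('data.parquet')

# Every later load is fast, with no string parsing at all.
df = pd.read_parquet('data.parquet')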

answered by Marcus Müller
1

If your datetime column has a UTC timestamp and you only need part of it, convert it to a string, slice what you need, and then apply the below for much faster access.

created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC

df["date"]=  df["created_at"].apply(lambda x: str(x)[:10])


df["date"] = pd.to_datetime(df["date"])
1

I have a csv with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:

  1. read the file row by row using Python3.7 native csv.reader
  2. convert all 4 numeric columns using float() and
  3. parse the date column with datetime.datetime.fromisoformat()

and behold:

  4. finally convert the list to a DataFrame (!)

It baffles me how this can be faster than the native pandas pd.read_csv(...). Can someone explain?
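A minimal sketch of that recipe, assuming a hypothetical layout of one ISO-format date column followed by four numeric columns (the file name and column names are made up):

import csv
import datetime

import pandas as pd

rows = []
with open('data.csv', newline='') as f:  # hypothetical file name
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for rec in reader:
        rows.append((
            datetime.datetime.fromisoformat(rec[0]),  # parse the date column
            float(rec[1]), float(rec[2]),             # convert the four
            float(rec[3]), float(rec[4]),             # numeric columns
        ))

# finally, convert the list to a DataFrame
df = pd.DataFrame(rows, columns=['date', 'a', 'b', 'c', 'd'])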

answered by arod