237

Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

In particular integer, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to "learn" pandas to recognized dates?

Serge Stroobandt
  • 28,495
  • 9
  • 107
  • 102
Roman
  • 124,451
  • 167
  • 349
  • 456

13 Answers13

473

You should add parse_dates=True, or parse_dates=['column name'] when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column 'datetime' with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column, this merges a 'date' and a 'time' column into a single 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.

Rutger Kassies
  • 61,630
  • 17
  • 112
  • 97
  • 13
    Did not work for me, I got the following error: `TypeError: strptime() argument 1 must be str, not float` – Jean Paul May 04 '17 at 12:36
  • 7
    I got this error because there were nan in my data frame. – Jean Paul May 04 '17 at 13:10
  • can you add an item that also NaTs the non-parsable material or NaN or /Ns. cause it seems this parser totally skips the whole column if anything like that is present – Amir Oct 17 '17 at 20:31
  • 3
    There's an option `infer_datetime_format`: "pandas will attempt to infer the format of the datetime strings in the columns". This can be used instead of `date_parser`. – Winand Oct 12 '18 at 11:09
  • 1
    Note that if your dates are in `ISO 8601` format you should not pass `infer_datetime_format` or a parser function - it's much slower than letting pandas handle it (especially the latter). The dateformat in this answer falls into this category also – Mr_and_Mrs_D Dec 04 '18 at 00:25
  • 1
    Is a lambda function necessary here? I think we can just directly apply the function to the date column. – lstodd Feb 12 '19 at 13:45
  • As mentioned in the post, it's necessary if you want to deal with a specific date format. Not specifying anything makes assumptions about the format you provide. – Rutger Kassies Feb 13 '19 at 09:55
  • I am using the same format for parsing my date column (only difference is that I use ''%Y-%m-%d'). It does the parsing and reads the csv into dataframe. But still my column is of type string instead of datetime. What am I missing? \ – BobbyF Feb 19 '19 at 22:39
  • I stopped the environment and restarted and it worked!! – BobbyF Feb 19 '19 at 22:55
  • 1
    Adding `parse_dates=True` doesn't work, where 'date' is the column and the dates are stored as `YYYY-MM-DD`. – ifly6 Jun 05 '19 at 20:03
  • Getting this error: dateparse takes 1 positional argument but 2 were given, I am trying to do it for multiple columns. – royalyadnesh Jan 20 '20 at 09:05
  • @royalyadnesh You need something like dateparse = lambda x,y : ... – Septacle May 06 '20 at 01:17
  • 2
    `pd.datetime` is currently deprecated, replace `pd.datetime` with just `datetime` after `import datetime from datetime`. – Ébe Isaac Sep 24 '20 at 08:51
  • Maybe add a remark on `dtype`, the type you need to specify for `read_csv` argument is the input type and not the output type. So its `datetime={'datetime': 'string'}` and not `datetime={'datetime': 'datetime64'}`. If you do that the output type will be correct (`datetime64[ns]`). – cglacet Sep 29 '20 at 09:26
  • Why need to use a function instead of date_parser=datetime.strptime(x, '%Y-%m-%d %H:%M:%S') directly? – saga Nov 22 '20 at 23:10
  • Can I use multiple date format check with or condition? – Raman Joshi Aug 19 '21 at 11:44
  • What if I don't know the column name when uploading csv file. I want some generic solution to auto detect date column when uploading csv file with some pre-defined date formats. – Raman Joshi Aug 19 '21 at 11:54
  • It works fine, thanks. Note : `dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')`. Lambdas are anonymous functions. If you name an anonymous function, you might as well just use a function. So either pass a lambda directly as argument to `read_csv`, or define a `dateparse` function the usual way. – Eric Duminil Feb 01 '22 at 21:30
  • 1
    If you get value error due to format like `ValueError: time data '8-Jan-97' does not match format '%d-%mm-%yy'` use format like: `'%d-%b-%y'` – Rajesh Swarnkar Sep 17 '22 at 06:06
  • The response is incomplete, index _col must be specified: `dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y') res = pd.read_csv(fn, index_col=['Date'], parse_dates=['Date'], date_parser=dateparse)` – Bogdan Jun 27 '23 at 16:45
32

Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

from datetime import datetime
import pandas as pd

dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
    
df = pd.read_csv('test.dat', parse_dates=['datetime'], date_parser=dateparse)

Since the original question asker said he wants dates and the dates are in 2013-6-4 format, the dateparse function should really be:

dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d').date() for d in dates]
Sean
  • 4,365
  • 1
  • 27
  • 31
22

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object
Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
14

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
DaFois
  • 2,197
  • 8
  • 26
  • 43
IamTheWalrus
  • 594
  • 5
  • 14
  • 1
    I'm using pandas 0.22 and agree that the accepted answer no longer works. – Dai May 06 '18 at 23:31
  • 1
    This creates a "TypeError: can only concatenate str (not "float") to str" for me. Date column is d/m/y and time column is H:M:00 – IceQueeny Oct 27 '18 at 11:56
10

pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

you can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

The default sensing of dates works great, but it seems to be biased towards north american Date formats. If you live elsewhere you might occasionally be caught by the results. As far as I can remember 1/6/2000 means 6 January in the USA as opposed to 1 Jun where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. Probably safer to stay with YYYYMMDD variations of date though. Apologies to pandas developers,here but i have not tested it with local dates recently.

you can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
Joop
  • 7,840
  • 9
  • 43
  • 58
  • 2
    You can specify `dayfirst` as True for European/international dates. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – Will Gordon Nov 15 '17 at 10:16
9

Yes - according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates looks like 2013-01-01T01:01 for example, running this will make pandas (I'm on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates, it doesn't work without.

Verify with:

df.dtypes

You should see the datatype of the column is datetime64[ns]

Gaurav
  • 1,095
  • 10
  • 19
  • I think you misunderstand the question. The user is curious whether the option could be enabled for his format of string. – Arya McCarthy Apr 10 '17 at 02:51
  • @AryaMcCarthy umm, he basically wants the date to be recognized correctly, so I am mentioning how can he transform the source data so that it is naturally recognized by pandas. Nowhere does he mention he cannot change the format of the source data. – Gaurav Sep 02 '17 at 00:14
4

While loading csv file contain date column.We have two approach to to make pandas to recognize date column i.e

  1. Pandas explicit recognize the format by arg date_parser=mydateparser

  2. Pandas implicit recognize the format by agr infer_datetime_format=True

Some of the date column data

01/01/18

01/02/18

Here we don't know the first two things It may be month or day. So in this case we have to use Method 1:- Explicit pass the format

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
date_parser=mydateparser)

Method 2:- Implicit or Automatically recognize the format

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)
kamran kausar
  • 4,117
  • 1
  • 23
  • 17
4

Pandas 2.0 update

As part of a broad attempt to make datetime handling in Pandas 2.0 more reliable, date parsing (especially on CSV files) has seen a number of backwards-incompatible changes and deprecations. The infer_datetime_format and date_parser parameters reccomended by many other answers are now both deprecated.

The proper way of parsing dates of known format now is now to use the parse_dates and date_format parameters of pd.read_csv().

Single column with auto-detected format

df = pd.read_csv(
    infile,
    parse_dates=['My DateTime']
)

Single column with known format

df = pd.read_csv(
    infile,
    parse_dates=['My DateTime'],
    date_format={'My DateTime': '%Y-%m-%d %H:%M:%S'}
)

Merged columns with known format

df = pd.read_csv(
    infile,
    parse_dates={'mydatetime': ['My Date', 'My Time']},
    # mydatetime will contain my_date and my_time separated by a single space
    date_format={'mydatetime': '%Y-%m-%d %H:%M:%S'}
)

If your date parsing logic is more complex than what can be done with static format strings, such as if you have multiple date formats on the same column, or were making advanced use of date_parser, you are encouraged by the official documentation to leave your column as the default object type while reading the file and to do the date conversion in a second pass with pd.to_datetime, as shown below.

Pandas v1 answer

In addition to what the other replies said, if you have to parse very large files with hundreds of thousands of timestamps, date_parser can prove to be a huge performance bottleneck, as it's a Python function called once per row. You can get a sizeable performance improvements by instead keeping the dates as text while parsing the CSV file and then converting the entire column into dates in one go:

# For a data column
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']})

df['mydatetime'] = pd.to_datetime(df['mydatetime'], exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a DateTimeIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col='mydatetime')

df.index = pd.to_datetime(df.index, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])

idx_mydatetime = df.index.get_level_values(0)
idx_num = df.index.get_level_values(1)
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])

For my use case on a file with 200k rows (one timestamp per row), that cut down processing time from about a minute to less than a second.

VLRoyrenn
  • 596
  • 5
  • 11
  • Have you tried the ```infer_datetime_format``` parameter for read_csv. It infers the dates which is fine if your dates are consistently formatted. It speeds up the process. – Cam Aug 30 '21 at 21:49
  • I would have, but my dates were in a weird format with the month first and the milliseconds part being separated by a third colon instead of a dot. At any rate, I figured it would still be useful to have this method written somewhere, since most sources either use inference or date_parser, with no fast alternative to date_format being given. IIRC the doc is also vague on how multiple date columns are combined when passed as an array to parse_dates (they get separated by spaces). – VLRoyrenn Aug 31 '21 at 20:07
3
  • Read the existing string columns in date and time format respectively


pd.read_csv('CGMData.csv', parse_dates=['Date', 'Time'])

Resulted Columns

enter image description here



  • Concat string columns of date and time and add new column of datetype object - Remove Original columns

  1. if want to rename the new column name then pass as dictionary as show in below example and the new column name will be the key name,
  2. if pass as list of column, new column name will be concate of column name passed in the list separated by _ e.g Date_Time

    # parse_dates={'given_name': ['Date', 'Time']}    
    pd.read_csv("InsulinData.csv",low_memory=False, 
                                 parse_dates=[['Date', 'Time']])

    pd.read_csv("InsulinData.csv",low_memory=False, 
                                 parse_dates={'date_time': ['Date', 'Time']})

  • Concat string columns of date and time and add new column of datetype object and Keep the Original columns


pd.read_csv("InsulinData.csv",low_memory=False, 
          parse_dates=[['Date', 'Time']], keep_date_col=True)

  • Want to change the format of date and time when read from csv


parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date', 'time'])

Arpan Saini
  • 4,623
  • 1
  • 42
  • 50
1

If performance matters to you make sure you time:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date, I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which does not make a difference with more common ones either apparently) and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with not so standard day formats. Be sure to time before you optimize, as usual.

Mr_and_Mrs_D
  • 32,208
  • 39
  • 178
  • 361
1

You can use the parameter date_parser with a function for converting a sequence of string columns to an array of datetime instances:

parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date_col1', 'date_col2'])
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
1

Yes, this code works like breeze. Here index 0 refers to the index of the date column.

df = pd.read_csv(filepath, parse_dates=[0], infer_datetime_format = True)
0

No, there is no way in pandas to automatically recognize date columns.

Pandas does a poor job at type inference. It basically puts most columns as the generic object type, unless you manually work around it eg. using the abovementioned parse_dates parameter.

If you want to automatically detect columns types, you'd have to use a separate data profiling tool, eg. visions, and then cast or feed the inferred types back into your DataFrame constructor (eg. for dates and from_csv, using the parse_dates parameter).

Michał Zawadzki
  • 695
  • 6
  • 14