26

I have a simple question related with csv files and parsing datetime.

I have a csv file that look like this:

YYYYMMDD, HH,    X
20110101,  1,   10
20110101,  2,   20
20110101,  3,   30

I would like to read it using pandas (read_csv) and have it in a dataframe indexed by the datetime. So far I've tried to implement the following:

import pandas as pnd
pnd.read_csv("..\\file.csv",  parse_dates = True, index_col = [0,1])

and the result I get is:

                         X
YYYYMMDD    HH            
2011-01-01 2012-07-01   10
           2012-07-02   20
           2012-07-03   30

As you see the parse_dates in converting the HH into a different date.

Is there a simple and efficient way to combine properly the column "YYYYMMDD" with the column "HH" in order to have something like this? :

                      X
Datetime              
2011-01-01 01:00:00  10
2011-01-01 02:00:00  20
2011-01-01 03:00:00  30

Thanks in advance for the help.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
Mauricio
  • 263
  • 1
  • 3
  • 6

2 Answers2

32

If you pass a list to index_col, it means you want to create a hierarchical index out of the columns in the list.

In addition, the parse_dates keyword can be set to either True or a list/dict. If True, then it tries to parse individual columns as dates, otherwise it combines columns to parse a single date column.

In summary, what you want to do is:

from datetime import datetime
import pandas as pd
parse = lambda x: datetime.strptime(x, '%Y%m%d %H')
pd.read_csv("..\\file.csv",  parse_dates = [['YYYYMMDD', 'HH']], 
            index_col = 0, 
            date_parser=parse)
Anthon
  • 69,918
  • 32
  • 186
  • 246
Chang She
  • 16,692
  • 8
  • 40
  • 25
  • 2
    What if you start with a dataframe instead as oppose from reading directly in from csv – user1234440 Feb 12 '14 at 01:53
  • A more general solution that starts with a `DataFrame` and need not assume the date is the index column would be really useful. Perhaps that's a separate question? – cboettig Dec 21 '15 at 23:40
11

I am doing this all the time, so I tested different ways for speed. The fastest I found is the following, approx. 3 times faster than Chang She's solution, at least in my case, when taking the total time of file parsing and date parsing into account:

First, parse the data file using pd.read_csv withOUT parsing dates. I find that it is slowing down the file-reading quite a lot. Make sure that the columns of the CSV file are now columns in the dataframe df. Then:

format = "%Y%m%d %H"
times = pd.to_datetime(df.YYYYMMDD + ' ' + df.HH, format=format)
df.set_index(times, inplace=True)
# and maybe for cleanup
df = df.drop(['YYYYMMDD','HH'], axis=1)
K.-Michael Aye
  • 5,465
  • 6
  • 44
  • 56
  • 2
    doesn't work, I get `unsupported operand type(s) for +: 'numpy.ndarray' and 'str'` – user1234440 Feb 12 '14 at 01:50
  • I assume you already converted the read-in dates to integers. (Even so in that case my error message looks slightly different: TypeError: unsupported operand type(s) for +: 'numpy.dtype' and 'str'. When df.YYYYMMDD and df.HH are strings read-in from a csv file without conversion to integers, this works fine. – K.-Michael Aye Feb 12 '14 at 04:22
  • Not aiming for speed, and with data already loaded, I circumvent that issue with, say, `df['year'].astype(int).astype(str)`. Ugly, but works. ;7) – dmvianna Apr 30 '15 at 01:29