2

I am trying to load a csv file which is formatted as follow :

                  40010  40015  40020  40025  40030  40035  40040  40045  
2008-11-03 00:00    786    212    779    227    220    131    680   1006   
2008-11-03 00:03    760    200    765    234    225    133    694   1063   
2008-11-03 00:06    757    205    769    237    230    136    726   1051   
2008-11-03 00:09    781    207    765    240    235    137    711   1040   
2008-11-03 00:12    759    203    751    232    225    134    717   1088
...

The file is comma separated. No fixed width here.

I want the rows indexes to be datetime so here's what i'm currently doing when loading the file :

def dateparse (timestamp):   
    return datetime.datetime.strptime(timestamp, '%Y-%m-%d %I:%M')

global_data_train = pd.read_csv('RTAHistorical.csv', sep=",",parse_dates=True, date_parser=dateparse, header=0, index_col=0, skip_blank_lines = True, engine='python')

But i'm getting the following error :

TypeError: strptime() argument 1 must be str, not numpy.ndarray

As I have seen some people using the very same method successfully, I don't quite understand this error.

What am I doing wrong ?

Community
  • 1
  • 1
Joulin Nicolas
  • 260
  • 2
  • 15
  • Try `global_data_train = pd.read_csv('RTAHistorical.csv', sep=",",parse_dates=[0], header=0, index_col=0, skip_blank_lines = True, engine='python')` Also by default `header` is defaulted to line `0` and you probably don't need `engine='python'` here as it will try the c-engine and fallback to python in the case where you're trying some fancy separator, likewise you don't need to specify `separator` arg here as it's defaulted to comma. Also is what you've posted what the raw data looks like? If so it looks like a fixed width file – EdChum Oct 24 '16 at 09:13
  • That worked. The documentation for `read_csv` says : **boolean** : If True -> try parsing the index. **list of ints or names** -> try parsing columns 1, 2, 3 each as a separate date column. What is the difference between doing`parse_dates=[0]` and `parse_dates=True` then ? – Joulin Nicolas Oct 24 '16 at 09:21
  • 1
    I think that there is a bug when you try to specify a dateparser func and tell it to parse_dates=True and specify the index_col, if your format can be handled by the in-built dateparser then you can just specify the index col index `date_parse=[0]` and the `index_col=[0]` args it will just work no need to over complicate things – EdChum Oct 24 '16 at 09:23

1 Answers1

1

For me works change format to %Y-%m-%d %H:%M:

def dateparse (timestamp):   
    return pd.datetime.strptime(timestamp, '%Y-%m-%d %H:%M')

Sample:

import pandas as pd
from pandas.compat import StringIO

temp=u"""40010,40015,40020,40025,40030,40035,40040,40045
2008-11-03 00:00,786,212,779,227,220,131,680,1006
2008-11-03 00:03,760,200,765,234,225,133,694,1063
2008-11-03 00:06,757,205,769,237,230,136,726,1051
2008-11-03 00:09,781,207,765,240,235,137,711,1040
2008-11-03 00:12,759,203,751,232,225,134,717,1088"""
#after testing replace StringIO(temp) to filename
def dateparse (timestamp):   
    return pd.datetime.strptime(timestamp, '%Y-%m-%d %H:%M')

global_data_train = pd.read_csv(StringIO(temp), 
                                sep=",", 
                                parse_dates=True, 
                                date_parser=dateparse, 
                                header=0, 
                                index_col=0, 
                                skip_blank_lines = True, 
                                engine='python')
print (global_data_train)
                     40010  40015  40020  40025  40030  40035  40040  40045
2008-11-03 00:00:00    786    212    779    227    220    131    680   1006
2008-11-03 00:03:00    760    200    765    234    225    133    694   1063
2008-11-03 00:06:00    757    205    769    237    230    136    726   1051
2008-11-03 00:09:00    781    207    765    240    235    137    711   1040
2008-11-03 00:12:00    759    203    751    232    225    134    717   1088

print (global_data_train.index)
DatetimeIndex(['2008-11-03 00:00:00', '2008-11-03 00:03:00',
               '2008-11-03 00:06:00', '2008-11-03 00:09:00',
               '2008-11-03 00:12:00'],
              dtype='datetime64[ns]', freq=None)

Also works omit date_parser=dateparse.

import pandas as pd
from pandas.compat import StringIO

temp=u"""40010,40015,40020,40025,40030,40035,40040,40045
2008-11-03 00:00,786,212,779,227,220,131,680,1006
2008-11-03 00:03,760,200,765,234,225,133,694,1063
2008-11-03 00:06,757,205,769,237,230,136,726,1051
2008-11-03 00:09,781,207,765,240,235,137,711,1040
2008-11-03 00:12,759,203,751,232,225,134,717,1088"""
#after testing replace StringIO(temp) to filename
global_data_train = pd.read_csv(StringIO(temp), 
                                parse_dates=True, 
                                skip_blank_lines = True)
print (global_data_train)
                     40010  40015  40020  40025  40030  40035  40040  40045
2008-11-03 00:00:00    786    212    779    227    220    131    680   1006
2008-11-03 00:03:00    760    200    765    234    225    133    694   1063
2008-11-03 00:06:00    757    205    769    237    230    136    726   1051
2008-11-03 00:09:00    781    207    765    240    235    137    711   1040
2008-11-03 00:12:00    759    203    751    232    225    134    717   1088

print (global_data_train.index)
DatetimeIndex(['2008-11-03 00:00:00', '2008-11-03 00:03:00',
               '2008-11-03 00:06:00', '2008-11-03 00:09:00',
               '2008-11-03 00:12:00'],
              dtype='datetime64[ns]', freq=None)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252