0

I would like to separate my timestamp into different columns, as I would like to sort my data by year. My time stamp looks like this:

  12/08/2011 11:04:13 AM

Sample of data:

BusinessName     DBAName    LegalOwner  NameLast    NameFirst   ISSDTTM
NEW GARDEN RESTAURANT ZHENG JIA XIANG   12/8/11 11:04
Estragon        CITY OF BOSTON  De Haro Julio   1/18/12 17:01
TAIWAN CAFE     Yan-Fang Zhang Pres.    8/3/12 14:54
Earl of Sandwich Earl of Sandwich (Boston) L L C    1/7/13 9:08
TAIWAN CAFE         Yan-Fang Zhang Pres.    8/3/12 14:54

I tried using this:

df = pd.read_csv("Food_Establishment_Inspections.csv", low_memory=False, index_col='datetime', 
parse_dates={'year':[0]}, 
date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))

But I am receiving an error, which I displayed below:

   TypeError                                 Traceback (most recent call last)
//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in converter(*date_cols)
   2661                 result = tools.to_datetime(
-> 2662                     date_parser(*date_cols), errors='ignore')
   2663                 if isinstance(result, datetime.datetime):

<ipython-input-60-4bd0710aff18> in <lambda>(x)
      2                  parse_dates={'year':[0]},
----> 3                  date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))

TypeError: strptime() argument 1 must be str, not numpy.ndarray

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in converter(*date_cols)
   2670                                             parser=date_parser,
-> 2671                                             dayfirst=dayfirst),
   2672                         errors='ignore')

pandas/src/inference.pyx in pandas.lib.try_parse_dates (pandas/lib.c:61053)()

pandas/src/inference.pyx in pandas.lib.try_parse_dates (pandas/lib.c:60951)()

<ipython-input-60-4bd0710aff18> in <lambda>(x)
      2                  parse_dates={'year':[0]},
----> 3                  date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))

//anaconda/lib/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
    564     format string."""
--> 565     tt, fraction = _strptime(data_string, format)
    566     tzname, gmtoff = tt[-2:]

//anaconda/lib/python3.6/_strptime.py in _strptime(data_string, format)
    361         raise ValueError("time data %r does not match format %r" %
--> 362                          (data_string, format))
    363     if len(data_string) != found.end():

ValueError: time data 'NEW GARDEN RESTAURANT' does not match format '%m/%d/%Y %I:%M:%S %p'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-60-4bd0710aff18> in <module>()
      1 df = pd.read_csv("Food_Establishment_Inspections.csv", low_memory=False, index_col='datetime', 
      2                  parse_dates={'year':[0]},
----> 3                  date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    644                     skip_blank_lines=skip_blank_lines)
    645 
--> 646         return _read(filepath_or_buffer, kwds)
    647 
    648     parser_f.__name__ = name

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    399         return parser
    400 
--> 401     data = parser.read()
    402     parser.close()
    403     return data

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
    937                 raise ValueError('skipfooter not supported for iteration')
    938 
--> 939         ret = self._engine.read(nrows)
    940 
    941         if self.options.get('as_recarray'):

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1583             data = dict((k, v) for k, (i, v) in zip(names, data))
   1584 
-> 1585             names, data = self._do_date_conversions(names, data)
   1586             index, names = self._make_index(data, alldata, names)
   1587 

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _do_date_conversions(self, names, data)
   1362             data, names = _process_date_conversion(
   1363                 data, self._date_conv, self.parse_dates, self.index_col,
-> 1364                 self.index_names, names, keep_date_col=self.keep_date_col)
   1365 
   1366         return names, data

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2724 
   2725             _, col, old_names = _try_convert_dates(converter, colspec,
-> 2726                                                    data_dict, orig_names)
   2727 
   2728             new_data[new_name] = col

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _try_convert_dates(parser, colspec, data_dict, columns)
   2756     to_parse = [data_dict[c] for c in colnames if c in data_dict]
   2757 
-> 2758     new_col = parser(*to_parse)
   2759     return new_name, new_col, colnames
   2760 

//anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in converter(*date_cols)
   2672                         errors='ignore')
   2673                 except Exception:
-> 2674                     return generic_parser(date_parser, *date_cols)
   2675 
   2676     return converter

//anaconda/lib/python3.6/site-packages/pandas/io/date_converters.py in generic_parser(parse_func, *cols)
     36     for i in range(N):
     37         args = [c[i] for c in cols]
---> 38         results[i] = parse_func(*args)
     39 
     40     return results

<ipython-input-60-4bd0710aff18> in <lambda>(x)
      1 df = pd.read_csv("Food_Establishment_Inspections.csv", low_memory=False, index_col='datetime', 
      2                  parse_dates={'year':[0]},
----> 3                  date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))

//anaconda/lib/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
    563     """Return a class cls instance based on the input string and the
    564     format string."""
--> 565     tt, fraction = _strptime(data_string, format)
    566     tzname, gmtoff = tt[-2:]
    567     args = tt[:6] + (fraction,)

//anaconda/lib/python3.6/_strptime.py in _strptime(data_string, format)
    360     if not found:
    361         raise ValueError("time data %r does not match format %r" %
--> 362                          (data_string, format))
    363     if len(data_string) != found.end():
    364         raise ValueError("unconverted data remains: %s" %

ValueError: time data 'NEW GARDEN RESTAURANT' does not match format '%m/%d/%Y %I:%M:%S %p'
dancemc15
  • 598
  • 2
  • 7
  • 21

2 Answers2

0

The pandas.read_csv() function has a keyword argument called parse_dates.

You can read more about it here: datetime dtypes in pandas read_csv

0

For starters, the format you are passing to strptime, '%Y %m %d %H' doesn't match the date/time string you gave as an example. Also, your datetime is only in one column, why are you passing [1,2,3,4] in the parse_dates argument?

Assuming the contents of your CSV are in a file called foo.csv (no header) and contain:

12/08/2011 11:04:13 AM,35
12/08/2011 11:04:13 AM,82
12/08/2011 11:04:13 AM,11
12/08/2011 11:04:13 AM,21
12/08/2011 11:04:13 AM,91
12/08/2011 11:04:13 AM,44

The following should work:

df = pd.read_csv('foo.csv',  parse_dates={'datetime':[0]}, date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p'))

If you only want to retain the year:

df = pd.read_csv('foo.csv',  parse_dates={'year':[0]}, date_parser=lambda x:pd.datetime.strptime(x,'%m/%d/%Y %I:%M:%S %p').strftime('%Y'))
P B
  • 85
  • 4
  • I tried your method, but there are new problems instead. And it says that one of the entry does not match the timestamp format. I updated my code above and the output. – dancemc15 Mar 09 '17 at 19:39
  • did you try the code with the example CSV I provided? You haven't provided your input data so hard to replicate. – P B Mar 09 '17 at 19:44
  • Okay. I updated it with a sample of my data. Sorry it's a bit hard to read. – dancemc15 Mar 09 '17 at 19:54
  • Okay so I tried using foo.csv, and it says: ValueError: time data '12/8/11 11:04' does not match format '%m/%d/%Y %I:%M:%S %p' – dancemc15 Mar 09 '17 at 19:59
  • Aha. That's a different format than what you originally pasted - the code and data I pasted above are compatible. You need to specify exactly the format you're expecting. %Y is a 4-digit year, %y is a 2-digit year; %p specifies AM/PM which your new format doesn't have. Your new format also doesn't have seconds, %s in my code above. Take a look at the formats here and adapt the code to your specific format: http://strftime.org/ – P B Mar 09 '17 at 20:54