-1

I'm looking at files of data. The data always has some form of timestamp that is consistent throughout a given file, but the format varies widely from file to file.

A given file often starts with something like 01/02/17. It is easy to guess that the year is last but is it DMY or MDY? Some later timestamp might be 27/02/17 which is unambiguous, how can I then go back and make sure 01/02/17 was parsed DMY, or MDY if I later found 01/31/17?

I tried to do a wrapper around dateutil.parser but I could not find a way to get any format information back (even just that the format was ambiguous and so the default order was used)

Somethings that I have attempted: 1)

DATETIMEFORMAT = '%m/%d/%y %H:%M'
df[DATETIME] = pd.to_datetime(df[DATETIME],format=DATETIMEFORMAT)

where I have to go in and manually change DATETIMEFORMAT for each file after looking at the file.

2)

datecol = []
for idx in range(df.shape[0]):
  tmpdate = parser.parse(df[DATETIME][idx])
  datecol.append(pd.to_datetime(tmpdate))
df[DATETIME] = datecol

using python's dateutil.parser, this will handle most timestamps correctly but I can't set a default order that will hold for all files. I was working to get format information out when I came across

[https://stackoverflow.com/questions/53892450/get-the-format-in-dateutil-parse][1]

  1. a combination of the two
for idx in range(df.shape[0]):
  tmpdate = parser.parse(df[DATETIME][idx])
  if not is_ambiguous(tmpdate):
     formatstr = get_format_str(df,tmpdate,idx)
     break

df[DATETIME] = pd.to_datetime(df[DATETIME],format=formatstr)

where I look for an unambiguous timestamp and try to use that to recreate the format string for pd.to_datetime().

Ddavid
  • 9
  • 1

1 Answers1

1

The unfortunate news is ambiguous datetimes can be impossible to parse accurately. In theory, adding logic to perform sanity checks (checking if month is higher than 12, etc) could work, although this is not reliable.

A decent option is to map specific files to datetime formats, to specify when parsing. For example:

File Format
FileA mm-dd-yy
FileB dd-mm-yy

Your program logic can look up what format to use when parsing timestamps for a specific file.

Otherwise, converting whatever is producing the files upstream to use ISO-8601 will make things easier in the future for everyone.

Jack Casey
  • 1,628
  • 11
  • 18