27

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file uses only one date/time format, but the format varies between files. Profiling shows a lot of time being used by dateutil.parser.parse. Since it only needs to be determined once per file, implementing something that isn't guessing the format each time could speed things up.

I don't actually know the formats in advance and I'll still need to infer the format. Something like:

from MysteryPackage import date_string_to_format_string
import datetime

# e.g. mystring = '1 Jan 2016'
myformat = None

...

# somewhere in a loop reading from a file or connection:
if myformat is None:
    myformat = date_string_to_format_string(mystring)

# do the usual checks to see if that worked, then:
mydatetime = datetime.strptime(mystring, myformat)

Is there such a function?

wim
  • 338,267
  • 99
  • 616
  • 750
Jason
  • 2,507
  • 20
  • 25
  • Look here: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior – FrankTheTank_12345 Jun 02 '17 at 05:45
  • 2
    It's not possible with `dateutil`. Check out `dateinfer`, as mentioned in the dupe. If you find a better solution, post an answer there too! – wim Jun 02 '17 at 05:59
  • Thanks all. I didn't find the earlier question; now I know. :) – Jason Jun 02 '17 at 06:00
  • As you say in the dupe, there are limits to dateinfer: e.g. `dateinfer.infer(['2001-04-04T19:00:02.0902Z'])` gives `'%Y-%d-%mT%H:%M:%S.%YZ'`. That last `%Y` is troubling... – Jason Jun 02 '17 at 06:11
  • if `my_string` is always in the format 1 Jan 2016, then you can use ` import datetime;` ` if myformat is None:` `mydatetime = datetime.datetime.strptime(my_string, '%d %b %Y')` # Results in: datetime.datetime(2016, 1, 1, 0, 0) # mydatetime.year is 2016 # mydatetime.month is 1 # mydatetime.day is 1 – theBuzzyCoder Jun 02 '17 at 06:45
  • From the question: `Since I don't actually know the formats in advance, I'll still need to infer the format.` So the `my_string` is just to illustrate one of many possible forms of the same problem. – Jason Jun 02 '17 at 07:35
  • I agree that `dateinfer` results are garbage. I've reopened your question, since the [duplicate](https://stackoverflow.com/q/34073502/674039) - whilst an exact duplicate - doesn't really have any acceptable answer. – wim Jun 05 '17 at 16:49
  • 1
    Have you tried [dateparser](https://pypi.python.org/pypi/dateparser). Seems legit. – zipa Jun 05 '17 at 17:02
  • Can you provide some real world example strings? One way would be to parse the first string with a regular expression and save which one fitted. – Jan Jun 05 '17 at 18:33
  • with `Since I don't actually know the formats in advance, I'll still need to infer the format.`, do you mean per file or in general?, Are you able to give a(n exhaustive) list of formats to expect, so we know whether to take `dayfirst` or `monthfirst` into account and other nuances – Maarten Fabré Jun 06 '17 at 15:02
  • 1
    In my use case, we can assume the format will be the same in every date entry per file. I'm in Australia (a `dayfirst` locale), but we often deal with inflexible software that is `monthfirst` by default and tricky to change, so not all the users switch. In my dream package, a default could be assumed. Even better might be to "vote" using a collection dates from the file, and a warning if the results were still unclear. – Jason Jun 06 '17 at 21:44
  • `dateutil` has been amazing at guessing the correct date, but is also the bottleneck, and using it for every date makes the code slower by an order of magnitude. It would be great if either `dateuitl` could show what format it found, or if there was a way to interpolate a format, given a date string and, say, a datetime object. – Jason Jun 06 '17 at 21:54
  • 2
    Using multiple data points for analysis seems like a good approach. You could do this with a stateful class (create an instance per file), that trains itself on the `strftime` template as each new date come in (checking whether the `strptime` result and dateutil parser result are in agreement), it can attempt to use the current template, and fall back on `dateutil.parser` on failure. Presumably it will converge on a suitable `strftime` string which can be used without failure for the remainder of the file. – wim Jun 07 '17 at 19:12

4 Answers4

10

This is a tricky one. My approach makes use of regular expressions and the (?(DEFINE)...) syntax which is only supported by the newer regex module.


Essentially, DEFINE let us define subroutines prior to matching them, so first of all we define all needed bricks for our date guessing function:
    (?(DEFINE)
        (?P<year_def>[12]\d{3})
        (?P<year_short_def>\d{2})
        (?P<month_def>January|February|March|April|May|June|
        July|August|September|October|November|December)
        (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
        (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
        (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
        (?P<hms_def>\d{2}:\d{2}:\d{2})
        (?P<hm_def>\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )
        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)

After this, we need to think of possible delimiters:

# delim
delim = re.compile(r'([-/., ]+|(?<=\d)T)')

Format mapping:

# formats
formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

The function GuessFormat() splits the parts with the help of the delimiters, tries to match them and outputs the corresponding code for strftime():

def GuessFormat(datestring):

    # define the bricks
    bricks = re.compile(r"""
            (?(DEFINE)
                (?P<year_def>[12]\d{3})
                (?P<year_short_def>\d{2})
                (?P<month_def>January|February|March|April|May|June|
                July|August|September|October|November|December)
                (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
                (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                (?P<hm_def>T?\d{2}:\d{2})
                (?P<ms_def>\d{5,6})
                (?P<delim_def>([-/., ]+|(?<=\d|^)T))
            )
            # actually match them
            (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
            (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
            """, re.VERBOSE)

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

    parts = delim.split(datestring)
    out = []
    for index, part in enumerate(parts):
        try:
            brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
            key = next(iter(brick))

            # ambiguities
            if key == 'day' and index == 2:
                key = 'month_dec'

            item = part if key == 'delim' else formats[key]
            out.append(item)
        except AttributeError:
            out.append(part)

    return "".join(out)

A test in the end:

import regex as re

datestrings = [datetime.now().isoformat(), '2006-11-02', 'Thursday, 10 August 2006 08:42:51', 'August 9, 1995', 'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30', 
'06 Jun 2017 20:33:10']

# test
for dt in datestrings:
    print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))

This yields:

Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
Date: 2006-11-02, Format: %Y-%m-%d
Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
Date: August 9, 1995, Format: %B %m, %Y
Date: Aug 9, 1995, Format: %b %m, %Y
Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
Jan
  • 42,290
  • 8
  • 54
  • 79
  • 1
    Right, but just adding a new format every time is kind of missing the point. The whole idea is to infer the format, as dateutil parser already can, without having to manually update the parser definitions every time a new file or date format comes along. I also don't really understand what `(DEFINE)` is doing here, the answer could do with some explanation about that. – wim Jun 06 '17 at 19:10
  • @wim: Right you are and I have thought of another approach: splitting the parts in smaller bricks and analyzing them individually - see the updated answer with not fixed formats anymore. – Jan Jun 07 '17 at 12:48
  • It's getting there. Still needs some polish, and to handle failures more gracefully. e.g. `GuessFormat(datetime.now().isoformat())` is crashing with unhandled `AttributeError: 'NoneType' object has no attribute 'groupdict'` – wim Jun 07 '17 at 19:07
  • @wim: True, haven't implemented any error checks yet - will update as soon as I get to it. – Jan Jun 07 '17 at 19:10
  • @wim: Shall the `T` occur in the output as well? – Jan Jun 07 '17 at 19:36
  • Yes. I would expect something like `'%Y-%m-%dT%H:%M:%S.%f'`. – wim Jun 07 '17 at 19:48
8

I don't have a ready-made solution, but this is a very tricky problem and since too many brain-hours have already been spent on dateutil, instead of trying to replace that, I'll propose an approach that incorporates it:

  1. Read the first N records and parse each date using dateutil
  2. For each date part, note where in the string the value shows up
  3. If all (or >90%) date part locations match (like "YYYY is always after DD, separated by a comma and space"), turn that info into a strptime format string
  4. Switch to using datetime.strptime() with a relatively good level of confidence that it will work with the rest of the file

Since you stated that "each file uses only one date/time format", this approach should work (assuming you have different dates in each file so that mm/dd ambiguity can be resolved by comparing multiple date values).

Cahit
  • 2,484
  • 19
  • 23
  • This is similar to what I mentioned in a comment earlier [here](https://stackoverflow.com/questions/44321601/how-to-determine-appropriate-strftime-format-from-a-date-string#comment75840369_44321601). Some code samples would be helpful. – wim Jun 13 '17 at 16:30
  • As a starting point for (2), I've had luck using `pieces = re.split(r'\W+', date_string)`. From there, I can answer the question of "where in the string the value shows up" with `pieces.index()`. – jobo3208 Jul 17 '18 at 14:52
  • @Cahit , would you please clarify more your answer an example would be appreciated – Harvester Haidar May 18 '19 at 12:11
  • @HarvesterHaidar, the implementation would need to be specific to the problem/file at hand, so a generic example is not practical at the moment. The basic idea is to develop an acceptable level of confidence for the date format in use after reading enough sample records and identifying/verifying that they all use the same formatting. – Cahit May 22 '19 at 19:05
4

You can write your own parser:

import datetime

class DateFormatFinder:
    def __init__(self):
        self.fmts = []

    def add(self,fmt):
        self.fmts.append(fmt)

    def find(self, ss):
        for fmt in self.fmts:            
            try:
                datetime.datetime.strptime(ss, fmt)
                return fmt
            except:
                pass
        return None

You can use it as follows:

>>> df = DateFormatFinder()
>>> df.add('%m/%d/%y %H:%M')
>>> df.add('%m/%d/%y')
>>> df.add('%H:%M')

>>> df.find("01/02/06 16:30")
'%m/%d/%y %H:%M'
>>> df.find("01/02/06")
'%m/%d/%y'
>>> df.find("16:30")
'%H:%M'
>>> df.find("01/02/06 16:30")
'%m/%d/%y %H:%M'
>>> df.find("01/02/2006")

However, It is not so simple as dates can be ambiguous and their format can not be determined without some context.

>>> datetime.strptime("01/02/06 16:30", "%m/%d/%y %H:%M") # us format
datetime.datetime(2006, 1, 2, 16, 30)
>>> datetime.strptime("01/02/06 16:30", "%d/%m/%y %H:%M") # european format
datetime.datetime(2006, 2, 1, 16, 30)
Michael Mior
  • 28,107
  • 9
  • 89
  • 113
napuzba
  • 6,033
  • 3
  • 21
  • 32
  • This is essentially the same suggestion as offered in the dupe [here](https://stackoverflow.com/a/34089645/674039). **It is incorrect to assume that a format string is correct, just because `strptime` doesn't raise any exception**. Any acceptable solution needs more sophistication than that, perhaps using multiple data points. – wim Jun 05 '17 at 20:21
  • You are right - It is not a general solution. As I already said, the date can be ambiguous. However, If the dates formats are known and are not ambiguous - It can be a valid solution. – napuzba Jun 05 '17 at 20:32
0

You can try dateinfer module. It's pretty nice and deals with all your cases easily. Though if you need more fine control over the code and are ready to write it from scratch, regex does look like the best option.

  • `dateinfer` is mentioned in the comments under the original question. At the time, it gave a lot of garbage results. – Jason Sep 24 '20 at 10:32
  • yeah well it does have issues and I had to tweak it according to my use case. In that regard I found it helpful as I didn't have to write my logic from scratch. and didn't notice in it in the comments, my bad. :P – Rishabh Bhardwaj Sep 24 '20 at 20:09