I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.

Unfortunately, if a string has multiple date values in it, dateutil throws an error:

>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Any thoughts on how to parse all dates from a long string? Ideally, a list would be created, but I can handle that myself if I need to.

I'm using Python, but at this point, other languages are probably OK, if they get the job done.

PS - I guess I could recursively split the input file in the middle and try, try again until it works, but it's a hell of a hack.

Zero Piraeus
mlissner
  • In your sample string are you considering "on easter" to be a date you want to parse? – MattH Aug 11 '11 at 15:59
  • Nah. Was testing to see if it worked, but I don't care too much either way. – mlissner Aug 12 '11 at 01:59
  • With DateUtil 1.5 it does work of course, my bad. But I would still like to award the one with a cleaner/faster approach than MattH Shawn Chin... – Dieter Nov 05 '12 at 13:47

5 Answers

19

Looking at it, the least hacky way would be to modify the dateutil parser to have a fuzzy-multiple option.

parser._parse takes your string, tokenizes it with _timelex and then compares the tokens with data defined in parserinfo.

Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.

What I suggest is that you allow non-matches while you don't have any processed time tokens; then, when you hit a non-match after collecting some, process the parsed data at that point and start looking for time tokens again.

Shouldn't take too much effort.


Update

While you're waiting for your patch to get rolled in...

This is a little hacky, uses non-public functions in the library, but doesn't require modifying the library and is not trial-and-error. You might have false positives if you have any lone tokens that can be turned into floats. You might need to filter the results some more.

from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

def timetoken(token):
  try:
    float(token)
    return True
  except ValueError:
    pass
  return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))

def timesplit(input_string):
  batch = []
  for token in _timelex(input_string):
    if timetoken(token):
      if info.jump(token):
        continue
      batch.append(token)
    else:
      if batch:
        yield " ".join(batch)
        batch = []
  if batch:
    yield " ".join(batch)

for item in timesplit(a):
  print "Found:", item
  print "Parsed:", p.parse(item)

Yields:

Found: 2011 04 23
Parsed: 2011-04-23 00:00:00
Found: 29 July 1928
Parsed: 1928-07-29 00:00:00
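The caveat about lone tokens that can be turned into floats can be handled with a small post-filter over the candidate strings. This is only a sketch: `drop_lone_numbers` is a hypothetical helper, not part of dateutil, and it assumes the candidates are the space-joined strings that `timesplit` yields. It also discards compact forms such as "20110423", so it only suits a corpus where dates are never written that way.

```python
def drop_lone_numbers(candidates):
    """Yield only the candidates that are more than a single bare number.

    Hypothetical helper: `candidates` is assumed to be an iterable of
    space-joined token strings such as "2011 04 23" or "29 July 1928".
    """
    for candidate in candidates:
        tokens = candidate.split()
        if len(tokens) == 1:
            try:
                float(tokens[0])
            except ValueError:
                pass  # a single word such as "July": keep it
            else:
                continue  # a lone number such as "3" or "42.5": skip it
        yield candidate
```

It would be used as `for item in drop_lone_numbers(timesplit(a)): ...`, leaving the rest of the loop unchanged.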

Update for Dieter

Dateutil 2.1 appears to be written for compatibility with python3 and uses a "compatibility" library called six. Something isn't right with it and it's not treating str objects as text.

This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:

from cStringIO import StringIO
for item in timesplit(StringIO(a)):
  print "Found:", item
  print "Parsed:", p.parse(StringIO(item))

If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. E.g.:

from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)
MattH
  • Probably the most efficient solution. +1. Of course, modifying the library itself would make deployment/maintenance a little harder unless the changes are absorbed into the official source. – Shawn Chin Aug 11 '11 at 18:12
  • This is an amazing answer - by far the best I've ever gotten on SO. Will test it out and let you know how it goes. Thanks! – mlissner Aug 12 '11 at 02:00
  • This works quite well. It has TypeErrors and ValueErrors that I need to catch, and many false positives. The Errors are easy to catch, and I'm eliminating false positives by nuking anything from the current year (my corpus only has old dates). Thanks again. – mlissner Aug 12 '11 at 03:37
  • Do you know if there is any difference between DateUtil 1.5 & 2.1 other than compatibility for Python 3? In other words: I use Python 2.7; what should I use? – Dieter Nov 12 '12 at 13:59
  • There's a comment on [this page](http://labix.org/python-dateutil) under news that says that you should use DateUtil 1.x for Python 2.x. – MattH Nov 12 '12 at 14:29
  • This fails for input 'find all cases on 2nd Aug 2013 and 5th Aug 2014' – Praveen Jul 15 '13 at 12:40
  • @Praveen: This is because `and` is a `info.jump` token which is used to combine time tokens and doesn't necessarily indicate that a sequence of time tokens has concluded. I suggest you modify the behaviour of `timesplit` to treat `info.jump` as a break. I.e. `if timetoken(token) and not info.jump(token): batch.append(token)` – MattH Jul 15 '13 at 21:47
6

While I was offline, I was bothered by the answer I posted here yesterday. Yes it did the job, but it was unnecessarily complicated and extremely inefficient.

Here's the back-of-the-envelope edition that should do a much better job!

import itertools
from dateutil import parser

jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))

def parse_multiple(s):
    def is_valid_kw(s):
        try:  # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords

    def _split(s):
        kw_found = False
        tokens = parser._timelex.split(s)
        for i in xrange(len(tokens)):
            if tokens[i] in jumpwords:
                continue 
            if not kw_found and is_valid_kw(tokens[i]):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(tokens[i]):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])

    return [parser.parse(x) for x in _split(s)]

Example usage:

>>> parse_multiple("I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928")
[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]

It's probably worth noting that its behaviour deviates slightly from dateutil.parser.parse when dealing with empty/unknown strings. Dateutil will return the current day, while parse_multiple returns an empty list which, IMHO, is what one would expect.

>>> from dateutil import parser
>>> parser.parse("")
datetime.datetime(2011, 8, 12, 0, 0)
>>> parse_multiple("")
[]

P.S. Just spotted MattH's updated answer which does something very similar.

Shawn Chin
  • This seemed more reliable initially than MattH's suggestion, but the performance was abysmal on larger tests (not surprisingly). Thanks for the help though! – mlissner Aug 12 '11 at 03:35
  • @mlissner you're welcome. It was a fun problem to solve. So much so that I was thinking over it last night and came up with what I believe is a better solution. See updated answer. – Shawn Chin Aug 12 '11 at 08:44
0

I think if you put the "words" in an array, it should do the trick. That way you can check whether each word is a date and, if it is, store it in a variable.

Once you have the date, you should use the datetime library.
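A minimal sketch of that idea, using only the standard library. The `FORMATS` tuple and the name `dates_from_words` are assumptions made for illustration; the answer does not say which formats to try.

```python
from datetime import datetime

# Assumed set of single-word date formats; extend to suit the corpus.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def dates_from_words(text):
    """Split text into words and keep every word that parses as a date."""
    found = []
    for word in text.split():
        # Remove punctuation attached to the word, e.g. "2011-04-23,"
        word = word.strip(".,;:!?()\"'")
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(word, fmt))
                break  # stop at the first format that fits
            except ValueError:
                continue  # not this format, try the next one
    return found
```

Because each word is checked on its own, this only finds dates written as a single token. On the string from the question it finds 2011-04-23 but misses "the 29th of July, 1928", which spans several words.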

Tiago Moutinho
0

Why not write a regex pattern covering all the possible forms in which a date can appear, and then run it over the text? I presume that there aren't dozens and dozens of ways to express a date in a string.

The only problem is collecting as many date formats as possible.
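A rough sketch of what this could look like, covering only two forms: ISO dates such as 2011-04-23, and "day month year" dates such as "29th of July, 1928". The patterns, the `MONTHS` table, and the name `find_dates` are assumptions made for illustration. A real corpus would need more alternatives.

```python
import re
from datetime import datetime

# Map lowercase English month names to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# 2011-04-23
ISO_RE = r"(?P<iso_y>\d{4})-(?P<iso_m>\d{1,2})-(?P<iso_d>\d{1,2})"
# 29 July 1928, 29th of July, 1928, ...
LONG_RE = (r"(?P<d>\d{1,2})(?:st|nd|rd|th)?(?:\s+of)?\s+"
           r"(?P<mon>[A-Za-z]+),?\s+(?P<y>\d{4})")
DATE_RE = re.compile(r"\b(?:%s|%s)\b" % (ISO_RE, LONG_RE))


def find_dates(text):
    """Return a datetime for every date-like match that is a valid date."""
    results = []
    for m in DATE_RE.finditer(text):
        try:
            if m.group("iso_y"):
                results.append(datetime(int(m.group("iso_y")),
                                        int(m.group("iso_m")),
                                        int(m.group("iso_d"))))
            else:
                month = MONTHS.get(m.group("mon").lower())
                if month is None:
                    continue  # the word after the day was not a month name
                results.append(datetime(int(m.group("y")), month,
                                        int(m.group("d"))))
        except ValueError:
            continue  # looked like a date but is invalid, e.g. 2011-02-30
    return results
```

On the string from the question this picks up both 2011-04-23 and the 29th of July, 1928, and ignores "easter".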

eyquem
0

I see that there are some good answers already, but I'm adding this one because it worked better in a use case of mine where the above answers didn't.

Using this library: https://datefinder.readthedocs.io/en/latest/index.html#module-datefinder


import datefinder

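def DatesToList(x):
    # find_dates() returns a generator, so collect the matches into a list
    return list(datefinder.find_dates(x))


s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
dates = DatesToList(s)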


Output:

[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]