Parsing long form dates from string

Question

I am aware that there are other solutions to similar problems on Stack Overflow, but they don't work in my particular situation.

I have some strings. Here are a few examples:

string_with_dates = "random non-date text, 22 May 1945 and 11 June 2004"
string2 = "random non-date text, 01/01/1999 & 11 June 2004"
string3 = "random non-date text, 01/01/1990, June 23 2010"
string4 = "01/2/2010 and 25th of July 2020"
string5 = "random non-date text, 01/02/1990"
string6 = "random non-date text, 01/02/2010 June 10 2010"

I need a parser that can determine how many date-like substrings are in a string and parse each of them into an actual date, collected in a list. I can't find any solutions out there. Here is the desired output for string_with_dates:


['05/22/1945','06/11/2004']

Or as actual datetime objects. Any ideas?

I have tried the solutions listed here, but they don't work: How to parse multiple dates from a block of text in Python (or another language)

Here is what happens when I try the solutions suggested in that link:


import itertools
from dateutil import parser

jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))

def parse_multiple(s):
    def is_valid_kw(s):
        try:  # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords

    def _split(s):
        kw_found = False
        tokens = parser._timelex.split(s)
        for i in range(len(tokens)):
            if tokens[i] in jumpwords:
                continue 
            if not kw_found and is_valid_kw(tokens[i]):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(tokens[i]):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])

    return [parser.parse(x) for x in _split(s)]

parse_multiple(string_with_dates)

Output:


ParserError: Unknown string format: 22 May 1945 and 11 June 2004

Another method:


from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

def timetoken(token):
  try:
    float(token)
    return True
  except ValueError:
    pass
  return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))

def timesplit(input_string):
  batch = []
  for token in _timelex(input_string):
    if timetoken(token):
      if info.jump(token):
        continue
      batch.append(token)
    else:
      if batch:
        yield " ".join(batch)
        batch = []
  if batch:
    yield " ".join(batch)

for item in timesplit(string_with_dates):
  print("Found:", item)
  print("Parsed:", p.parse(item))

Output:



ParserError: Unknown string format: 22 May 1945 11 June 2004

Any ideas?

What exactly is not working from the solutions you found from the link? — Trooper Z, Nov 16 '22 at 14:41
For all the methods in that link I get this error: " ParserError: Unknown string format: 22 May 1945 and 11 June 2004" — Data of All Kinds, Nov 16 '22 at 14:44
Could you show an example of what you have tried? Also, does the string with dates have consistent formatting between dates or is it varied? You will have to make sure you can parse those multiple scenarios. — Trooper Z, Nov 16 '22 at 14:52
Just updated to include the functions that I already tried and the errors they yielded — Data of All Kinds, Nov 16 '22 at 14:53
Try separating the two dates into separate strings using `.split()` and then parsing those dates individually. — Trooper Z, Nov 16 '22 at 14:54
Yes that is one approach -- but to answer your prior question -- yes there is a high degree of variability in what separates the dates and the format of the dates. Perhaps I should have included more examples. — Data of All Kinds, Nov 16 '22 at 14:55
This seems like a hard problem to me, you need something that can parse English to separate the dates. — Mark Ransom, Nov 16 '22 at 15:20

Data of All Kinds · Accepted Answer · 2022-11-16T18:49:09.660

Okay sorry to anyone who spent time on this -- but I was able to answer my own question. Leaving this up in case anyone else has the same issue.

This package worked perfectly for my case: https://pypi.org/project/datefinder/


import datefinder

def DatesToList(x):
    # find_dates returns a generator of datetime objects; collect them into a list
    return list(datefinder.find_dates(x))

dates = DatesToList(string_with_dates)
print(dates)

Output:


[datetime.datetime(1945, 5, 22, 0, 0), datetime.datetime(2004, 6, 11, 0, 0)]
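
To get the string format requested in the question (MM/DD/YYYY) instead of datetime objects, each result can be passed through strftime. This is only a minimal sketch: it uses hard-coded datetime values in place of the datefinder output so that it runs without the package installed, and format_dates is a hypothetical helper name, not part of datefinder.

```python
from datetime import datetime

def format_dates(dates, fmt="%m/%d/%Y"):
    """Format an iterable of datetime objects as strings (default MM/DD/YYYY)."""
    return [d.strftime(fmt) for d in dates]

# Stand-in for DatesToList(string_with_dates); values copied from the output above
found = [datetime(1945, 5, 22, 0, 0), datetime(2004, 6, 11, 0, 0)]
print(format_dates(found))  # ['05/22/1945', '06/11/2004']
```

In real use, the list returned by DatesToList would be passed in place of the hard-coded found list. Results for the other example strings are not shown here and should be checked case by case, since ambiguous inputs such as 01/02/1990 can be read as either month-first or day-first.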
