Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding where the dates are in the tokenized text. It enumerates all ways of partitioning the sentence (a list of tokens) into segments of minps to maxps tokens each.
Each partitioning is fed to the parser, which outputs a list of parsed dates together with the token range each date was parsed from.
Each parser output is scored with the sum of the squared token-range lengths (so as to prefer one date parsed from 4 tokens over two dates parsed from 2 tokens each).
Finally, it finds and outputs the parse with the best score.
The three building blocks of the algorithm:
from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    # Enumerate all segmentations of lst into runs of minps..maxps
    # tokens; yields lists of (startIndex, tokens) pairs.
    if lst == []:
        yield []
    else:
        for l in range(minps, min(maxps, len(lst)) + 1):
            for z in partition(lst[l:], minps, maxps, i + l):
                yield [(i, lst[:l])] + z
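To see what the enumeration produces, here is a minimal self-contained demo (the generator is repeated so the snippet runs on its own) on a three-token list with maxps=2:

```python
def partition(lst, minps, maxps, i=0):
    # Enumerate all segmentations of lst into runs of minps..maxps tokens.
    if lst == []:
        yield []
    else:
        for l in range(minps, min(maxps, len(lst)) + 1):
            for z in partition(lst[l:], minps, maxps, i + l):
                yield [(i, lst[:l])] + z

for p in partition(['a', 'b', 'c'], 1, 2):
    print(p)
# [(0, ['a']), (1, ['b']), (2, ['c'])]
# [(0, ['a']), (1, ['b', 'c'])]
# [(0, ['a', 'b']), (2, ['c'])]
```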
def parsedates(p):
    # Try to parse every segment of a partitioning as a date.
    for i, pi in p:
        try:
            d = parsedate(' '.join(pi))
        except (ValueError, OverflowError):
            continue  # this segment is not a date
        # output: (startIndex, endIndex, parsedDate)
        yield i, i + len(pi), d
def score(p):
    # Sum of squared token-range lengths of the parsed dates.
    s = 0
    for pi in p:
        s += (pi[1] - pi[0]) ** 2
    return s
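The quadratic weighting is what makes one long parse beat several short ones: a single date covering 4 tokens scores 16, while two dates covering 2 tokens each score only 8. A self-contained check (score repeated here, with None standing in for the parsed dates):

```python
def score(p):
    # Sum of squared token-range lengths of the parsed dates.
    s = 0
    for pi in p:
        s += (pi[1] - pi[0]) ** 2
    return s

print(score([(0, 4, None)]))                # one date from 4 tokens -> 16
print(score([(0, 2, None), (2, 4, None)]))  # two dates from 2 tokens each -> 8
```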
Finding the parse with best score:
def bestparse(toks, maxps=3):
    bestscore = 0
    bestparse = None
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            bestparse = l
    return bestparse
Some tests:
l = ['bla', 'bla', 'bla', '12', 'Jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 at tokens 3,4,5
found date 2015-01-04 00:00:00 at tokens 9
l = ['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 at tokens 3,4,5,6
Beware that this can be very computationally expensive, so you may want to run it on single short phrases rather than on the whole document. You may even want to split long phrases into chunks.
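One way to chunk, sketched here with overlapping sliding windows so that a date straddling a chunk boundary is still seen whole in some window (the window size and overlap are arbitrary choices, not part of the original code):

```python
def chunks(toks, size=8, overlap=3):
    # Yield (offset, window) pairs; consecutive windows share `overlap`
    # tokens so a date crossing a boundary still fits inside one window.
    step = size - overlap
    for start in range(0, max(len(toks) - overlap, 1), step):
        yield start, toks[start:start + size]

for off, window in chunks(['bla'] * 12):
    print(off, window)
```

You would then run bestparse on each window, shift the reported token indices by the window offset, and deduplicate dates found twice in an overlap region.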
Another point for improvement is the partitioning function: if you have prior information, such as an upper bound on how many dates a single sentence can contain, the number of partitionings to enumerate can be greatly reduced.
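For instance, if you know a sentence contains at most k multi-token dates, a variant of partition can refuse to open more than k segments longer than one token. This is a hypothetical sketch, not part of the original code:

```python
def partition_limited(lst, minps, maxps, maxdates, i=0):
    # Like partition(), but allows at most `maxdates` segments longer
    # than one token, pruning the exponential search space.
    if lst == []:
        yield []
        return
    for l in range(minps, min(maxps, len(lst)) + 1):
        if l > 1 and maxdates == 0:
            break  # no budget left for multi-token segments
        for z in partition_limited(lst[l:], minps, maxps,
                                   maxdates - (1 if l > 1 else 0), i + l):
            yield [(i, lst[:l])] + z
```

On a five-token sentence with maxps=3, this cuts the 13 unrestricted partitionings down to 8 when maxdates=1.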