3

Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given a sentence

"bla bla bla bla 12 Jan 14 bla bla bla 01/04/15 bla bla bla"

the parser could pick out the two date occurrences. I know of some Java tools, but are there Python ones? Would NTLK be an overkill?

Thanks

Kar
  • 6,063
  • 7
  • 53
  • 82
  • possible duplicate of [Natural language time parser](http://stackoverflow.com/questions/11340963/natural-language-time-parser) – dani herrera Jun 26 '15 at 10:17
  • @danihp I don't think so. I'm looking for a parser that extracts dates embedded in arbitrary strings rather than parsing a date string into an object. – Kar Jun 26 '15 at 10:25
  • @Kar the year are they gonna be like 14 and not 2014 in all cases? – Ajay Jun 26 '15 at 11:01
  • @Ajay Unfortunately, no. The format could be arbitrary. – Kar Jun 26 '15 at 11:03

1 Answers1

3

Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding where the dates are in the tokenized text. It enumerates all ways of partitioning the sentence (as list of tokens), with partition size from minps to maxps.

Each partitioning is run into the parser, which outputs a list of parsed dates, and the token range where it was parsed.

Each parser output is scored with sum of token ranges squared (so to prefer a date parsed from 4 tokens rather than 2 dates parsed from 2 tokens each).

Finally, it find and outputs the parse with best score.

The three building blocks of the algorithm:

from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    if lst == []:
        yield []
    else:
        try:
            for l in range(minps, maxps+1):
                if l > len(lst): continue
                for z in partition(lst[l:], minps, maxps, i+l):
                    yield [(i, lst[:l])] + z
        except:
            pass

def parsedates(p):
    for x in p:
        i, pi = x
        try:
            d = parsedate(' '.join(pi))
            # output: (startIndex, endIndex, parsedDate)
            if d: yield i, i+len(pi), d
        except: pass

def score(p):
    score = 0
    for pi in p:
        score += (pi[1]-pi[0])**2
    return score

Finding the parse with best score:

def bestparse(toks, maxps=3):
    bestscore = 0
    bestparse = None
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            bestparse = l
    return bestparse

Some tests:

l=['bla', 'bla', 'bla', '12', 'Jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 at tokens 3,4,5

found date 2015-01-04 00:00:00 at tokens 9

l=['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 at tokens 3,4,5,6

Beware that this can be very computationally expensive, so you may want to run it on single short phrases, not on the whole document. You may even want to split long phrases in chunks.

Another point for improvement is the partitioning function. If you have prior information like how many dates can be at most in a single sentence, the number of ways of partitioning it can be greatly reduced.

fferri
  • 18,285
  • 5
  • 46
  • 95
  • I'm not too certain how reliable it is to rely on `dateutil.parser` for date identification. `parsedate('1')` gives 2015-06-01 and `parsedate('1 1')` gives 2015-01-01. That seems a bit contrived, doesn't it? – Kar Jun 27 '15 at 09:56
  • @kar: looks like you haven't understood how the algorithm work – fferri Jun 27 '15 at 10:00
  • Given `'bla 1 2 bla 1/14'` it'd give two occurrences. I don't think '1 2' should be seen as a date occurrence. – Kar Jun 27 '15 at 10:07
  • `1/14` could be something else than a date as well (e.g. a fractional number). alas. – fferri Jun 27 '15 at 12:10