I'm trying to use the Earley parser in NLTK to parse sentences such as:

If date is before 12/21/2010 then serial = 10

To do this, I'm trying to write a CFG, but the problem is that I need dates and integers in a general format as terminals, rather than specific values. Is there any way to specify the right-hand side of a production rule as a regular expression, which would allow this kind of processing?

Something like:

S -> '[0-9]+'

which would handle all integers.

hippietrail
FahimH
  • Your date format is locale-dependent, and above all it is ambiguous: it collides with the mathematical expression 12 div 21 div 2010, which is probably not what you want. – VGE Dec 25 '10 at 10:02
  • You're right but that will be easy to handle since the input will never contain any mathematical expressions like what you mentioned. Also the date format will be fixed, say, MM/DD/YYYY. I found a way to handle integers, but I'm still looking for a proper solution for dates. – FahimH Jan 03 '11 at 04:20

1 Answer

For this to work, you'll need to tokenize the date so that each digit and slash is a separate token.

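# Requires NLTK 3 or later and Python 3.
from nltk import CFG
from nltk.parse.earleychart import EarleyChartParser

# Every terminal is a single character, so a date is parsed digit by digit.
grammar = CFG.fromstring("""
DATE -> MONTH SEP DAY SEP YEAR
SEP -> "/"
MONTH -> DIGIT | DIGIT DIGIT
DAY -> DIGIT | DIGIT DIGIT
YEAR -> DIGIT DIGIT DIGIT DIGIT
DIGIT -> '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '0'
""")

parser = EarleyChartParser(grammar)
# parse() yields every tree the grammar allows for the token list.
for tree in parser.parse(["1", "/", "1", "0", "/", "1", "9", "8", "7"]):
    print(tree)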

The output is:

(DATE
  (MONTH (DIGIT 1))
  (SEP /)
  (DAY (DIGIT 1) (DIGIT 0))
  (SEP /)
  (YEAR (DIGIT 1) (DIGIT 9) (DIGIT 8) (DIGIT 7)))

This also affords some flexibility, since days and months may be written with a single digit.
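
The tokenization step itself is not shown above, so here is a minimal sketch of one way to do it, using only the standard `re` module. The `tokenize` helper and the `DATE_WORD` pattern are hypothetical names, not part of NLTK, and the sketch assumes the fixed MM/DD/YYYY format mentioned in the comments. Words that look like a date are split into one token per character; everything else, including plain integers, is left as a whole word.

```python
import re

# Hypothetical pattern: a whole word shaped like M/D/YYYY or MM/DD/YYYY.
DATE_WORD = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def tokenize(sentence):
    """Split on whitespace, then break date-like words into characters."""
    tokens = []
    for word in sentence.split():
        if DATE_WORD.fullmatch(word):
            # A string is iterable, so this adds one token per digit or slash.
            tokens.extend(word)
        else:
            tokens.append(word)
    return tokens


print(tokenize("If date is before 12/21/2010 then serial = 10"))
```

The date part of the result has the same shape as the token list passed to the parser above. The remaining words would still need grammar rules of their own, and the integer 10 stays a single token here, so it would have to be split the same way or handled separately.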

gregsabo