
What I want to do is to parse raw natural text and find all the phrases that describe dates.

I've got a fairly big corpus with all the references to dates marked up:

I met him <date>yesterday</date>.
Roger Zelazny was born <date>in 1937</date>
He'll have a hell of a hangover <date>tomorrow morning</date>

I don't want to interpret the date phrases, just locate them. The fact that they're dates is irrelevant (in real life they're not even dates, but I don't want to bore you with the details); basically it's just an open-ended set of possible values. The grammar of the values themselves can be approximated as context-free, but it's quite complicated to build manually, and as it grows more complex it gets harder and harder to avoid false positives.

I know this is a bit of a long shot so I'm not expecting an out-of-the-box solution to exist out there, but what technology or research can I potentially use?

John Lehmann
biziclop
  • See question http://stackoverflow.com/questions/9294926/how-does-apple-find-dates-times-and-addresses-in-emails. This is called Named Entity Extraction, a subtask of Information Extraction. @reseter provided the link. Both machine-learning and grammar-based approaches work well. – John Lehmann Mar 13 '12 at 13:42
  • have a look at https://duckling.wit.ai/ – Saurabh Jain Jul 23 '16 at 10:23
  • @sdream Thanks, this looks promising too, I'm going to give it a try. – biziclop Jul 23 '16 at 10:42

2 Answers


One of the generic approaches used in academia and in industry is based on Conditional Random Fields. Basically, it is a special kind of probabilistic model: you train it on your marked-up data first, and then it can label certain types of entities in a given text.
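To train such a model on markup like yours, the inline <date>...</date> tags first have to be turned into per-token labels (here a plain DATE/O scheme; BIO-style labels are a common refinement when annotated spans can be adjacent). A minimal sketch of that conversion, assuming whitespace tokenization and one sentence per line (both simplifications):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MarkupToTokens {

        // Matches one inline annotation such as <date>in 1937</date>
        private static final Pattern TAG = Pattern.compile("<date>(.*?)</date>");

        /** Prints one "token TAB label" pair per line, with a blank line after each sentence. */
        static void convert(String sentence) {
            Matcher m = TAG.matcher(sentence);
            int last = 0;
            while (m.find()) {
                emit(sentence.substring(last, m.start()), "O"); // text before the annotation
                emit(m.group(1), "DATE");                       // the annotated span
                last = m.end();
            }
            emit(sentence.substring(last), "O");                // trailing text
            System.out.println();                               // sentence boundary
        }

        private static void emit(String text, String label) {
            for (String tok : text.trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    System.out.println(tok + "\t" + label);
                }
            }
        }

        public static void main(String[] args) {
            convert("I met him <date>yesterday</date>.");
            convert("Roger Zelazny was born <date>in 1937</date>");
        }
    }

CRF trainers generally consume tab-separated token/label files of roughly this shape (Stanford's included), so most of the work is this kind of bookkeeping plus feature selection.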

You can even try one of the systems from the Stanford Natural Language Processing Group: the Stanford Named Entity Recognizer.

When you download the tool, note that there are several models; you need the last one:

Included with the Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Time, Location, Organization, Person, Money, Percent, Date

Update: You can actually try that tool online here. Select the muc.7class.distsim.crf.ser.gz classifier and try some text with dates. It doesn't seem to recognize "yesterday", but it recognizes "20th century", for example. In the end, this is a matter of CRF training.


[Stanford NER screenshot]
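If you'd rather call it from code than use the web demo, the same classifier can be loaded through the Java API. A sketch, assuming the jar and the classifiers/ directory from the Stanford NER download are on the classpath (the model file name below is the one shipped with the classic distributions; check your copy):

    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class NerDemo {
        public static void main(String[] args) throws Exception {
            // Serialized 7-class (MUC) model from the Stanford NER download; adjust the path to your copy.
            String model = "classifiers/english.muc.7class.distsim.crf.ser.gz";
            AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(model);

            // classifyWithInlineXML wraps recognized entities in tags such as <DATE>...</DATE>
            String text = "Roger Zelazny was born in 1937.";
            System.out.println(classifier.classifyWithInlineXML(text));
        }
    }

Since your phrases aren't really dates, the pre-trained models are mostly a way to evaluate the idea; the same CRFClassifier can also be trained from scratch on your own annotated corpus (the tool takes a properties file pointing at a tab-separated token/label training file, as described in its documentation).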

esmiralha
Massimiliano

Keep in mind that CRFs are rather slow to train and require human-annotated data, so doing it yourself is not easy. Read the answers to this question for another example of how people often do it in practice; it doesn't have much in common with current academic research.
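For contrast, the "how people often do it in practice" approach is typically a pile of hand-written patterns. A deliberately tiny, purely illustrative sketch (nothing like a real date grammar), mostly to show where the false positives creep in:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HandwrittenMatcher {
        // Illustrative rules only; real rule sets grow into hundreds of these.
        private static final Pattern DATEISH = Pattern.compile(
                "\\b(yesterday|today|tomorrow( morning| evening)?"
                + "|in \\d{4}"
                + "|\\d{1,2}(st|nd|rd|th) of (January|February|March|April|May|June"
                + "|July|August|September|October|November|December))\\b",
                Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            String[] samples = {
                    "I met him yesterday.",
                    "Roger Zelazny was born in 1937",
                    "The serial number ends in 1234"  // false positive: matches the "in \d{4}" rule
            };
            for (String s : samples) {
                Matcher m = DATEISH.matcher(s);
                while (m.find()) {
                    System.out.println("\"" + s + "\" -> " + m.group());
                }
            }
        }
    }

Every new rule tends to fix one miss and open up a new way to overmatch, which is exactly the maintenance problem described in the question; the CRF route above trades that for annotation and training cost.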

Community
mbatchkarov
  • Every algorithm will need some human-annotated data to start with... if computers could classify themselves there would be no need for any of those algorithms =) – Massimiliano Mar 13 '12 at 00:05
  • But different algorithms have different characteristics with regard to training performance, applicability, data formats and error rates, so +1 for a good option to consider. – Massimiliano Mar 13 '12 at 00:08
  • It's definitely something I will try as well, luckily I have thousands of hand-annotated files so there's a lot of data to play around with. Error rates are likely to decide between the different methods. – biziclop Mar 13 '12 at 00:43
  • Is anyone aware of work comparing the two approaches? I'd really like to know what the recall of the regex method is like. – mbatchkarov Mar 13 '12 at 19:35