I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and wrote this quick snippet to test it out:

import nltk

sentence = "Let's meet tomorrow at 9 pm"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(nltk.ne_chunk(pos_tags, binary=True))

I assumed it would identify the date (tomorrow) and the time (9 pm), but surprisingly it failed to recognize either. I get the following result when I run the code above:

(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)

Can someone help me understand whether I am missing something, or whether NLTK is just not mature enough to tag dates and times properly? Thanks!

Franck Dernoncourt
Darth.Vader

3 Answers

The default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.

Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/, which explains the whole process very well.

Also, there is a module called timex in nltk_contrib that might help with your needs: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
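
If training a classifier is more than you need, a rule-based approach can be sketched with plain regular expressions. The following is only a minimal sketch and not the timex module's API: the pattern names and the function find_temporal_expressions are hypothetical, and the patterns cover just a handful of English expressions.

```python
import re

# Hypothetical, illustrative patterns; this is NOT the nltk_contrib timex API.
RELATIVE_DAY = r"\b(?:today|tomorrow|tonight|yesterday)\b"
WEEKDAY = r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b"
CLOCK_TIME = r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b"

DATE_RE = re.compile(f"{RELATIVE_DAY}|{WEEKDAY}", re.IGNORECASE)
TIME_RE = re.compile(CLOCK_TIME, re.IGNORECASE)


def find_temporal_expressions(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = [(m.start(), "DATE", m.group()) for m in DATE_RE.finditer(text)]
    found += [(m.start(), "TIME", m.group()) for m in TIME_RE.finditer(text)]
    # Sort by character offset so dates and times come back in reading order.
    return [(label, span) for _, label, span in sorted(found)]


print(find_temporal_expressions("Let's meet tomorrow at 9 pm"))
# [('DATE', 'tomorrow'), ('TIME', '9 pm')]
```

Rules like these are brittle (they miss "noon", "next week", "half past nine" and so on), so treat them as a starting point rather than a replacement for a trained tagger.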

Viktor Vojnovski

Named entity recognition is not an easy problem, so do not expect any library to be 100% accurate. You shouldn't draw conclusions about NLTK's performance from a single sentence. Here's another example:

sentence = "I went to New York to meet John Smith"

I get

(S
  I/PRP
  went/VBD
  to/TO
  (NE New/NNP York/NNP)
  to/TO
  meet/VB
  (NE John/NNP Smith/NNP))

As you can see, NLTK does very well here. However, I couldn't get NLTK to recognise today or tomorrow as temporal expressions. You can try Stanford SUTime, which is part of Stanford CoreNLP. I have used it before and it works quite well (it is in Java, though).

mbatchkarov
  • Actually, NLTK provides bindings for Stanford's NERTagger (`from nltk.tag.stanford import StanfordNERTagger`). You still need to download the Java source, but there is plenty of help out there. – Pithikos Apr 04 '16 at 09:23

If you wish to correctly identify the date or time in text messages, you can use Stanford's NER.

It uses a CRF (Conditional Random Field) classifier. A CRF is a sequence classifier, so it takes the order of the words into account.

The classification you get therefore depends on how the sentence is phrased.

If your input sentence had been "Let's meet on wednesday at 9am.", Stanford NER would have correctly identified wednesday as a date and 9am as a time.

NLTK supports Stanford NER. Try using it.
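
As a rough sketch of how that might look from Python: the tagger returns a list of (word, tag) pairs, which you can then merge into entities. The helper names group_entities and tag_with_stanford are hypothetical, model_path and jar_path are placeholders for files from the Stanford NER download on your own machine, and the sample list below is hand-written to show the shape of the data, not real tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of adjacent tokens that share a non-O tag into (text, tag) pairs."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities


def tag_with_stanford(tokens, model_path, jar_path):
    """Tag tokens with Stanford NER through NLTK (needs Java and the Stanford files).

    model_path and jar_path are placeholders, not real locations.
    """
    # Imported lazily so the grouping helper works without NLTK installed.
    from nltk.tag.stanford import StanfordNERTagger
    tagger = StanfordNERTagger(model_path, jar_path)
    return tagger.tag(tokens)


# Hand-written sample in the (word, tag) shape described above; not real output.
sample = [("Let", "O"), ("'s", "O"), ("meet", "O"), ("on", "O"),
          ("wednesday", "DATE"), ("at", "O"), ("9am", "TIME")]
print(group_entities(sample))
# [('wednesday', 'DATE'), ('9am', 'TIME')]
```

Note that this grouping merges any adjacent tokens with the same tag, so two distinct entities of the same type that sit next to each other would be joined into one.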

Rohan Amrute