
I want to recognize simple phrases, much as Google Calendar does, but instead of calendar entries I have to parse sentences related to finance, accounting, and to-do lists. For example, I have to parse sentences like:

I spent 50 dollars on food yesterday

I need to separate the information and mark it up as Reason: 'food', Cost: 50, and Time: <yesterday's date>.

My question is: should I go for full-fledged natural language processing, as described in the questions below, and use something like GATE?

Machine Learning and Natural Language Processing

Natural Language Processing in Ruby

Ideas for Natural Language Processing project?

https://stackoverflow.com/a/3058063/492561

Or is it better to write simple grammars using something like ANTLR and try to recognize the sentences with them?

Or should I go really low-level and just define a syntax and use regular expressions?

Time is a constraint: I have about 45 to 50 days, and I don't know how to use ANTLR or NLP libraries like GATE.

Preferred languages: Python, Java, Ruby (in no particular order).

PS: This is not homework, so please don't tag it as such.

PPS: Please try to back your answer with facts on why a particular method is better. Even if a method may not fit within the time constraint, please feel free to share it, because it might benefit someone else.

Gautam
  • You are really looking for a natural language processing grammar here... If Java, you could consider parboiled. But defining a _grammar_ will be the hardest part of all, whatever the tool you use. Good luck! – fge Jan 13 '12 at 12:34
  • Thanks for that, @fge. Could you please elaborate with some links if possible, and post it as an answer? – Gautam Jan 13 '12 at 12:35
  • @fge, after mentioning NLP, you talk about `parboiled`, but this tool is a PEG parser, not an NLP tool. If the OP chooses to use an NLP tool, there's probably no need to tinker with any grammars: such tools come shipped with a couple of predefined languages (grammars) already. – Bart Kiers Jan 13 '12 at 12:39
  • @BartKiers which is why I said that you'd need to define a grammar... It is not impossible, but extremely difficult. – fge Jan 13 '12 at 12:44
  • @fge, err, no, not when using an NLP tool(kit): these already contain languages (grammars). – Bart Kiers Jan 13 '12 at 12:58
  • @BartKiers I see your point -- I just say you'd have to define a language grammar if you went on using parboiled... And tbh I don't know any NLP tool for Java – fge Jan 13 '12 at 13:00

2 Answers


You could indeed look at named entity recognition. From your question I understand that your domain is fairly well defined, so you can identify the (few?) kinds of entities that are relevant to you (dates, currencies, money amounts, time expressions, etc.). If the phrases are very simple, you could go with a rule-based approach; otherwise the rules are likely to get too complex too soon.
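As a rough illustration of the rule-based route for the example sentence in the question, here is a minimal sketch using only Python's standard library. The pattern, the currency list, the `DAY_OFFSETS` table, and the `parse_expense` helper are all assumptions made up for this example; they cover one sentence shape and nothing more.

```python
import re
from datetime import date, timedelta

# Hypothetical rule for one sentence shape:
#   "I spent <amount> <currency> on <reason> [today|yesterday]"
EXPENSE_RE = re.compile(
    r"I spent (?P<cost>\d+(?:\.\d+)?) (?P<currency>dollars|euros|rupees)"
    r" on (?P<reason>.+?)(?: (?P<when>today|yesterday))?$",
    re.IGNORECASE,
)

# Relative day words mapped to an offset in days (illustrative, not exhaustive).
DAY_OFFSETS = {"today": 0, "yesterday": -1}


def parse_expense(sentence, today=None):
    """Return a dict with Reason, Cost and Time, or None if the rule does not match."""
    today = today or date.today()
    match = EXPENSE_RE.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    # Assumption: a sentence without a time word refers to today.
    when = (match.group("when") or "today").lower()
    return {
        "Reason": match.group("reason"),
        "Cost": float(match.group("cost")),
        "Time": today + timedelta(days=DAY_OFFSETS[when]),
    }


if __name__ == "__main__":
    print(parse_expense("I spent 50 dollars on food yesterday"))
```

Every new phrasing ("Yesterday I paid 50 dollars for food") needs another rule, which is where this approach starts to get complex.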

To get yourself up and running in a few seconds, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is a very nice example of what you could do. Of course I would not expect high accuracy from just 6 lines of Python, but it should give you an idea of how it works:

1>>> import nltk
2>>> def extract_entities(text):
3...     for sent in nltk.sent_tokenize(text):
4...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5...             if hasattr(chunk, 'label'):  # NLTK 3+: subtrees expose label() instead of .node
6...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))

The core idea is on lines 3 and 4. Line 3 splits the text into sentences and iterates over them. Line 4 splits each sentence into tokens, runs part-of-speech tagging on them, and then feeds the POS-tagged sentence to the named entity recognition algorithm. That is the very basic pipeline.

In general, NLTK is a beautiful piece of software and very well documented; I would look at it. Other answers contain very useful links.

Savino Sguera

Your task is a type of information extraction task, specifically relation/fact extraction, preceded by named entity recognition.
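To make the two stages concrete, here is a toy sketch in plain Python. Stage 1 uses hand-written patterns as a stand-in for a real named entity recognizer, and stage 2 combines the entities it finds into a single expense fact. The `Entity` class, the `ENTITY_PATTERNS` table, and both functions are hypothetical names invented for this illustration, not part of any framework.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    label: str  # entity type, e.g. "MONEY" or "DATE"
    text: str   # matched surface text
    start: int  # character offset in the sentence


# Stage 1 stand-in: toy recognizers; a trained NER model would replace these.
ENTITY_PATTERNS = {
    "MONEY": re.compile(r"\d+(?:\.\d+)? (?:dollars|euros)", re.IGNORECASE),
    "DATE": re.compile(r"\b(?:today|yesterday|tomorrow)\b", re.IGNORECASE),
}


def find_entities(sentence):
    """Stage 1: return all recognized entities, ordered by position."""
    entities = [
        Entity(label, m.group(0), m.start())
        for label, pattern in ENTITY_PATTERNS.items()
        for m in pattern.finditer(sentence)
    ]
    return sorted(entities, key=lambda e: e.start)


def extract_expense(sentence):
    """Stage 2: relate the entities into one fact, or None without a MONEY entity."""
    entities = find_entities(sentence)
    money = next((e for e in entities if e.label == "MONEY"), None)
    if money is None:
        return None
    when = next((e for e in entities if e.label == "DATE"), None)
    # Assumption: the reason is the text after the amount, up to a following DATE.
    tail_start = money.start + len(money.text)
    tail_end = when.start if when and when.start > tail_start else len(sentence)
    reason = re.sub(r"^\s*on\s+", "", sentence[tail_start:tail_end]).strip(" .")
    return {
        "Reason": reason or None,
        "Cost": money.text,
        "Time": when.text if when else None,
    }


if __name__ == "__main__":
    print(extract_expense("I spent 50 dollars on food yesterday"))
    print(extract_expense("Yesterday I spent 20 euros on a taxi"))
```

Because the entities are found independently of their order, both sentences in the usage example go through the same relation step, which is the main benefit of separating the two stages.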

Take a look at the following frameworks for Java/Python:

cyborg