
I am trying to build a system which identifies various commands and inputs in written, human-entered text. I'll start with an example to make things clearer. Suppose the user inputs the following text:

My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.

Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding values (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").

My current attempt was to tackle this via some regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That does extract the text between markers, but building everything on regexes is going to be very cumbersome, especially once I start taking inflections and small wording variations into account.
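To make this concrete, here is a simplified sketch of the kind of pattern I mean. The marker list and the handling of filler words like "my"/"and" are hard-coded just for this example:

```python
import re

text = ("My name is John Doe, my age is 28 years old, my address is "
        "Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.")

markers = ["name is", "age is", "address is", "I like"]

# For each marker, capture everything up to the next marker (or the end of the
# text), optionally skipping a comma/period and filler words like "my"/"and"
# that introduce the next clause.
next_marker = r"(?:,|\.)?\s*(?:my\s+|and\s+)?(?:" + "|".join(re.escape(m) for m in markers) + r")"

extracted = {}
for marker in markers:
    pattern = re.compile(re.escape(marker) + r"\s+(.*?)(?=" + next_marker + r"|$)",
                         re.IGNORECASE | re.DOTALL)
    match = pattern.search(text)
    if match:
        extracted[marker] = match.group(1).strip(" ,.")

print(extracted)
# {'name is': 'John Doe', 'age is': '28 years old',
#  'address is': 'Barkley Street no. 7 Havana',
#  'I like': 'chocolate cake with strawberries and vanilla'}
```

This works for the example above, but every new filler word or phrasing variation means patching the patterns, which is exactly what I want to avoid.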

I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?

  • While I might not know NLP very well, I can tell you this with 100% certainty: regexes are ***NOT*** powerful enough to do reliable text interpretation. Natural languages are extremely complex and often contradictory, and even more powerful constructs than regexes (like [context-free grammars](https://en.wikipedia.org/wiki/Context-free_grammar)) are insufficient to process natural-language texts. – Sebastian Lenartowicz Sep 07 '16 at 20:57
  • You could try to split up the string based on the markers. http://www.regexformat.com/version_files/Rx5_ScrnSht01.jpg –  Sep 07 '16 at 21:09
  • Then you could further split the results based on different markers until you arrive at the values you want. Recap: split on primary markers, then split on secondary markers (which could be a subset of the primary); a rough sketch of this follows the comments. –  Sep 07 '16 at 21:21
  • Well, I pretty much agree with Sebastian that regexes are not powerful or appropriate enough for this scenario. However, I'm not really sure where to start with regard to alternatives. The problem with markers is that, even if I enforce a set of "rules", people can still write variations such as "My name is John and my age is 28", "Name John and age 28", "I'm called John", "I am called John", etc. – Cosmin SD Sep 09 '16 at 08:18
  • You could try an approach like the one the [RegEx JSON parser](http://stackoverflow.com/a/30494373/2165759) implements; take a look at `Sub ParseJson()`. You replace elementary character sequences and collocations with tokens, then replace combinations of tokens with a single higher-level token, and repeat in a loop until you get a top-level token, which means successful recognition. Then you extract the necessary data based on the nesting and structure of the tokens. – omegastripes Sep 09 '16 at 20:37
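Here is a rough sketch of the splitting idea suggested in the comments, using `re.split` with a capturing group so the markers are kept in the output. The cleanup of trailing punctuation and filler words is an assumption for this example:

```python
import re

text = ("My name is John Doe, my age is 28 years old, my address is "
        "Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.")

primary_markers = ["name is", "age is", "address is", "I like"]

# re.split with a capturing group keeps the markers, so the result alternates:
# [text before 1st marker, 1st marker, text after it, 2nd marker, ...]
parts = re.split(r"(" + "|".join(re.escape(m) for m in primary_markers) + r")", text)

extracted = {}
for marker, value in zip(parts[1::2], parts[2::2]):
    # naive cleanup: drop trailing punctuation and the filler word ("my"/"and")
    # that introduces the next clause
    extracted[marker] = re.sub(r"[\s,.]*(?:my|and)?[\s,.]*$", "", value).strip()

print(extracted)
```

A second pass could then split each extracted value on secondary markers in the same way.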

2 Answers


What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:

https://en.wikipedia.org/wiki/Information_extraction

To start solving your problem with something approaching the state of the art, I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging), but their NER toolkit won't take you very far with your specific requirements. You could try their SPIED to help you, but I haven't used it and can't vouch for it. Ultimately, if you are serious about this task (which on the face of it sounds quite hard), you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine learning features (start with a simple ML library like LibSVM or Mallet), but regardless it will be a lot of work.
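To make the "regular expressions as machine learning features" idea concrete, here is a rough sketch of a per-token classifier using scikit-learn (whose `SVC` wraps LibSVM). It is not a complete NER system; the feature set, toy training data, and labels are invented purely for illustration:

```python
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def token_features(tokens, i):
    """Features for the i-th token: the word itself plus regex-based cues."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    return {
        "word": word.lower(),
        "prev_word": prev_word.lower(),
        "is_capitalized": word[0].isupper(),
        "looks_like_number": bool(re.fullmatch(r"\d+", word)),   # regex as a feature
        "prev_is_marker_word": prev_word.lower() in {"is", "like"},
    }

# Toy training data: token sequences with per-token labels.
sentences = [
    ("My name is John Doe".split(),    ["O", "O", "O", "NAME", "NAME"]),
    ("my age is 28 years old".split(), ["O", "O", "O", "AGE", "O", "O"]),
]

X, y = [], []
for tokens, labels in sentences:
    for i, label in enumerate(labels):
        X.append(token_features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = SVC(kernel="linear")
clf.fit(vec.fit_transform(X), y)

test_tokens = "my name is Jane Smith".split()
test_X = [token_features(test_tokens, i) for i in range(len(test_tokens))]
print(list(zip(test_tokens, clf.predict(vec.transform(test_X)))))
```

In practice you would need far more training data and richer features, but the shape of the problem (features per token, a label per token) stays the same.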

Good luck!

ozborn

If the requirement is to identify named entities such as person, place, or organisation, then one could use the StanfordNER library in Python. Additionally, there is the option of training one's own custom entity recognition model using the CRF algorithm in Python. Here is an article explaining the same.
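For illustration, here is a minimal sketch of calling the pre-trained Stanford NER models from Python through NLTK's wrapper. The jar and model file paths are placeholders for wherever the Stanford NER distribution was downloaded, and a Java runtime must be installed:

```python
from nltk.tag import StanfordNERTagger

# Paths are placeholders: both files come from the Stanford NER download.
st = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # pre-trained 3-class model
    "stanford-ner.jar",                       # jar from the same distribution
    encoding="utf-8",
)

text = "My name is John Doe, my address is Barkley Street no. 7 Havana."
print(st.tag(text.split()))
# e.g. [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON'), ...]
```

Note that the standard 3-class model only covers PERSON / LOCATION / ORGANIZATION, so custom categories like the age or food preference in the question would still require training your own model, e.g. with a CRF as mentioned above.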