
Let's say that I have a pool (a list) of well-known phrases, like: { "I love you", "Your mother is a ...", "I think I am pregnant" ... } — about 1,000 of them. Now I want users to enter free text into a text box, and to use some kind of NLP engine to digest the text and find the 10 phrases from the pool that are most relevant to it.

  1. I thought the simplest implementation could work word by word: picking one word at a time and looking for similarities in some way. I'm not sure which way, though.
  2. What frightens me most is the size of the vocabulary I would have to support. I am a single developer building a demo of sorts, and I don't like the idea of filling a table with words by hand...
  3. I am looking for a free NLP engine. I am agnostic about the language it's written in, but it must be free — NOT some kind of online service that charges per API call.
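The word-by-word idea in point 1 could be sketched without any NLP library at all — just score each phrase in the pool by word overlap with the input text and keep the top matches. A minimal pure-Python sketch (the function name and tokenization are illustrative assumptions, not from any particular library):

```python
def top_matches(text, pool, k=10):
    """Rank phrases by word overlap with the input text (Jaccard similarity)."""
    text_words = set(text.lower().split())

    def score(phrase):
        phrase_words = set(phrase.lower().split())
        overlap = text_words & phrase_words
        union = text_words | phrase_words
        return len(overlap) / len(union) if union else 0.0

    # Stable sort: phrases with higher overlap come first.
    return sorted(pool, key=score, reverse=True)[:k]

pool = ["I love you", "I think I am pregnant", "Your mother is a ..."]
print(top_matches("i love this, i really do", pool, k=2))
# => ['I love you', 'I think I am pregnant']
```

This naive overlap ignores word order and synonyms, but it gives a baseline to compare any real NLP engine against.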
tshepang
  • Have you tried `OpenNLP` from Apache? – Willem Van Onsem Sep 16 '13 at 20:53
    When you're working with NLP, you'll have to manage lots of data! The trick is really working with your language to make it as space- and time-efficient as you can, and representing the data abstractly (for instance, in some NLP methods the vector `[101]` can represent a document with 100 words whereas `[001]` represents some other 100-word document). Your concern #2 shouldn't worry you too much. That's why we have fast machines and good programming languages. ;) – arturomp Sep 17 '13 at 00:55
    [word similarity](http://stackoverflow.com/a/16922499/583834) is more defined than phrase/sentence similarity. google or search SO for "sentence similarity" - some helpful results come up. also look at http://stackoverflow.com/q/6704499/583834 – arturomp Sep 17 '13 at 01:00

2 Answers


It seems that TextBlob and ConceptNet are more than adequate solutions to this problem!


TextBlob is an easy-to-use NLP library for Python that is free and open source (licensed under the permissive MIT License). It provides a nice wrapper around the excellent NLTK and pattern libraries.

One simple approach to your problem would be to extract noun phrases from your given text.

Here's an example from the TextBlob docs.

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact."
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
print(blob.noun_phrases)
# => ['titular threat', 'blob', 'ultimate movie monster', ...]

This could be a starting point. From there you could experiment with other methods, such as the similarity methods mentioned in the comments or TF-IDF. TextBlob also makes it easy to swap in different models for noun phrase extraction.
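The TF-IDF route mentioned above can be sketched in pure Python: weight each word by how rare it is across the phrase pool, then rank phrases by cosine similarity against the input text. All function names here are hypothetical, and the smoothed IDF formula is one common convention, not the only one:

```python
import math
from collections import Counter

def vectorize(tokens, df, n_docs):
    """TF-IDF weights; smoothed IDF so words unseen in the pool stay finite."""
    tf = Counter(tokens)
    return {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1.0) for w, c in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_phrases(text, pool, k=10):
    """Return the k pool phrases most similar to the text under TF-IDF cosine."""
    docs = [p.lower().split() for p in pool]
    # Document frequency: in how many phrases does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    vecs = [vectorize(doc, df, len(docs)) for doc in docs]
    query = vectorize(text.lower().split(), df, len(docs))
    scored = sorted(zip(pool, vecs), key=lambda pv: cosine(query, pv[1]), reverse=True)
    return [p for p, _ in scored[:k]]

pool = ["I love you", "I think I am pregnant", "Your mother is a ..."]
print(rank_phrases("do you think you love me", pool, k=2))
# => ['I love you', 'I think I am pregnant']
```

With only 1,000 short phrases this runs instantly; for larger pools a library such as scikit-learn's vectorizers would be the more practical choice.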

Full disclosure: I am the author of TextBlob.

Steve L
  • Nice library!! As an nltk + pattern user I'm looking forward to checking it out. That being said, I'm not sure how far this goes, since noun phrases are only part of the equation... I also doubt there's enough data in short texts to make TF-IDF useful (although I'm open to being wrong!) – arturomp Sep 17 '13 at 14:42
  • @Steve I have found your answer interesting... Can this python package be used for a large dataset... Check out my question http://stackoverflow.com/questions/21331456/which-is-the-best-fastest-nlp-solution-for-large-data – Ankit Jan 24 '14 at 11:33