NLP, algorithms for determining if block of text is "similar" to other (after already having matched for keyword)

Question

I've been reading up on NLP as much as I can and searching on here, but I haven't found anything that addresses exactly what I am trying to do. I am pretty new to NLP, with only minor prior exposure. So far I have gotten the NLP processor I'm using working to the point where I can extract the POS (part-of-speech) tags from the text.

I am just working with a small sample document and then with one "input phrase" that I am basically trying to find a match for. The code I've written so far basically does this:

Takes the input phrase and the "searchee" (the document being searched) and breaks each into a list of individual words, then gets the POS tag for each word. The user also supplies one keyword that appears in the input phrase (and should appear in the document being searched).
Both lists are searched for that keyword. At the first place the keyword is found in each, a fixed number of words before and after it (such as 5) are taken. These are put into a dataset for processing, so if one article had:

keyword: football

"A lot of sports are fun, football is a great, yet very physical sport." - Then my process would truncate this down to "are fun, football is a"
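The windowing step described above can be sketched roughly as follows. This is only an illustration: `context_window` and its parameters are hypothetical names, and splitting on whitespace is an assumption (a real tokenizer would separate punctuation from words).

```python
def context_window(tokens, keyword, size=2):
    """Return up to `size` tokens on each side of the first match of `keyword`."""
    for i, token in enumerate(tokens):
        # Assumption: ignore case and surrounding punctuation when matching.
        if token.strip('.,;:!?"').lower() == keyword.lower():
            # Slicing clamps at the list ends, so short contexts are fine.
            return tokens[max(0, i - size):i + size + 1]
    return []  # keyword not found


sentence = "A lot of sports are fun, football is a great, yet very physical sport."
window = context_window(sentence.split(), "football", size=2)
print(" ".join(window))  # are fun, football is a
```

The same function could be run over a parallel list of POS tags by using the matched index instead of the tokens themselves.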

My goal is to compare pieces such as "are fun, football is a" for similarity, in the sense of whether they are likely to be used in a similar context.

I'm wondering if anyone can point me in the right direction as far as patterns that could be used for this, algorithms, etc. The example above is simplistic, just to give an idea, but I would be planning to make this more complex if I can find the right place to learn more about this. Thanks for any info

After more searching, this question seems to touch on something similar: http://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-very-simple-graphical-way – Rick Nov 12 '11 at 21:42
This seems to be a good resource too: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html. Thanks everyone for helping me get on the right path; now I am beginning to understand enough of the terminology to know what to search for. – Rick Nov 12 '11 at 21:49
@eowl, thanks, I just committed to the proposal for the NLP site – Rick Dec 11 '11 at 18:32

score 4 · Accepted Answer · answered Nov 12 '11 at 13:20

It seems you're solving the good old KWIC (keyword in context) problem. That can be done with indexing, or just a simple for loop through the words in a text:

for i, token in enumerate(text):  # text is a list of words
    if token == word:
        # slicing clamps at the list ends, so no index goes out of range
        emit(text[max(0, i - 2):i + 3])

Where emit might mean print them, store them in a hashtable, whatever.

Fred Foo

I think this is a bit more basic than what I'm trying to do. I'm not looking for an exact match, as I don't expect to find one – Rick Nov 12 '11 at 19:10
@Rick: it's not entirely clear to me what the final output of your program should be, but for investigating the contexts in which a word may occur you might want to read the collocation finding chapter of [FSNLP](http://nlp.stanford.edu/fsnlp/) or otherwise read up on [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information). – Fred Foo Nov 12 '11 at 19:15
Thanks, will check that out. I guess it's a little hard for me to explain what I think it will be, but it's along the lines of trying to tell whether a phrase is "close" to another phrase without being an exact match – Rick Nov 12 '11 at 19:24

score 2 · Answer 2 · answered Nov 12 '11 at 10:04

What you are trying to do is more of a classic Information Retrieval problem than NLP, though they are very similar. You are building a Term-Frequency dictionary.

I'm not sure what you mean by POS, but you are trying to extract "shingles" of phrases from the text and compare them with other shingles in your corpus. You can compute similarity via cosine similarity or by calculating the string edit distance between the phrases.
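As a rough sketch of those two measures, the snippet below computes cosine similarity over term-frequency counts and a word-level Levenshtein edit distance. The function names and sample phrases are made up for illustration, and treating each phrase as a plain list of tokens (which could equally be POS tags) is an assumption.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two sequences; each edit costs 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


p1 = "are fun football is a".split()
p2 = "are great football is the".split()
print(cosine_similarity(p1, p2))  # higher means more shared vocabulary
print(edit_distance(p1, p2))      # lower means fewer word-level edits
```

Cosine similarity ignores word order, while edit distance is sensitive to it, so the two can disagree on the same pair of phrases.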

It may help to review some introductory IR slides to clarify these concepts. Dr. Rao Kambhampati generously makes slides and audio lectures available on his site.

Visionary Software Solutions

Thanks for the info, by POS I just mean the basic parts of speech, like the ones listed at "Word Level" here: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html The info you gave should be a good starting point – Rick Nov 12 '11 at 19:07
Also, not sure I was clear in the OP, I am trying to compare patterns in the POS for phrases, not so much the actual words, since I have already matched for keyword – Rick Nov 12 '11 at 21:03

score 1 · Answer 3 · answered Nov 12 '11 at 10:19

If you just want to generate text, you can look here: http://phpir.com/text-generation. If you want to look for similarities, you can try a trigram search or, more simply, a wildcard search with a trie: http://phpir.com/tries-and-wildcards. Here is a good article about shingling: http://phpir.com/shingling-near-duplicate-detection
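To give a feel for shingling, here is a minimal sketch that builds word trigram shingles and compares two texts by Jaccard overlap. It is not taken from the linked articles; the helper names, the shingle size of 3, and the choice of Jaccard as the comparison are all assumptions for illustration.

```python
def shingles(tokens, n=3):
    """Return the set of all n-token shingles (contiguous runs) in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


s1 = shingles("a lot of sports are fun".split())
s2 = shingles("a lot of games are fun".split())
print(jaccard(s1, s2))  # fraction of trigrams the two texts share
```

Smaller shingle sizes tolerate more differences between the texts; larger ones require longer exact runs to count as overlap.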

Micromega

Thanks, I did come across some things regarding tries earlier, will look into that more – Rick Nov 12 '11 at 19:09
