
So I am trying to write a program that will take in 2 strings, for example:

"I like pizza better cold"

And

"I really enjoy pizza when it is chilled"

And figure out that these two sentences match each other, in contrast to a pair like:

"I like pizza better cold"

And

"Pizza really sucks."

Where the above would not be a match.

I have come across NLTK, the natural language toolkit for Python, as a way to do this. Has anyone out there worked on something like this before, and do you have any advice? Is NLTK the way to go? Are there any functions or features I should use?

I am thinking about splitting the strings into tokens, picking out the adjectives and nouns as the main tags, and then possibly using a sentiment analysis algorithm to determine whether each string is positive or not. I would then match the strings based on this...

This is just a small side project I am working on for fun, so anything would be beneficial here :)

Cheers, Will

Ahsanul Haque
Willy
  • I don't think `NLTK` has something like this. You would have to write a custom program in which each sentence, such as `pizza really sucks`, has a list of associated words, such as `chill` and `cold` for the sentence above. In short, you need a dictionary of words that point to a sentence when they are present in the input. – Nikhil Parmar Jan 18 '16 at 05:06
  • Mhhhrgh, I think you should start by getting a better understanding of NLP. This is, of course, a difficult topic (and I'm no more than an amateur at that). You have a nice start here: http://www.nltk.org/book_1ed/ I don't quite understand your problem, but it smells like POS tagging, where POS stands for Part of Speech. Pretty much what you do at primary school. – finiteautomata Jan 18 '16 at 05:22
  • In http://www.nltk.org/book/ch05.html you have an introduction to POS tagging with NLTK. – finiteautomata Jan 18 '16 at 05:23
  • @geekazoid I think this is a classification problem: the sentences about `chilled` and `cold` pizza should be classified as the same, but the `sucks` one should not. A particular sentence can be written in many ways, so I don't think `POS` tagging is of any help here. – Nikhil Parmar Jan 18 '16 at 05:29
  • @NikhilParmar ok, that's a point of view. I could see it as checking whether the subject/object of the sentences are the same. But that's up to the OP :) – finiteautomata Jan 18 '16 at 05:32
  • Thanks for weighing in, @geekazoid and NikhilParmar, I really appreciate it! I will for sure go read up on the NLTK book there, Geekazoid, and check out chapter 5. To clarify further, I think I am doing a combination of both methods mentioned here, as I am trying to understand and match context by tagging the string that was put in. So it would pull out Pizza as the noun and cold/chilled as the adjective, then find that the sentiment is positive. After it tags "Cold Pizza - Positive" or something along those lines, it will match to similar strings – Willy Jan 18 '16 at 05:56
  • @NikhilParmar mentioned above as well. Again, I really appreciate you weighing in like this! I just want to have the kinks in my thought process sorted out by smarter people like yourselves haha – Willy Jan 18 '16 at 05:57
  • One simple way would be to take composed word vectors (additive or multiplicative) of each sentence and then use some distance metric to compute the distance between any two sentences. – Riyaz Jan 18 '16 at 17:36
  • @Riyaz would you happen to have any reference material or articles on Python implementations? :) I was thinking of computing something first (possibly along the lines of this additive or multiplicative composition) to classify, then running a distance measure like the cosine distance to do a quick match, and then, if nothing matches, doing a deeper search in a 3-pass style system. – Willy Jan 19 '16 at 01:32
  • Check the Gensim package at https://radimrehurek.com/gensim/models/word2vec.html and this article about composition: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf – Riyaz Jan 19 '16 at 07:30
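
The additive composition suggested in the comments can be sketched roughly as follows. This is only an illustration under loud assumptions: the 3-dimensional vectors in `TOY_VECTORS` are made up by hand, and the helper names `sentence_vector` and `cosine_distance` are hypothetical. Real vectors would come from a trained model such as word2vec.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# A trained model (for example word2vec) would supply real vectors.
TOY_VECTORS = {
    "pizza":   [1.0, 0.0, 0.0],
    "cold":    [0.0, 1.0, 0.2],
    "chilled": [0.0, 0.9, 0.3],
    "like":    [0.1, 0.0, 1.0],
    "enjoy":   [0.0, 0.1, 0.9],
    "sucks":   [0.0, 0.0, -1.0],
}


def sentence_vector(sentence):
    """Additive composition: sum the vectors of all known words."""
    total = [0.0, 0.0, 0.0]
    for word in sentence.lower().split():
        vec = TOY_VECTORS.get(word.strip(".,!?"))
        if vec is not None:  # words without a vector are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_distance(u, v):
    """1 minus cosine similarity; 1.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


base = sentence_vector("I like pizza better cold")
close = sentence_vector("I really enjoy pizza when it is chilled")
far = sentence_vector("Pizza really sucks.")

print(cosine_distance(base, close))  # small distance with these toy vectors
print(cosine_distance(base, far))    # much larger distance
```

With these made-up vectors the two "cold pizza" sentences end up close together and the "sucks" sentence ends up far away, but that outcome depends entirely on the vectors chosen.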

1 Answer


From your question, it sounds like you want to compare two sentences and find out how closely they match, for example as a percentage.

To measure the similarity between sentences, you can use Jaccard similarity or cosine similarity.

For cosine similarity, refer to the question "How to calculate cosine similarity given 2 sentence strings? - Python".

If the cosine similarity is close to 0, the sentences are not similar; if it is close to 1, the sentences are similar.
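
As a rough sketch of both measures, here is a standard-library version that works on bag-of-words counts. The helper names (`tokenize`, `jaccard_similarity`, `cosine_similarity`) and the simple regex tokenizer are my own assumptions, not something taken from NLTK or from the linked question.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard_similarity(a, b):
    """Shared vocabulary size divided by combined vocabulary size."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


base = "I like pizza better cold"
print(jaccard_similarity(base, "I really enjoy pizza when it is chilled"))
print(cosine_similarity(base, "I really enjoy pizza when it is chilled"))
print(cosine_similarity(base, "Pizza really sucks."))
```

On the sentences from the question, plain word overlap gives low scores for both pairs, because the matching pair shares only "I" and "pizza". Word overlap alone does not know that "cold" and "chilled" mean the same thing.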

NLTK can be used to find synonyms of the words in a sentence, so that the comparison captures some of the meaning rather than only the exact wording.

To find synonyms, you could use the following code:

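from nltk.corpus import wordnet as wn  # run nltk.download("wordnet") once beforehand
synsets = wn.synsets("cold")  # one synset per sense of the word
synonyms = {name for synset in synsets for name in synset.lemma_names()}  # all synonym strings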
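
To show how synonyms could feed into the similarity score, here is a minimal sketch that maps each word to one representative of its synonym group before computing Jaccard similarity. The `SYNONYM_GROUPS` table is hand-written and hypothetical; it stands in for groups that could be built from WordNet synsets. The function names are also my own.

```python
import re

# Hypothetical hand-written synonym groups, used here instead of WordNet
# so that the example runs without any downloads.
SYNONYM_GROUPS = [
    {"like", "enjoy"},
    {"cold", "chilled"},
]

# Map every word to one fixed representative of its group
# (the alphabetically smallest member).
CANONICAL = {word: min(group) for group in SYNONYM_GROUPS for word in group}


def normalized_tokens(text):
    """Lowercase, tokenize, and replace each word with its representative."""
    words = re.findall(r"[a-z']+", text.lower())
    return {CANONICAL.get(word, word) for word in words}


def synonym_aware_jaccard(a, b):
    """Jaccard similarity computed after synonym normalization."""
    set_a, set_b = normalized_tokens(a), normalized_tokens(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


base = "I like pizza better cold"
print(synonym_aware_jaccard(base, "I really enjoy pizza when it is chilled"))
print(synonym_aware_jaccard(base, "Pizza really sucks."))
```

With this table, the first pair scores noticeably higher than the second, since "like"/"enjoy" and "cold"/"chilled" now count as shared words. Note that this says nothing about sentiment; a sentence like "I hate cold pizza" would still overlap heavily.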
Community
Rohan Amrute
  • The implementation of the cosine similarity would be a pretty good "reassurance" check for sure, or a first-pass check. Since an NLTK implementation would require a heavier process, this could be a quick check before a heavier comparison function is used. Thank you for the information, I am sure this will make its way into the implementation in the end :) I was also looking into hammering distance. This would follow along the same lines as the cosine similarity here, am I right? Cheers – Willy Jan 18 '16 at 06:32
  • I don't know anything about Hammering distance; I need to read up on it. Happy to help you :) – Rohan Amrute Jan 18 '16 at 07:16
  • It must be Hamming distance, not hammering distance: https://en.wikipedia.org/wiki/Hamming_distance – Riyaz Jan 18 '16 at 17:38
  • Thanks for the correction @Riyaz! Yes, I mean Hamming Distance :) – Willy Jan 19 '16 at 01:34
  • @Willy You could accept this answer if you find it useful – Rohan Amrute Jan 19 '16 at 05:32
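
For reference, the Hamming distance mentioned in the comments counts the positions at which two equal-length sequences differ. A minimal sketch follows; the function name and the choice to raise `ValueError` on unequal lengths are my own assumptions.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Character-level comparison of two 7-letter strings.
print(hamming_distance("karolin", "kathrin"))  # → 3

# Word-level comparison works the same way on token lists of equal length.
print(hamming_distance("I like pizza cold".split(), "I like pizza hot".split()))  # → 1
```

Because it requires equal lengths and compares position by position, it is a poor fit for sentences like the ones in the question, which differ in word count and word order.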