22

I don't know whether StackOverflow covers NLP, so I'm going to give this a shot. I'm interested in finding the semantic relatedness of two words from a specific domain, e.g. "image quality" and "noise". I'm doing some research to determine whether camera reviews are positive or negative with respect to a particular attribute of the camera (like image quality in each of the reviews).

However, not everybody uses the exact wording "image quality" in their posts, so I want to see if there is a way for me to build something like this:

"image quality" which includes ("noise", "color", "sharpness", etc etc) so I can wrap all everything within one big umbrella.

I am doing this for another language, so WordNet is not necessarily helpful. And no, I do not work for Google or Microsoft, so I do not have data on people's clicking behaviour as input either.

However, I do have a lot of text that is POS-tagged, segmented, etc.

Cyclotron3x3
sadawd
  • It would be useful if you could say more about the data you're working with and the exact task you would like to perform. Are you trying to classify the contents of individual reviews as being positive or negative? Or, are you assuming that the reviews are already labeled as positive or negative and you are trying to figure out what attributes of the camera lead to a user's feelings about the product (e.g., the product was given 1 out of 5 stars, and the user mentions 'image quality' in the review, so you infer that the image quality is bad)? – dmcer Mar 14 '10 at 06:46
  • Oops. Confused NLP/Natural Language Processing with NLP/Neuro-Linguistic Programming. My bad. – JUST MY correct OPINION Mar 14 '10 at 07:48
  • 1) I would like to find the umbrella classification of terms: how multiple attributes actually belong to the same category (I guess this is classification then?). I have only dealt with classification through machine learning methods, which I highly doubt can be applied to NLP. 2) I basically want something to tell me the similarity between two concept terms: "focus" vs "details" should be higher than "camera weight" vs "flash" – sadawd Mar 14 '10 at 07:53

8 Answers

5

Check out the Google similarity distance - http://arxiv.org/abs/cs.CL/0412098 - e.g. if lots of webpages include both terms, they're probably related.
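As a minimal sketch of the idea in that paper, the Normalized Google Distance can be computed from page-hit counts; the counts below are made-up numbers, not real search results:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance.
    fx, fy: page counts for each term alone,
    fxy: page count for both terms together, n: total indexed pages."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Hypothetical counts: terms that co-occur often get a smaller distance
close = ngd(1000, 800, 700, 1e9)   # frequent co-occurrence -> related
far = ngd(1000, 800, 10, 1e9)      # rare co-occurrence -> unrelated
```

Terms that almost always appear together get a distance near 0, so `close` comes out smaller than `far`.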

Demo program at http://mechanicalcinderella.com

Other than that, you could try to translate a project like WordNet (Google Translate could help), or start a collaborative ontology.

Sweet Burlap
  • On that demo the connection of (programming, animal) is stronger than the connection of (programming, html): http://www.mechanicalcinderella.com/index.php?inset%5B%5D=animal&inset%5B%5D=html&inset%5B%5D=&inset%5B%5D=&inset%5B%5D=&inatr%5B%5D=programming&inatr%5B%5D=&inatr%5B%5D=&inatr%5B%5D=&inatr%5B%5D=&domena=#results – Mher Jul 03 '14 at 10:35
5

In order to find semantic similarity between words, a word space model should do the trick. Such a model can be implemented very easily and fairly efficiently. Most likely, you will want to implement some sort of dimensionality reduction. The easiest one I can think of is Random Indexing, which has been used extensively in NLP.

Once you have your word space model, you can calculate similarities (e.g. cosine similarity) between words. In such a model, you should get the results you mentioned earlier (the similarity between "focus" and "details" should be higher than that between "camera weight" and "flash").
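As a minimal sketch of that last step (the co-occurrence counts here are invented for illustration), cosine similarity over context-count vectors already behaves as described:

```python
import numpy as np

# Hypothetical co-occurrence vectors: each entry counts how often the word
# appears near some context word in the corpus
vectors = {
    "focus":   np.array([12.0, 5.0, 0.0, 1.0]),
    "details": np.array([10.0, 6.0, 1.0, 0.0]),
    "weight":  np.array([0.0, 1.0, 9.0, 7.0]),
    "flash":   np.array([1.0, 0.0, 2.0, 8.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words sharing contexts score high; words from different clusters score low
sim_fd = cosine(vectors["focus"], vectors["details"])   # high
sim_fw = cosine(vectors["focus"], vectors["weight"])    # low
```

In a real word space model the vectors would have thousands of context dimensions (hence the dimensionality reduction mentioned above), but the similarity computation is the same.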

Hope this helps!

Bjerva
4

Re your comment:

  1. Classification through machine learning is used in NLP all the time.
  2. Regarding semantic similarity between concepts, see Dekang Lin's information-theoretic definition of similarity.

Please also see these questions: finding related words, semantic similarity of two phrases.

Yuval F
2

I saw word2vec on HackerNews a couple of weeks ago, looks pretty close to what you want.

Alix Axel
2

Take a look at Latent Semantic Indexing (http://en.wikipedia.org/wiki/Latent_semantic_indexing); it specifically addresses your problem. However, you need to come up with some way to correlate these meta-concepts with either positive or negative sentiment. Sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis) should help you with that.
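As a rough illustration of what LSI does, here is a minimal LSA sketch using numpy's SVD; the term-document counts are invented for the example:

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents)
terms = ["image", "quality", "noise", "sharpness", "weight", "flash"]
X = np.array([
    [2, 1, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Truncated SVD: keep the k strongest latent "concepts"
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = {t: U[i, :k] * s[:k] for i, t in enumerate(terms)}

def cos(a, b):
    """Cosine similarity between two latent-space term vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Terms that occur in the same documents end up close together in the k-dimensional latent space, so `cos(term_vecs["image"], term_vecs["quality"])` comes out much higher than `cos(term_vecs["image"], term_vecs["flash"])`.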

Vlad
    Here is a good resource for really learning LSI (if you are willing to put in some work) http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf – bernie2436 Sep 23 '14 at 02:24
2

Word space is definitely the way to go here. If LSA is too slow for your application, and if the semantics in random indexing are too shallow, then you should consider api.cortical.io. This REST API can give you the semantic fingerprint representation of any word. The semantic fingerprint contains all the different contexts to which the word belongs. You can disambiguate any word with one call: "organ" returns (muscle, piano, church, membership, ...), and for each of the contexts you can get contextual terms: "piano" will give (organ, clarinet, violin, flute, cello, compositions, harpsichord, orchestral).

Concerning your last point, these semantic fingerprints are fully language independent. Currently the cortical.io API covers English, Spanish, French, German, Danish, Arabic, Russian, and Chinese. More languages are to be published by the end of 2014.

0

You might want to take a look at the book Opinion Mining and Sentiment Analysis. If you are only interested in the similarity of words and phrases, this survey paper may help you: From Frequency to Meaning: Vector Space Models of Semantics.

ephes
0

One effective approach to determining semantic similarity between two texts is to use embeddings together with cosine similarity. The Sentence Transformers Python framework provides transformer models that produce state-of-the-art sentence embeddings designed for exactly this purpose. Below is an example:

import sentence_transformers
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'all-mpnet-base-v2'

# Instantiate a SentenceTransformer model
model = sentence_transformers.SentenceTransformer(model_name)

# Encode the input texts to obtain their embeddings
embeddings1 = model.encode('text1')
embeddings2 = model.encode('text2')

# Calculate the cosine similarity between the embeddings;
# cosine_similarity expects 2-D arrays, so wrap each embedding in a list
similarity_score = cosine_similarity([embeddings1], [embeddings2])[0][0]

For more information, see the Sentence Transformers documentation.