
Usually one extracts features from text using the bag-of-words approach: counting the words and computing different measures from the counts, for example tf-idf values, as in: How to include words as numerical feature in classification

But my problem is different: I want to extract a feature vector for a single word. I want to know, for example, that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream are also close, as are hot and warm, stone and hard, and so on.

What is this problem called? Can I learn the similarities and features of words just by looking at a large number of documents?

The implementation will not be for English, so I can't use existing English databases.

user1506145
    Your title is misleading. You want to extract _relations_ between words (or rather, _concepts_) from large corpora, not features from single words. With regards to a name for this problem, I'd call it _automatic creation of an ontology from unstructured text_. – jogojapan Feb 13 '13 at 01:15
  • Vector embeddings of words, like word2vec, GloVe or fastText? – user Aug 28 '16 at 13:31
  • Although this question is old, I would like to answer it for future reference. You can use CBOW or Skip-gram word embeddings in the gensim library: pass a text corpus as input, and you get an embedding for every single word. The similarity function then returns the most similar words to your query word, while the dissimilarity function returns the most opposite one. For example: given a political text corpus, if you input the word donald, you will most probably get trump as the most similar word. – Muhammad Saad Feb 14 '20 at 15:17

3 Answers


Hmm, feature extraction (e.g. tf-idf) on text data is based on statistics. You, on the other hand, are looking for sense (semantics). Therefore no method like tf-idf will work for you.

In NLP there are 3 basic levels:

  1. morphological analysis
  2. syntactic analysis
  3. semantic analysis

(a higher number means a harder problem :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with questions like which word is the verb or the noun in a given sentence, ...). Semantic analysis is the most challenging, since it deals with meaning, which is difficult to represent in machines, has many exceptions, and is language-specific.

As far as I understand, you want to know relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and for English there are several as well, but for many less-covered languages it can be a problem to find one ...

xhudik
  • First you explain the difference between syntax and semantics, and then you suggest using a treebank (which is fundamentally about syntax) to extract semantic relations? – jogojapan Feb 13 '13 at 01:18
  • @jogojapan I wasn't really sure what user1506145 wants, in fact. It looks like something in between, so I gave him a clue what it is about; now he should easily be able to find the appropriate literature and figure out whether a treebank is enough for him or whether he needs something more.... Do you see some inconsistency there? – xhudik Feb 13 '13 at 08:05
  • The OP is interested in semantic relations, e.g. "milk IS_RELATED_TO cream", or even "cream IS_MADE_OF milk". A tree bank is about syntactic relations in a given corpus, i.e. it contains information like "'milk' is the direct object of the verb in the sentence 'I bought milk yesterday'". In the first part of the answer you seem to be aware of this difference, but the second part you mix it all together. – jogojapan Feb 13 '13 at 08:16
  • Yep, the answer is far from perfect; you are welcome to write your own version... I wanted to give him a broader picture ... I'm aware that different treebanks encode different information - in some you can find parts of semantics (maybe I'm wrong). But you are right, treebanks are mostly for syntax. – xhudik Feb 13 '13 at 10:10
  • For an example implementation of obtaining semantic similarity, see this [answer](http://stackoverflow.com/a/14638272/1988505) in a related question. – Wesley Baugh Feb 19 '13 at 08:20

user1506145,

Here is a simple idea that I have used in the past. Collect a large number of short documents, like Wikipedia articles. Do a word count on each document. For the ith document and the jth word, let

I = the number of documents,

J = the number of words,

x_ij = the number of times the jth word appears in the ith document, and

y_ij = ln( 1+ x_ij).

Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so that Y = U*D*transpose(V), where U is IxI, D is a diagonal IxJ matrix, and V is JxJ.

You can use (V_j1, V_j2, V_j3, V_j4), the first four entries of the jth row of V, as a feature vector in R^4 for the jth word.
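A minimal NumPy sketch of this recipe (the four toy documents and the cosine-similarity helper are my own additions for illustration; a real run would use thousands of documents):

```python
# Sketch of the LSA-style recipe above: count matrix -> log damping -> SVD.
# The toy documents are illustrative only.
from collections import Counter
import numpy as np

docs = [
    "potatoes are used to make french fries",
    "french fries are fried potatoes",
    "cream is made from milk",
    "milk and cream are dairy products",
]

# Build the I x J count matrix x_ij, then y_ij = ln(1 + x_ij).
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        X[i, col[w]] = c
Y = np.log1p(X)

# SVD: Y = U @ diag(D) @ Vt, so row j of Vt.T represents the jth word.
U, D, Vt = np.linalg.svd(Y, full_matrices=False)
k = 4                     # keep the first 4 latent dimensions
word_vecs = Vt[:k].T      # shape (J, k): one feature vector per word

def similarity(w1, w2):
    """Cosine similarity between two word vectors."""
    a, b = word_vecs[col[w1]], word_vecs[col[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity("milk", "cream"))
print(similarity("milk", "potatoes"))
```

On this toy corpus, "milk" and "cream" appear in exactly the same documents, so their vectors end up much closer than, say, "milk" and "potatoes".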

Hans Scundal

I am surprised the previous answers haven't mentioned word embeddings. A word embedding algorithm produces a word vector for each word in a given dataset. These algorithms infer word vectors from context. For instance, looking at the context in the following sentences, we can say that "clever" and "smart" are somehow related, because the contexts are almost the same.

He is a clever guy
He is a smart guy

A co-occurrence matrix could be constructed to do this, but it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied in the following papers:
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
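For illustration, here is a minimal sketch of such a co-occurrence count (toy sentences and a window of ±1 word; a real corpus needs a far more efficient implementation):

```python
# Sketch: counting word co-occurrences within a +/-1 word window.
# The toy sentences are illustrative only.
from collections import defaultdict

sentences = [
    ["he", "is", "a", "clever", "guy"],
    ["he", "is", "a", "smart", "guy"],
]

cooc = defaultdict(int)
for s in sentences:
    for i, w in enumerate(s):
        # look one position to the left and one to the right
        for j in range(max(0, i - 1), min(len(s), i + 2)):
            if j != i:
                cooc[(w, s[j])] += 1

# "clever" and "smart" share the same neighbours ("a", "guy"),
# which is the signal an embedding algorithm exploits:
print(cooc[("clever", "a")], cooc[("clever", "guy")])
print(cooc[("smart", "a")], cooc[("smart", "guy")])
```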

I have been using Word2Vec for Swedish. It is quite effective at detecting similar words, and it is completely unsupervised.

Implementations can be found in gensim and TensorFlow.

shmsi