
I have a huge dataset of words word_i and weights weight[i,j], where weight[i,j] is the "connection strength" between word_i and word_j.

I'd like to binarize this data, and I want to know if there is any existing algorithm that assigns a binary code to each word in such a way that the Hamming distance between two words' codes correlates with their weight.

Added:
The problem I am working on is that I want to teach a neural net or SVM to make associations between words, and that's why I've decided to binarize the data. Don't ask why I don't want to use Markov models or plain graphs; I've tried them and want to compare them with neural nets.

So,

  1. I want my NN, given a word "a", to return its closest association, or a set of words and their probabilities,

  2. I've tried simply binarizing the words and using the concatenation "ab" as input with the weight as the target output; this worked badly,

  3. I was thinking of using a threshold on the weights to decide when to change one more bit: the smaller this threshold, the more bits you need (a sketch of this idea follows the list),

  4. I have a situation where a->b has weight w1 and b->a has weight w2 with w1 >> w2, so direction is significant.
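A minimal sketch of how point 3 could look in practice, assuming the weights are a K×K numpy array: map each normalized weight to a target Hamming distance, then greedily flip bits to approach those targets. The weight-to-distance mapping, the greedy search, and all names here are illustrative assumptions, not an established algorithm. Note also that Hamming distance is symmetric, so the asymmetric weights from point 4 have to be symmetrized first.

    import numpy as np

    def target_distances(weights, n_bits):
        # Illustrative assumption: stronger connection -> smaller Hamming distance.
        w = (weights + weights.T) / 2.0       # Hamming distance is symmetric
        w = w / w.max()                       # normalize weights to [0, 1]
        return np.round((1.0 - w) * n_bits)   # weight 1 -> 0 differing bits

    def greedy_codes(weights, n_bits, iters=20000, seed=0):
        rng = np.random.default_rng(seed)
        k = weights.shape[0]
        codes = rng.integers(0, 2, size=(k, n_bits))  # random initial codes
        target = target_distances(weights, n_bits)
        np.fill_diagonal(target, 0)           # a word is at distance 0 from itself

        def cost(c):
            # brute-force pairwise Hamming distances; fine for small K
            ham = (c[:, None, :] != c[None, :, :]).sum(axis=2)
            return ((ham - target) ** 2).sum()

        best = cost(codes)
        for _ in range(iters):
            i, j = rng.integers(k), rng.integers(n_bits)
            codes[i, j] ^= 1                  # flip a random bit
            c = cost(codes)
            if c <= best:
                best = c                      # keep flips that don't hurt
            else:
                codes[i, j] ^= 1              # revert the flip
        return codes

Each row of the result is one word's code; how well the distances correlate with the weights depends heavily on n_bits and on how symmetric the weights really are.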

Ivri
  • Correlates how strongly? Does the binary code need to be as verbose as possible, or can it be whatever? Why are you doing this? – Hamish Grubijan Feb 19 '10 at 19:05
  • Please clarify your question. Binarize == serialize? Do you need an algorithm to compute the Hamming distance? What exactly is the problem? – Yuval Adam Feb 19 '10 at 19:06
  • From what you've said, I'm guessing that you want to replace each word with a binary string such that the Hamming distance is equal to the weight. Correct? This would imply that the weight is an integer, also correct? – Justin Peel Feb 19 '10 at 19:41
  • I don't need serialization and I know how to calculate Hamming distance. I want to know if there is any algorithm to binarize data in such a way that HD(a,b) correlates with weight(a,b) – Ivri Feb 19 '10 at 20:20
  • Cool problem, man. I look forward to seeing the answer. What's this for, anyway? – Paul Nathan Feb 19 '10 at 23:20

1 Answer


What you can do is use a self-organizing map (SOM) whose topology is the set of fixed-length N-bit words, so that e.g. if N=8 then every cell in the SOM has exactly 8 neighbors (the cells whose codes differ in exactly one bit). Now if you have K [dictionary] words you can encode every [dictionary] word as a vector of real numbers between 0..1 so that the ith word has the ith element set to 1 and the others to 0. You can then calculate the "distance" between two arbitrary vectors a1...aK and b1...bK by summing, over all pairs i, j,

    a_i * b_j * d(word_i, word_j)

where d(word_i, word_j) is the word-level distance you derive from your weights; this sum is the distance metric for running the SOM algorithm. When the SOM has stabilized, [dictionary] words near to each other in your metric are near to each other in the topology of the map, from which you get the encoding trivially as [binary] words.

Note that the map must have more cells than there are words, i.e. 2**N > K.

This answer of course assumes background with self-organizing maps. See http://en.wikipedia.org/wiki/Self-organizing_map
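A rough sketch of this idea in Python, again assuming the weights are a K×K numpy array: the cells are the 2**N corners of an N-bit hypercube, the inputs are the one-hot word vectors, and with a one-hot input the metric above reduces to a weighted sum of word distances. The conversion from weights to word distances (d = max(w) - w), the learning-rate and neighborhood schedules, and all names are assumptions for illustration.

    import numpy as np
    from itertools import product

    def som_codes(weights, n_bits, epochs=200, lr0=0.5, seed=0):
        rng = np.random.default_rng(seed)
        k = weights.shape[0]
        d = weights.max() - weights           # assumed: strong link -> small distance
        cells = np.array(list(product([0, 1], repeat=n_bits)))  # all 2**N bit codes
        assert len(cells) > k, "need 2**N > K"
        proto = rng.random((len(cells), k))   # one prototype vector per cell
        # map topology: Hamming distance between the cells' bit codes
        cell_dist = (cells[:, None, :] != cells[None, :, :]).sum(axis=2)

        for epoch in range(epochs):
            lr = lr0 * (1.0 - epoch / epochs)                    # decaying learning rate
            sigma = max(n_bits * (1.0 - epoch / epochs), 0.5)    # shrinking neighborhood
            for i in rng.permutation(k):      # present each word once per epoch
                x = np.zeros(k)
                x[i] = 1.0                    # one-hot input for word i
                # with one-hot x, sum_{i,j} x_i p_j d(i,j) is just proto @ d[i]
                bmu = int(np.argmin(proto @ d[i]))   # best-matching cell
                h = np.exp(-(cell_dist[bmu] ** 2) / (2.0 * sigma ** 2))
                proto += lr * h[:, None] * (x - proto)  # pull neighborhood toward x

        # read off the encoding: each word takes the bit code of its best cell
        return {i: cells[int(np.argmin(proto @ d[i]))] for i in range(k)}

Since there are 2**N cells, this brute-force version only scales to small N; for a huge dictionary it is a proof of concept rather than a practical implementation.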

Antti Huima
  • Re: "Now if you have K [dictionary] words you can encode every [dictionary] word as a vector of real numbers between 0..1 so that the ith word has the ith element set to 1 and the others to 0." Does this mean I'll have number of bits == number of words? Wouldn't it be better to use a general binary encoding, with number of bits = log(N), where N = number of dictionary words? Or do you mean to encode it as a matrix (N*N), where wij is weight(ai, bj)? In that case, with every row as an example for the SOM, the number of examples == the number of variables. Thank you! – Ivri Feb 22 '10 at 22:21
  • I've solved the problem using MDS (multi-dimensional scaling) – Ivri May 05 '10 at 21:40
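Ivri doesn't say how MDS was applied, but here is one plausible sketch with scikit-learn: treat (max weight - weight), symmetrized, as a dissimilarity matrix, embed the words into n_bits real dimensions with sklearn.manifold.MDS, then threshold each coordinate at its median to get one bit per dimension. The weight-to-dissimilarity conversion and the median thresholding are my assumptions; the MDS call itself is the standard scikit-learn API.

    import numpy as np
    from sklearn.manifold import MDS

    def mds_codes(weights, n_bits, seed=0):
        d = weights.max() - weights           # assumed: strong link -> small dissimilarity
        d = (d + d.T) / 2.0                   # MDS needs a symmetric matrix
        np.fill_diagonal(d, 0.0)              # a word is at distance 0 from itself
        mds = MDS(n_components=n_bits, dissimilarity="precomputed", random_state=seed)
        coords = mds.fit_transform(d)         # one real-valued coordinate per bit
        return (coords > np.median(coords, axis=0)).astype(int)  # one bit per axis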