First of all, let me briefly describe my task:
There are about 10^9 sets of words, and each set contains about 10^4 words, so storing all of the sets on disk would be very space-consuming. I want to know the number of common words between any two sets, but I don't care what those common words are. I don't even need the exact number; an estimate is enough for me. The key is to find a compact representation of these sets so that they can be stored.
A natural idea is to represent each set of words as a vector and then train a regression neural network whose inputs are two vectors (representing two sets) and whose output is an estimate of the number of common words between the two input sets. The problem is how to represent the sets as vectors. Directly applying the bag-of-words model is clearly impossible, since the dictionary is too large to store. Maybe I can apply some dimensionality reduction to the bag-of-words vectors first to get a smaller representation (see the sketch below). Or maybe I can apply a word embedding to each word and sum the embeddings to represent a set. Is there any advice, or is there any related work already?
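For concreteness, here is a minimal sketch of the "dimensionality reduction on bag of words" idea using feature hashing. The sketch dimension `D`, the use of MD5, and the function names are illustrative assumptions, not anything fixed by the question.

```python
# Minimal feature-hashing sketch (assumed parameters, not a definitive design).
import hashlib
import numpy as np

D = 4096  # assumed sketch dimension; tune for the memory budget

def hash_word(word: str) -> int:
    # Stable hash so the representation is reproducible across runs.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def set_to_vector(words: set) -> np.ndarray:
    """Represent a word set as a fixed-size binary vector (hashed bag of words)."""
    v = np.zeros(D, dtype=np.int32)
    for w in words:
        v[hash_word(w) % D] = 1
    return v

# Two such vectors could be fed to a regression model, or compared directly:
# the dot product roughly tracks the overlap, with hash collisions adding noise.
a = set_to_vector({"apple", "banana", "cherry"})
b = set_to_vector({"banana", "cherry", "durian"})
print(int(a @ b))  # approximate number of common words
```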
- See https://stackoverflow.com/questions/46724680/why-are-word-embedding-actually-vectors – alvas Oct 17 '17 at 06:28
- I believe you're looking for this: https://en.wikipedia.org/wiki/MinHash – Yasen Oct 17 '17 at 08:14
- It does show me another way to view this problem. @Yasen – Yu Gu Oct 17 '17 at 09:20
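Following up on the MinHash suggestion, here is a minimal sketch of how it could estimate the overlap; the signature length `K`, the MD5-based salted hashing, and the helper names are illustrative assumptions. Each set is summarized by `K` minimum hash values, the fraction of matching signature positions estimates the Jaccard similarity J, and the intersection size follows from |A ∩ B| = J·(|A| + |B|) / (1 + J).

```python
# Minimal MinHash sketch (assumed parameters, not a definitive implementation).
import hashlib

K = 128  # number of hash functions / signature length (assumed)

def minhash_signature(words: set) -> list:
    """Summarize a word set as K minimum hash values."""
    sig = []
    for i in range(K):
        salt = str(i).encode("utf-8")
        sig.append(min(
            int(hashlib.md5(salt + w.encode("utf-8")).hexdigest(), 16)
            for w in words
        ))
    return sig

def estimate_common(sig_a, sig_b, size_a, size_b) -> float:
    """Estimate |A ∩ B| from two MinHash signatures and the set sizes."""
    jaccard = sum(x == y for x, y in zip(sig_a, sig_b)) / K
    # From J = |A ∩ B| / |A ∪ B| and |A ∪ B| = |A| + |B| - |A ∩ B|:
    # |A ∩ B| = J * (|A| + |B|) / (1 + J)
    return jaccard * (size_a + size_b) / (1 + jaccard)

a = {"apple", "banana", "cherry", "durian"}
b = {"banana", "cherry", "durian", "elderberry"}
print(estimate_common(minhash_signature(a), minhash_signature(b), len(a), len(b)))
```

This stores only `K` integers per set instead of ~10^4 words, and the estimate gets more accurate as `K` grows. In practice, a library such as datasketch provides an optimized MinHash implementation rather than hashing every word `K` times as this sketch does.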