
Let’s say I have an array of strings and I need to sort them into clusters. I am currently doing the analysis using n-grams, e.g.:

Cluster 1:

  • Pipe fixing
    • Pipe fixing in Las Vegas
    • Movies about Pipe fixing

Cluster 2:

  • Classical music
    • Why classical music is great
    • What is classical music

etc.

Let’s say within this array I have these two strings of text (among others):

  • Japanese students
  • Students from Japan

Now, the n-gram method will obviously not put these two strings together, as they do not share the same tokenized structure. I tried Damerau-Levenshtein distance calculation and TF-IDF, but both pick up too much noise from the surrounding text. What other techniques can I use to recognize that these two strings belong in a single cluster?

The Whiz of Oz
2 Answers


You can use a simple bag-of-words representation of the phrases, taking both unigrams and bigrams (possibly after stemming), putting them into a feature vector, and then measuring the similarity between vectors with, e.g., the cosine. This is meant for longer documents, but it may work well enough for your purposes.
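
As a concrete sketch of that idea (an illustration, assuming scikit-learn is available), using CountVectorizer over unigrams and bigrams plus cosine similarity:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ['Japanese students', 'Students from Japan']

# Bag-of-words over unigrams and bigrams; a stemmer could be plugged
# in via the tokenizer argument if desired.
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
vectors = vectorizer.fit_transform(phrases)

print(cosine_similarity(vectors[0], vectors[1])[0][0])

On this particular pair the only overlap is the shared token "students", so the similarity stays low; that limitation is what the distributed representations below address.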

A more sophisticated technique is to train a distributed bag-of-words model from a corpus of documents, and then use it to find similarities in pairs of documents.
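
A minimal sketch of that approach, assuming gensim's Doc2Vec in its distributed bag-of-words mode (dm=0); the toy corpus here is only a placeholder, since a useful model needs a much larger document collection:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; train on a real document collection in practice.
corpus = [
    TaggedDocument(words=['japanese', 'students'], tags=['d0']),
    TaggedDocument(words=['students', 'from', 'japan'], tags=['d1']),
]

# dm=0 selects the distributed bag-of-words (PV-DBOW) training mode.
model = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)

# Infer vectors for two phrases and compare them with the cosine.
v1 = model.infer_vector(['japanese', 'students'])
v2 = model.infer_vector(['students', 'from', 'japan'])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))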

[edit]

You can also build on pre-trained word vectors via word2vec. For example, in Python with the gensim library and the pre-trained word2vec Google News model:

from gensim.models import KeyedVectors

# gensim 4.x API; n_similarity compares two word sets via their mean vectors.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print(model.n_similarity(['students', 'Japan'], ['Japanese', 'students']))

Output:

0.8718219720170907
vpekar

You have a normalization problem. String equivalence drives your matching algorithms, and "Japan" and "Japanese" are not string equivalent. Several options:

1) Normalize tokens into a root form, so that "Japanese" is mapped to "Japan" or similar. Normalization has its own problems: you don't want "Jobs" normalized to "Job" when the text is about "Steve Jobs". The Porter stemmer and other morphological tools can help with this; see the first sketch after this list.

2) Use character n-grams for your string equivalence. With 3-5 grams, there would be instances of "Japan" in both phrases to cluster around. I am a big fan of this for classification, less sure for clustering; see the second sketch below.

3) Use latent techniques, like Latent Dirichlet Allocation, to help cluster. Roughly speaking, you associate "Japan" with "Japanese" via other words strongly associated with both, like "Tokyo"; see the third sketch below.
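
For option 1, a minimal sketch with NLTK's Porter stemmer (assumed available; note that Porter conflates inflectional variants like "fixing"/"fixes" but will not actually reduce "Japanese" to "Japan", so that pair needs a more aggressive normalizer or a lemma dictionary):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflectional variants collapse to a common root...
for word in ['fixing', 'fixes', 'fixed']:
    print(word, '->', stemmer.stem(word))  # all become 'fix'

# ...but stemming is blind to context: 'Jobs' still becomes 'job'.
print(stemmer.stem('Jobs'))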
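
For option 2, a sketch using scikit-learn's character n-gram analyzer (again an assumed illustration): with character 3-5 grams, fragments of "Japan" show up in both phrases, so the vectors overlap even though the word tokens differ:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ['Japanese students', 'Students from Japan']

# 'char_wb' builds character n-grams inside word boundaries only.
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5), lowercase=True)
vectors = vectorizer.fit_transform(phrases)

print(cosine_similarity(vectors[0], vectors[1])[0][0])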
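
For option 3, a toy sketch with gensim's LdaModel (purely illustrative; the associations via words like "Tokyo" only emerge from a sizeable corpus, so the four documents below are placeholders):

from gensim import corpora, models

# Placeholder corpus; real topic associations need many documents.
texts = [
    ['japanese', 'students', 'tokyo'],
    ['students', 'from', 'japan', 'tokyo'],
    ['classical', 'music', 'symphony'],
    ['why', 'classical', 'music', 'is', 'great'],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Two topics: ideally one 'Japan' topic and one 'music' topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Cluster documents by their dominant topic.
for bow in corpus:
    print(lda.get_document_topics(bow))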

Breck