Let’s say I have an array of strings that I need to group into clusters. I am currently doing the analysis using n-grams, which gives clusters like these:
Cluster 1:
- Pipe fixing
- Pipe fixing in Las Vegas
- Movies about Pipe fixing
Cluster 2:
- Classical music
- Why classical music is great
- What is classical music
etc.
Let’s say within this array I have these two strings of text (among others):
- Japanese students
- Students from Japan
Now, the n-gram method will obviously not put these two strings together, because they do not share the same token sequence. I tried Damerau-Levenshtein distance and TF-IDF, but both pick up too much unrelated noise. Which other techniques can I use to recognize that these two strings belong in the same cluster?
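
To make the problem concrete, here is a minimal sketch of the kind of comparison involved. It assumes word-level bigrams compared with Jaccard overlap, which may not match the exact n-gram setup in use; the function names `word_ngrams` and `jaccard` are made up for illustration.

```python
def word_ngrams(text, n=2):
    """Return the set of lowercase word n-grams in text (assumed tokenization: whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard overlap between the word n-gram sets of two strings."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Neither string is long enough to produce an n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Strings that share a phrase overlap on the bigram ("pipe", "fixing").
print(jaccard("Pipe fixing", "Pipe fixing in Las Vegas"))

# Same meaning, different wording and word order: no shared bigram at all.
print(jaccard("Japanese students", "Students from Japan"))
```

Under these assumptions the first pair scores 0.25 and the second scores 0.0, so the two "Japan" strings look completely unrelated even though they mean the same thing.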