
What I'm looking to do is group together strings from a fiction website. The titles of the posts are generally in a format something like:

titles = ['Series Name: Part 1 - This is the chapter name',
    '[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
    "[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
    '{OC} Another cool story - Part I - This is the chapter name',
    '{OC} another cool story: part II: another post title',
    '{OC} another cool story part III but the author forgot delimiters',
    "this is a one-off story, so it doesn't have any friends"]

Delimiters etc. aren't always there, and there can be some variation.

I'd start by normalizing the string to just alphanumeric characters.

import re
from pprint import pprint as pp

titles = []  # from above

normalized = []
for title in titles:
    title = re.sub(r'\bOC\b', '', title)
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)

pp(normalized)

which gives

['Series Name Part 1 This is the chapter name',
 'Series Name Part 2 Another name with the word chapter and extra oc at the start',
 "Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
 'Another cool story Part I This is the chapter name',
 'another cool story part II another post title',
 'another cool story part III but the author forgot delimiters',
 "this is a one off story so it doesn't have any friends"]

The output I'm hoping for is:

['Series Name', 
'Another cool story', 
"this is a one-off story, so it doesn't have any friends"]  # last element optional

I know of a few different ways to compare strings...

difflib.SequenceMatcher.ratio()

Levenshtein edit distance

I've also heard of Jaro-Winkler and FuzzyWuzzy.

But all that really matters is that we can get a number showing the similarity between the strings.
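
For example, difflib gives a number like that out of the box. A quick sketch with two of the normalized titles from above:

from difflib import SequenceMatcher

a = 'Series Name Part 1 This is the chapter name'
b = 'Series Name Part 2 Another name with the word chapter and extra oc at the start'

# ratio() returns a float in [0, 1]; higher means more similar
print(SequenceMatcher(None, a.lower(), b.lower()).ratio())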

I'm thinking I need to come up with (most of) a 2D matrix comparing each string to each other. But once I've got that, I can't wrap my head around how to actually separate them into groups.
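
Plugging that ratio() call into the matrix idea, here's about as far as I've gotten (SequenceMatcher is just the metric I reached for first; any of the others could slot in the same way):

from difflib import SequenceMatcher

# fill in (most of) a 2D similarity matrix over the normalized titles
n = len(normalized)
matrix = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        score = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        matrix[i][j] = matrix[j][i] = score

pp(matrix)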

I found another post that seems to have done the first part... but then I'm not sure how to continue from there.

scipy.cluster looked promising at first... but then I was in way over my head.

Another thought was somehow combining itertools.combinations() with functools.reduce() with one of the above distance metrics.

Am I way overthinking things? It seems like this should be simple, but it's just not making sense in my head.

Tom L
  • something that uses Levenshtein edit distance or SimHash might work for you if you are trying to see how similar two text based objects are. I'm a bit confused at what you want to join (if you can provide a manual example). – Aleka May 08 '20 at 04:05
  • If you are having trouble with capitalization or punctuation, you could turn everything lowercase and remove all punctuation. – Aleka May 08 '20 at 04:12
  • My example may not have been clear enough. I'm parsing through thousands of titles and the punctuation can vary much more wildly than in my little example above. I do like the suggestion of normalizing the string (I'd probably strip/replace everything down to all lowercase alphanumeric titles) but that still leaves the question of once I have a set of normalized strings, how do I extract the series and what titles are in each – Tom L May 08 '20 at 04:23
  • You are trying to join similar titles while removing redundant info from the final result? – Aleka May 08 '20 at 04:25

2 Answers


This is an implementation of the ideas put forth in CKM's answer: https://stackoverflow.com/a/61671971/42346

First take out the punctuation -- it's not important to your purpose -- using this answer: https://stackoverflow.com/a/15555162/42346

Then we'll use one of the techniques described here: https://blog.eduonix.com/artificial-intelligence/clustering-similar-sentences-together-using-machine-learning/ to cluster similar sentences.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+') # only alphanumeric characters

lol_tokenized = []
for title in titles:
    lol_tokenized.append(tokenizer.tokenize(title))

Then get a numeric representation of your titles:

import numpy as np 
from gensim.models import Word2Vec

# train a small Word2Vec model on the tokenized titles
# (note: in gensim 4+ `size` is called `vector_size` and word vectors
# are looked up via m.wv[w] instead of m[w])
m = Word2Vec(lol_tokenized, size=50, min_count=1, cbow_mean=1)

def vectorizer(sent, m):
    # average the word vectors of a sentence into one fixed-length vector
    vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                vec = m[w]
            else:
                vec = np.add(vec, m[w])
            numw += 1
        except Exception as e:
            print(e)
    return np.asarray(vec) / numw

l = []
for i in lol_tokenized:
    l.append(vectorizer(i,m))

X = np.array(l)

Whoo-boy that was a lot.
Now you have to do the clustering.

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2, init='k-means++', n_init=100, random_state=0)  # 2 clusters: duplicate-looking titles vs. not
labels = clf.fit_predict(X)
print(labels)
for index, sentence in enumerate(lol_tokenized):
    print(str(labels[index]) + ":" + str(sentence))

[1 1 0 1 0 0 0]
1:['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
1:['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
0:['OC', 'Series', 'Name', 'part', '3', 'punctuation', 'could', 'be', 'not', 'matching', 'so', 'we', 'can', 't', 'always', 'trust', 'common', 'substrings']
1:['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
0:['OC', 'another', 'cool', 'story', 'part', 'II', 'another', 'post', 'title']
0:['OC', 'another', 'cool', 'story', 'part', 'III', 'but', 'the', 'author', 'forgot', 'delimiters']
0:['this', 'is', 'a', 'one', 'off', 'story', 'so', 'it', 'doesn', 't', 'have', 'any', 'friends']

Then you can pull out the ones whose cluster label is 1:

for index, sentence in enumerate(lol_tokenized): 
    if labels[index] == 1: 
        print(sentence) 

['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
mechanical_meat
  • I am so utterly lost, I'll have to read up on... most of the functions, heh. vectorizer() in particular confuses me, but I've never seen an implementation of something like that. But thank you so much for writing this up for me! The more I look at it, the more it makes sense... – Tom L May 08 '20 at 06:32
  • You're very welcome. I'm particularly interested in how this performs for your full dataset so if you'd come back and tell us how it went I'd appreciate it. Best of luck. – mechanical_meat May 08 '20 at 06:40
  • I was interested in that answer as well. I noticed that `OC Series Name Part 3` was clustered to cluster `1` when parts 1 and 2 are in cluster `0`. @mechanical_meat Do you think it is related to the `This is the chapter name` part? If so, maybe cutting the title after some words will give better matches? – theFrok May 08 '20 at 07:04
  • The only thing I see left to do (and please don't feel like you have to, you gotta leave something for me to do, lol) is to cut those 'center' titles down to just the series title instead of the whole string. I think I can do that simply by comparing it with other strings and finding common substrings starting at the beginning. – Tom L May 08 '20 at 07:04
  • Also, is there a way to adapt this solution to a case where you don't know the number of titles before-hand? – theFrok May 08 '20 at 07:05
  • @theFrok where do you see which cluster they are clustered to? From what I could read of it, the 'index == 1' were the best representatives of each cluster. – Tom L May 08 '20 at 07:08
  • The label is the index of the class, here there are only 2 clusters so the labels are `0/1`. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict – theFrok May 08 '20 at 07:19
  • I made a couple of adjustments because the outcome wasn't deterministic. Now it comes up with something different for the first group of titles. Thanks to you both for your comments. – mechanical_meat May 08 '20 at 08:18
  • @TomL: I'm not quite following about the 'center' titles. – mechanical_meat May 08 '20 at 08:24
  • @theFrok: I don't think this is dependent on knowing the number of titles beforehand. What makes you think that it is? – mechanical_meat May 08 '20 at 08:25
  • @mechanical_meat: I meant the number of series titles like `Another cool story`, which translates to the number of clusters in the solution. You need to decide the number of clusters before you use `KMeans`, but what should you do if you don't have that info? – theFrok May 08 '20 at 14:37
  • @theFrok: the reason I used 2 clusters is because it's either a duplicate-looking title or not one. I could be wrong in my approach, though, so seeing more data would be helpful. I'll try to get a larger dataset from the OP. Thanks again for your comment. – mechanical_meat May 08 '20 at 14:43
  • You described what I meant by 'center' titles when you said duplicate-looking or not. – Tom L May 08 '20 at 15:40

Your task falls into what is known as semantic similarity. I propose you proceed as follows:

  1. Get a mapping of your strings through GloVe/Word2vec or a popular BERT model. This will give you a numeric representation of your titles.
  2. Then perform clustering, starting with scikit-learn's k-means; from there you can move on to more advanced clustering methods. A rough sketch of both steps follows.
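
A minimal sketch of those two steps, using the sentence-transformers package as one convenient stand-in for the BERT option (the model name and n_clusters=2 are placeholder choices you'd tune for your data):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# `titles` is the list of raw titles from the question
model = SentenceTransformer('all-MiniLM-L6-v2')  # any sentence-embedding model works here
X = model.encode(titles)                         # one vector per title

labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
for label, title in zip(labels, titles):
    print(label, title)
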
CKM