What I'm looking to do is group post titles from a fiction website. The titles generally follow a format something like:
titles = ['Series Name: Part 1 - This is the chapter name',
'[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
"[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
'{OC} Another cool story - Part I - This is the chapter name',
'{OC} another cool story: part II: another post title',
'{OC} another cool story part III but the author forgot delimiters',
"this is a one-off story, so it doesn't have any friends"]
Delimiters etc. aren't always there, and there can be some variation.
I'd start by normalizing the strings down to just alphanumeric characters (plus apostrophes):
import re
from pprint import pprint as pp

titles = []  # from above

normalized = []
for title in titles:
    # Strip the "OC" tag, then collapse anything that isn't a letter,
    # digit, or apostrophe into a single space.
    title = re.sub(r'\bOC\b', '', title)
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)
pp(normalized)
which gives
['Series Name Part 1 This is the chapter name',
'Series Name Part 2 Another name with the word chapter and extra oc at the start',
"Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
'Another cool story Part I This is the chapter name',
'another cool story part II another post title',
'another cool story part III but the author forgot delimiters',
"this is a one off story so it doesn't have any friends"]
The output I'm hoping for is:
['Series Name',
'Another cool story',
"this is a one-off story, so it doesn't have any friends"] # last element optional
I know of a few different ways to compare strings, e.g. difflib.SequenceMatcher.ratio(), and I've also heard of Jaro-Winkler and FuzzyWuzzy. But all that really matters is that I can get a number quantifying how similar two strings are.
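For example, SequenceMatcher already gives me something usable on two of the normalized titles above (just a quick sanity check, not a commitment to any particular metric):

from difflib import SequenceMatcher

a = 'Series Name Part 1 This is the chapter name'
b = 'Series Name Part 2 Another name with the word chapter and extra oc at the start'

# ratio() returns a float in [0, 1]; higher means more similar
print(SequenceMatcher(None, a, b).ratio())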
I'm thinking I need to come up with (most of) a 2D matrix comparing each string to every other string. But once I've got that, I can't wrap my head around how to actually separate the strings into groups.
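Something like this is what I mean by the matrix (a rough sketch, again using SequenceMatcher.ratio() only because it's in the standard library; any of the metrics above could slot in):

import itertools
from difflib import SequenceMatcher

def similarity_matrix(strings):
    # Symmetric matrix of pairwise scores; each pair is only computed once.
    n = len(strings)
    scores = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        score = SequenceMatcher(None, strings[i], strings[j]).ratio()
        scores[i][j] = scores[j][i] = score
    return scores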
I found another post that seems to have done the first part... but then I'm not sure how to continue from there.
scipy.cluster looked promising at first... but then I was in way over my head.
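For what it's worth, this is roughly as far as I got before getting lost (assuming scipy.cluster.hierarchy, turning each similarity into a distance of 1 - ratio, and a threshold I basically guessed at):

import itertools
from difflib import SequenceMatcher
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_titles(strings, threshold=0.6):
    # Condensed distance vector: one entry per pair, distance = 1 - similarity.
    distances = [
        1 - SequenceMatcher(None, a, b).ratio()
        for a, b in itertools.combinations(strings, 2)
    ]
    # Average-linkage hierarchical clustering, cut at the distance threshold.
    links = linkage(distances, method='average')
    return fcluster(links, t=threshold, criterion='distance')

That gives me a cluster label per title, but I have no idea whether the linkage method or the threshold are sensible, which is where I got stuck.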
Another thought was to somehow combine itertools.combinations() and functools.reduce() with one of the above distance metrics.
Am I way overthinking things? It seems like this should be simple, but it's just not making sense in my head.