The problem:
I'm trying to find a common (most common, or even 'average') phrase across a number of strings. The structure across the various strings is very bad and full of inconsistencies. Many strings have unique bits added to them that are irrelevant for the wanted output: a new string that acts as a sort of summary of the set of strings.
To demonstrate, I have provided a small example; the actual data is more complex and consists of many more strings:
Example data:
1. Small house, red roof, 2 windows, no garden
2. Big house, red roof, 2 windows, garage with driveway
3. Small house, red roof, 2 windows, nice view
The desired output should be close to:
Small house, red roof, 2 windows
What I've tried and challenges:
On a smaller dataset with quite a lot more structure I have relied on word counts before:
words = df['Phrases'].str.split(expand=True).stack().value_counts()
words = words.reset_index()

summary = ""
for i in range(3):
    summary += f"{words['index'][i]} "
On this simpler dataset this worked by just taking the n most common phrases and resulted in useful summaries.
Looking at similar questions (e.g. New string from multiple strings with most common words or Find the most common String in ArrayList()) there are quite a few similarities between them. Either the 'common phrase' exists in all provided strings, or there is a set threshold for word occurrences. Neither of these is the case here.
Another thing I tried is using intersections:
phrases = []
for phrase in df['Phrases']:
    phrases.append(phrase.split())

def intersect(list1, list2):
    return list(set(list1) & set(list2))

print(intersect(phrases[0], phrases[1]))
Using the example data this would then print:
house red roof 2 windows
A problem with intersections over more than two lists is that each additional list/string can only shrink the result further. With just enough variance in the phrases you quickly end up with an empty intersection across all of them.
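To illustrate the shrinking behaviour, here is a minimal sketch that intersects all word sets at once (the example phrases are hard-coded here instead of coming from df['Phrases']):

```python
from functools import reduce

# Hard-coded example data; in reality these come from df['Phrases']
phrases = [
    "Small house, red roof, 2 windows, no garden",
    "Big house, red roof, 2 windows, garage with driveway",
    "Small house, red roof, 2 windows, nice view",
]

# One set of words per string (commas stripped)
word_sets = [set(p.replace(",", "").split()) for p in phrases]

# Intersect all sets: every extra string can only remove words
common = reduce(lambda a, b: a & b, word_sets)
print(sorted(common))  # ['2', 'house', 'red', 'roof', 'windows']
```

With this toy data the result is still useful, but note that 'Small' is already gone because string 2 says 'Big'; with more strings the set keeps shrinking.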
Challenges:
- Not guaranteed there is one common phrase across all strings.
- Phrases not guaranteed to be ordered consistently.
- Summary string length not fixed.
- Words often (but not always) come in groups: e.g. big/small house
Possible solutions:
One thing I want to try but don't know how to tackle properly is converting word counts to percentages. This would eliminate a fixed word-occurrence threshold, but would still require an unknown percentage threshold, and word pairs/groups still wouldn't be captured. Therefore this would be easier to scale up but is likely not a proper solution.
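A rough sketch of that percentage idea, with the example phrases hard-coded and a hand-picked threshold of 60% (the threshold value is exactly the unknown part):

```python
from collections import Counter

# Hard-coded example data; in reality these come from df['Phrases']
phrases = [
    "Small house, red roof, 2 windows, no garden",
    "Big house, red roof, 2 windows, garage with driveway",
    "Small house, red roof, 2 windows, nice view",
]

# Count in how many strings each word appears (document frequency)
doc_freq = Counter()
for p in phrases:
    doc_freq.update(set(p.replace(",", "").split()))

# Keep words that occur in at least 60% of the strings (guessed threshold)
threshold = 0.6
keep = {w for w, c in doc_freq.items() if c / len(phrases) >= threshold}

# Preserve the word order of the first phrase for readability
summary = " ".join(w for w in phrases[0].replace(",", "").split() if w in keep)
print(summary)  # Small house red roof 2 windows
```

Unlike the strict intersection, 'Small' survives here because it appears in 2 of 3 strings, but the result still depends entirely on the chosen threshold.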
Another idea would be to implement some form of approximate string matching, but that seems to work in only one direction: expressing the similarity between two strings. It doesn't produce a new string that has the highest similarity to all given strings.
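For example, with difflib from the standard library I can only score a pair of strings, not generate a candidate summary:

```python
from difflib import SequenceMatcher

a = "Small house, red roof, 2 windows, no garden"
b = "Small house, red roof, 2 windows, nice view"

# ratio() expresses how similar two strings are (0.0 to 1.0),
# but it does not construct a new 'most similar to all' string
similarity = SequenceMatcher(None, a, b).ratio()
print(similarity)
```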