
The problem:

I'm trying to find a common (most common, or even 'average') phrase across a number of strings. The structure of the various strings is poor and full of inconsistencies. Many strings have unique bits added to them that are irrelevant to the desired output: a new string that acts as a sort of summary of the set of strings.

To demonstrate, I have provided a small example; the actual data is more complex and consists of many more strings:

Example data:

1. Small house, red roof, 2 windows, no garden
2. Big house, red roof, 2 windows, garage with driveway
3. Small house, red roof, 2 windows, nice view 

The desired output should be close to:

Small house, red roof, 2 windows

What I've tried and challenges:

On a smaller dataset with quite a lot more structure, I have relied on word counts before:

# Count occurrences of every word across all phrases.
words = df['Phrases'].str.split(expand=True).stack().value_counts()
words = words.reset_index()

# Build a summary from the n most common words (n = 3 here).
summary = ""
for i in range(3):
    summary += f"{words['index'][i]} "

On this simpler dataset, just taking the n most common words worked and resulted in useful summaries.

Looking at similar questions (e.g. New string from multiple strings with most common words or Find the most common String in ArrayList()), there are quite some similarities between them: either the 'common phrase' exists in all provided strings, or there is a set threshold for word occurrences. Neither is the case here.

Another thing I tried is using intersections:

# Tokenize each phrase into a list of words.
phrases = []
for phrase in df['Phrases']:
    phrases.append(phrase.split())

# Words that appear in both lists (order is lost, since sets are unordered).
def intersect(list1, list2):
    return list(set(list1) & set(list2))

print(intersect(phrases[0], phrases[1]))

Using the example data, this would then print the common words (in no particular order):

house red roof 2 windows

A problem with intersections over more than two lists is that each extra list/string can only remove words, never add them. With just enough variance in the phrases, you quickly end up with an empty intersection across all of them.
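To illustrate (a minimal sketch, reusing the phrases list built above):

# Intersect every tokenized phrase: each extra string can only shrink the set.
common = set(phrases[0]).intersection(*phrases[1:])
print(common)
# With the three example strings this is still roughly
# {'house,', 'red', 'roof,', '2', 'windows,'}; on messier real data it
# quickly ends up empty.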

Challenges:

  • Not guaranteed there is one common phrase across all strings.
  • Phrases not guaranteed to be ordered consistently.
  • Summary string length not fixed.
  • Words often (but not always) come in groups, e.g. big/small house.

Possible solutions:

One thing I want to try, but don't know how to tackle properly, is using word counts expressed as percentages. This would eliminate a fixed word-occurrence threshold, but would still need an unknown percentage threshold, and word pairs/groups also won't be captured. Therefore this would be easier to scale up, but likely not a proper solution.
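A rough sketch of that idea (assuming the same df['Phrases'] column; the 50% cut-off is purely illustrative and would need tuning):

from collections import Counter

# For each word, count in how many strings it appears (document frequency).
doc_count = Counter()
for phrase in df['Phrases']:
    doc_count.update(set(phrase.split()))

n = len(df['Phrases'])
threshold = 0.5  # illustrative percentage threshold
keep = [word for word, count in doc_count.items() if count / n >= threshold]
print(keep)

As said, this still needs an arbitrary threshold and loses word groups and ordering.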

Another idea would be to implement some form of approximate string matching, but that seems to work only one way: expressing the similarity between two strings. It doesn't produce a new string that has the highest similarity to all given strings.
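For reference, the pairwise version is easy enough with difflib from the standard library, but as noted it only scores two strings against each other:

from difflib import SequenceMatcher

# Similarity ratio (0.0 to 1.0) between two phrases; no summary string comes out of this.
print(SequenceMatcher(None, df['Phrases'][0], df['Phrases'][1]).ratio())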

SB18

1 Answer


This is not a complete answer, but might give you some ideas.

  1. Splitting the strings into comma-separated lists of parts is an obvious first step, given the example.
  2. From there, if the input is sufficiently structured, you could zip across the input lists by index, i.e. on the example provided, [["Small house", "Big house", "Small house"], ["red roof", "red roof", "red roof"], ...]. Working on that list, take the mode element of each sublist (e.g. provided it precisely matches at least 30% of the inputs; otherwise either ignore the sublist or break it down further, e.g. into words, and look for the most common word(s)). A sketch of this is below the list.
  3. If the input lacked sufficient structure, you could consider zipping each tokenized input list with the naturals (0, 1, 2, ...). This would give you a way to express "how early on" in the input list a token is likely to appear, and therefore impose an order upon the tokens (or words thereof) that are eventually included in the output.
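A minimal sketch of points 1 and 2 on the question's example data (the 50% cut-off and variable names are illustrative; on a larger dataset something like the 30% mentioned above may fit better):

from collections import Counter
from itertools import zip_longest

strings = [
    "Small house, red roof, 2 windows, no garden",
    "Big house, red roof, 2 windows, garage with driveway",
    "Small house, red roof, 2 windows, nice view",
]

# 1. Split each string into its comma-separated parts.
parts = [[p.strip() for p in s.split(",")] for s in strings]

# 2. Zip the lists by index and keep the mode of each position
#    if it covers enough of the inputs (cut-off is a tuning choice).
threshold = 0.5
summary = []
for column in zip_longest(*parts):
    values = [v for v in column if v is not None]
    mode, count = Counter(values).most_common(1)[0]
    if count / len(strings) >= threshold:
        summary.append(mode)

print(", ".join(summary))  # -> Small house, red roof, 2 windows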
moonGoose
  • I agree with the first point, but if the data is not structured it would be better to create a dataset of documents from the comma-separated data and use bigrams on them. – Kartikey Singh Apr 04 '19 at 10:10