Finding a common substring has been covered in many questions: given a list of strings, find the substring (typically the longest, or the one with the most words) that is common to all of them. See here:
Longest common substring from more than two strings - Python
My question is, how does one go about finding the longest modal substring in a list of strings? The important stipulation here is that this substring does not necessarily have to appear in all strings in the list.
There is a bit of art to the science here, because the obvious tradeoff is between 1) how many strings the substring should appear in and 2) how long the substring should be. To fix ideas, let's assume that the desired substring has three words (in case of a tie, take the candidate with the most characters, followed by the first instance).
So given the list,
mylist = ["hey how's it going?", "I'm fine thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]
The desired output is,
"how's it going"
If the stipulation was instead two words long then the desired output would be,
"that thing"
since "that thing" is a longer string than "how's it" or "it going".
The outputs shown above are the modal substrings of three and two words, respectively.
EDIT:
Since there is a bounty on this, I will be a little more specific about what a modal substring is.
Modal substring: For a given number of words in the substring (this is needed to uniquely identify the modal substring), the modal substring is the substring that is common to the largest number of strings in the list. If there is a tie (e.g. for a length of three words there are two candidate substrings that both appear in 80% of the strings), then the substring with the greatest character length should be used. If there is still a tie after that (which should be very unlikely but is good to account for), then just take the first one or choose randomly.
A good answer would have a function that returns the modal substring for a given number of words in the substring (where the number of words can be an arbitrary number).
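To make the definition concrete, here is a minimal sketch of what such a function might look like. It is not a definitive implementation: the names `words` and `modal_substring` are made up for illustration, and it assumes that words are separated by whitespace, that leading and trailing punctuation is stripped from each word (so "going?" matches "going"), and that matching is case-sensitive.

```python
import string
from collections import Counter


def words(text):
    """Split on whitespace and strip punctuation from both ends of each word."""
    stripped = (w.strip(string.punctuation) for w in text.split())
    return [w for w in stripped if w]


def modal_substring(strings, n):
    """Return the n-word phrase that occurs in the most strings, or None."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()  # phrase (tuple of words) -> number of strings containing it
    for text in strings:
        tokens = words(text)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # dict.fromkeys drops repeats within one string, keeping first-seen order,
        # so each string contributes at most 1 to a phrase's count.
        for gram in dict.fromkeys(grams):
            counts[gram] += 1
    if not counts:
        return None  # no string has n or more words
    # Rank by number of strings first, then by character length.
    # max() returns the first maximal item, which gives the "first instance" rule.
    best = max(counts, key=lambda g: (counts[g], len(" ".join(g))))
    return " ".join(best)


mylist = ["hey how's it going?", "I'm fine thanks.", "Did you get that thing?",
          "Of course I got that thing, that's why I asked you: how's it going?"]
print(modal_substring(mylist, 3))  # how's it going
print(modal_substring(mylist, 2))  # that thing
```

Note that the returned phrase is rebuilt from the cleaned words, so any punctuation that sat between words in the original strings is not preserved.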
An incredible answer would dispense with the 'given number of words' restriction and instead include a scalar (say \alpha) that manages the tradeoff between substring length (in words) and the number of strings in the list that contain it. Alpha close to 1 would choose a modal substring that is very long (in words) but doesn't necessarily appear in many strings. Alpha close to 0 would choose a modal substring that appears in as many strings in the list as possible, without regard to substring length. I am not really expecting this, though, and will accept an answer that answers the original question.
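One possible way to read the \alpha idea, purely as an illustration: score every candidate phrase of every length with a weighted average of its relative length and its relative frequency. The scoring formula below is an assumption of this sketch, not something fixed by the definition above, and the names `words` and `tradeoff_substring` are hypothetical. Tokenization again assumes whitespace-separated words with surrounding punctuation stripped.

```python
import string
from collections import Counter


def words(text):
    """Split on whitespace and strip punctuation from both ends of each word."""
    stripped = (w.strip(string.punctuation) for w in text.split())
    return [w for w in stripped if w]


def tradeoff_substring(strings, alpha):
    """Pick the phrase maximizing an assumed blend of length and frequency."""
    counts = Counter()  # phrase (tuple of words) -> number of strings containing it
    for text in strings:
        tokens = words(text)
        seen = set()  # phrases already counted for this string
        for n in range(1, len(tokens) + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if gram not in seen:
                    seen.add(gram)
                    counts[gram] += 1
    if not counts:
        return None
    max_words = max(len(g) for g in counts)  # longest candidate, in words
    total = len(strings)

    def score(gram):
        # Assumed score: alpha weights relative length, (1 - alpha) weights
        # the fraction of strings containing the phrase.
        blended = alpha * len(gram) / max_words + (1 - alpha) * counts[gram] / total
        # Character length breaks ties; max() then keeps the first maximal phrase.
        return (blended, len(" ".join(gram)))

    return " ".join(max(counts, key=score))


mylist = ["hey how's it going?", "I'm fine thanks.", "Did you get that thing?",
          "Of course I got that thing, that's why I asked you: how's it going?"]
print(tradeoff_substring(mylist, 0))  # how's it going
print(tradeoff_substring(mylist, 1))
```

With alpha equal to 0, every phrase found in two of the four strings ties on frequency, and the character-length tie-break selects "how's it going". With alpha equal to 1, the result is simply the longest string in words, with punctuation stripped. This brute-force enumeration is quadratic in the number of words per string, which should be acceptable for short strings but would need a smarter structure for long documents.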