I have a list of strings containing the subject lines of different email conversations. I would like to find out whether any words or word combinations are used frequently.
An example list would be:
    subjects = [
        'Proposal to cooperate - Company Name',
        'Company Name Introduction',
        'Into Other Firm / Company Name',
        'Request for Proposal'
    ]
The function would have to detect that "Company Name" is used more than once as a combination, and that "Proposal" is used more than once. These words won't be known in advance, though, so I guess it would have to try all possible combinations.
The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?
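To make "trying all possible combinations" concrete, here is a naive sketch of the idea. It assumes whitespace splitting, case-insensitive matching, and phrases capped at three consecutive words; `count_phrases` and `max_words` are names made up for this illustration, and there may well be a better approach.

```python
import collections

def count_phrases(subjects, max_words=3):
    """Count every run of 1..max_words consecutive words across all subjects."""
    counts = collections.Counter()
    for subject in subjects:
        # Lowercase each token and strip surrounding punctuation; tokens that
        # are pure punctuation ('-' or '/') become empty and are dropped, so
        # the words on either side of them end up adjacent.
        words = [w.strip('\'"-_,.:;!?/').lower() for w in subject.split()]
        words = [w for w in words if w]
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                counts[" ".join(words[start:start + size])] += 1
    return counts

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]
counts = count_phrases(subjects)
# Keep only the phrases that occur more than once.
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
print(repeated)
# {'proposal': 2, 'company': 3, 'name': 3, 'company name': 3}
```

For the example list this reports "company name" and "proposal" as repeated, which is the kind of result I'm after. Note that single words inside a repeated phrase ("company", "name") are reported as well.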
UPDATE
I've used Tim Pietzcker's answer to start developing a function for this, but I'm stuck on applying the Counter correctly. It keeps returning the length of the list as the count for every phrase.
The phrases function, including a punctuation filter, a check whether the phrase has already been seen, and a maximum length of 3 words per phrase:
    def phrases(string, phrase_list):
        words = string.split()
        result = []
        punctuation = '\'\"-_,.:;!? '
        for number in range(len(words)):
            for start in range(len(words)-number):
                if number+1 <= 3:
                    phrase = " ".join(words[start:start+number+1])
                    if phrase in phrase_list:
                        pass
                    else:
                        phrase_list.append(phrase)
                        phrase = phrase.strip(punctuation).lower()
                        if phrase:
                            result.append(phrase)
        return result, phrase_list
And then the loop through the list of subjects:
    import collections

    phrase_list = []
    ranking = {}
    for s in subjects:
        result, phrase_list = phrases(s, phrase_list)
        all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)
"all_phrases" returns a list with tuples where each count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...