Python: Finding unknown repeated word(s) in a list of strings

Question

I have a list of strings, which are subjects from different email conversations. I would like to see if there are words or word combinations which are being used frequently.

An example list would be:

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal'
           ]

The function would have to detect that "Company Name" as combination is used more than once, and that "Proposal" is being used more than once. These words won't be known in advance though, so I guess it would have to start trying all possible combinations.

The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?

UPDATE

I've used Tim Pietzcker's answer to start developing a function for this, but I get stuck on applying the Counter correctly. It keeps returning the length of the list as count for all phrases.

The phrases function, including punctuation filter and a check if this phrase has already been checked, and a max length per phrase of 3 words:

def phrases(string, phrase_list):
  words = string.split()
  result = []
  punctuation = '\'\"-_,.:;!? '
  for number in range(len(words)):
      for start in range(len(words)-number):
        if number+1 <= 3:
          phrase = " ".join(words[start:start+number+1])
          if phrase in phrase_list:
            pass
          else:
            phrase_list.append(phrase)
            phrase = phrase.strip(punctuation).lower()
            if phrase:
               result.append(phrase)
  return result, phrase_list

And then the loop through the list of subjects:

phrase_list = []
ranking = {}
for s in subjects:
    result, phrase_list = phrases(s, phrase_list)
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list with tuples where each count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...

This is not a duplicate. At least not of that particular question. This is not about items in a list, it's about common phrases in a list of strings. Please read more than the title before closing. — André Laszlo, Mar 03 '16 at 15:07
The suggested duplicate question in no way whatsoever answers my question... — Vincent, Mar 03 '16 at 15:12
@TimPietzcker Why? It is *Exactly* a duplicate of counting items in a list. He just needs to figure out that is what he wants. — Inbar Rose, Mar 03 '16 at 15:15
@InbarRose: He doesn't know beforehand *what* he wants to count. — Tim Pietzcker, Mar 03 '16 at 15:15
@InbarRose: That's the whole point of the question. Don't close questions as duplicates if you're not sure beforehand they actually are duplicate. It's not a race. — Vincent Savard, Mar 03 '16 at 15:16
How big will this list be? The possibilities could be endless, especially if you're not just counting words but also phrases... — Tim Pietzcker, Mar 03 '16 at 15:17
It's not strictly counting items in a list... is this of any assistance? http://stackoverflow.com/questions/18715688/find-common-substring-between-two-strings — David Zemens, Mar 03 '16 at 15:18
OP makes no attempt to solve for himself, no sample code, no idea what he wants. Clearly and obviously he needs to count words and get the most common ones, enter - counting items in a list..... http://stackoverflow.com/a/5829377/1561176 — Inbar Rose, Mar 03 '16 at 15:18
@TimPietzicker: The list can easily be a few hundred subjects. — Vincent, Mar 03 '16 at 15:18
@InbarRose Reducing the above problem to just counting occurrences of items in a list seems non-trivial, at least to me. Just enumerating and counting all possible phrases would lead to a combinatorial explosion, I think. Maybe it would make for an interesting answer? — André Laszlo, Mar 03 '16 at 15:19
@InbarRose: If you think the question is unclear, either vote to close it as unclear or ask for clarification in comments. A duplicate means that the question is _the same_. Just because you also have to count elements to achieve a solution for this problem doesn't mean it's the same question. In case of doubt, don't do anything. — Vincent Savard, Mar 03 '16 at 15:20
@VincentSavard I believed this would give the OP a good direction, and save time for everyone involved. It seems that I underestimated the OP's determination to not attempt to answer their own questions. — Inbar Rose, Mar 03 '16 at 15:21
@InbarRose: No, you just completely misunderstand the point of closing as duplicate. A gold Python badge doesn't give you the right to arbitrarily close Python questions as you see fit. That's not how moderation on Stack Overflow works. — Vincent Savard, Mar 03 '16 at 15:23
I wasn't abusing anything, I voted to close this is a duplicate, that was what I feel. Not my fault that the gold star means I did it all by myself. This is an edge case. a question that both has a simple answer (in the form of an easily linkable dupe) and has no attempt to answer by the OP himself. I was torn between the two close options..... sorry that the one I chose is in disagreement with what you would have chosen. — Inbar Rose, Mar 03 '16 at 15:24
@InbarRose: I found this to be quite an interesting problem - not so much counting occurrences but rather building a list of all possible phrases from a string. Haven't seen that here before, and I think my answer can be a good starting point for OP to find out what he may need to do to make it actually work for him as intended. — Tim Pietzcker, Mar 04 '16 at 07:28
You can't skip phrases by only adding the ones that haven't been seen yet - that means every phrase would only be counted once. You need to count all possible phrases. What are you trying to do with your solution? Do you want to count `"This phrase!"` as identical to `"This phrase"`, but consider `"This / phrase"` different from the two? — Tim Pietzcker, Mar 04 '16 at 12:32
I removed my previous 2 comments and decided to go for a simpler case. I want to match a combination of max 3 words, not case-dependent, and count punctuation in the string as valid characters. That means that "This / phrase" would be different from "This phrase!", but the latter would be the same as "This phrase". Will post an update with code in a few minutes. — Vincent, Mar 04 '16 at 12:55
Just updated the code to reflect my last comment. As expected the collections.Counter still counts every phrase 167 times. — Vincent, Mar 04 '16 at 13:15

score 2 · Accepted Answer · answered Mar 04 '16 at 07:20

You also want to find phrases that are composed of more than single words. No problem. This should even scale quite well.

import collections

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal',
              'Some more Firm / Company Names'
           ]

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
             result.append(" ".join(words[start:start+number+1]))
    return result

The function phrases() splits the input string on whitespace and returns all possible substrings of any length:

In [2]: phrases("A Day in the Life")
Out[2]:
['A',
 'Day',
 'in',
 'the',
 'Life',
 'A Day',
 'Day in',
 'in the',
 'the Life',
 'A Day in',
 'Day in the',
 'in the Life',
 'A Day in the',
 'Day in the Life',
 'A Day in the Life']

Now you can count how many times each of these phrases are found in all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out [3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3), 
 ('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use other criteria than simply splitting on whitespace, maybe ignore punctuation and case etc.

Thanks, this was a great start. I've implemented this in the loop but am having some troubles with the counter. I've updated the question with the latest status. — Vincent, Mar 04 '16 at 11:31

pp_ · Answer 2 · 2016-03-03T15:39:39.457

0

I would suggest that you use space as a separator, otherwise there are too many possibilities if you don't specify how an allowed 'phrase' should look like.

To count word occurrences you can use Counter from the collections module:

import operator
from collections import Counter

d = Counter(' '.join(subjects).split())

# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name
3 Company
2 Proposal

edited Mar 03 '16 at 15:39

answered Mar 03 '16 at 15:20

pp_

3,435
4
19
27

Thanks, this is helpful. Probably by first getting the repeated words, I can then start looking for word combinations, using the words this function finds. I'll play around with this a bit and post my result here. – Vincent Mar 03 '16 at 15:23
A potential alternative to using `split()` to tokenize a sentence, you could also use the `work_tokenize()` function from the `nltk`. http://www.nltk.org/book/ch03.html – BrockLee Mar 03 '16 at 15:33

score 0 · Answer 3 · answered Mar 03 '16 at 15:42

Similar to pp_'s answer. Using Split.

import operator

subjects = [
          'Proposal to cooperate - Company Name',
          'Company Name Introduction',
          'Into Other Firm / Company Name',
          'Request for Proposal'
       ]
flat_list = [item for i in subjects for item in i.split() ]
count_dict = {i:flat_list.count(i) for i in flat_list}
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))

Output:

[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]

Python: Finding unknown repeated word(s) in a list of strings

3 Answers3