I am trying to count unique words in a nested list (or whatever it's called) that looks something like this:

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

(In case you need to know, this list is the result of a match function that looks for a list of names in Word documents in a directory on my computer. The empty lists are documents that matched nothing.)

So far I have tried

names1 = set(names)
len (names1)

And

Counter(names).keys() 
Counter(names).values()

but neither worked. Any help is appreciated.

Eisenheim
  • Here is a tip: recursively go into each list and count the words. – Jaideep Shekhar Nov 28 '20 at 16:08
  • One option would be to flatten the list of lists of lists with techniques from the old question https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists, then perform the count on the flattened list. – paisanco Nov 28 '20 at 16:15
  • @JaideepShekhar there are words that repeat across lists. How do I ensure that the repetition is taken into account? – Eisenheim Nov 28 '20 at 16:16
  • I don't understand why you expect that to cause a problem. – Karl Knechtel Nov 28 '20 at 16:19

2 Answers


This came to my mind:

from collections import defaultdict

d = defaultdict(int) # default int is 0
names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

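def find(ele):
    # A string is a name: count it
    if isinstance(ele, str):
        d[ele] += 1
    # A list may hold names or further lists: recurse into each item
    elif isinstance(ele, list):
        for e in ele:
            find(e)

find(names)
print(dict(d)) # {'John ': 2, 'Peter ': 1, 'Morgan': 1}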

It's a recursive function. If the element is a list, it calls itself on each item in that list; for an empty list the loop body never runs, so the call just returns. If the element is a string, it increments that string's count in the dictionary.
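
The same recursion can also be written without the module-level dictionary. The following is a minimal sketch, not the code from the answer: `count_strings` is a hypothetical helper name, and it returns a `collections.Counter` instead of updating a global `defaultdict`.

```python
from collections import Counter

def count_strings(ele):
    """Recursively count every string inside arbitrarily nested lists."""
    counts = Counter()
    if isinstance(ele, str):
        counts[ele] += 1
    elif isinstance(ele, list):
        for e in ele:
            # Merge the counts found deeper in the structure
            counts += count_strings(e)
    return counts

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]
result = count_strings(names)
print(dict(result))  # {'John ': 2, 'Peter ': 1, 'Morgan': 1}
print(len(result))   # 3 unique strings
```

Returning the counts keeps the function reusable: calling it twice does not accumulate results from the first call.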

hidden
lionbigcat
  • This is a good answer, but your `if len(ele) > 0..else` is not necessary. If the length is zero, the for loop will simply not execute. – RufusVS Nov 28 '20 at 16:21
  • Good point. Didn't think it through and wanted to make everything as explicit as possible for the OP - or for any question/answer discussion. – lionbigcat Nov 28 '20 at 16:24
  • @lionbigcat Thank you, that worked. How do I make it case-insensitive, though? It is currently listing Morgan and MORGAN as two different words. – Eisenheim Nov 28 '20 at 16:39
  • @Eisenheim just convert it all to lowercase. – Jaideep Shekhar Nov 28 '20 at 16:40
  • @Eisenheim, `d[ele.lower()] += 1` – lionbigcat Nov 28 '20 at 16:42
  • @lionbigcat One more thing. I am trying to write it to Excel using `pd.DataFrame(d) count.to_excel(writerrr, sheet_name='count', index=False)` but it throws this error: If using all scalar values, you must pass an index – Eisenheim Nov 28 '20 at 17:06
  • @Eisenheim take a look at this: https://stackoverflow.com/a/17840195/8721930. Googling error messages will, more often than not, yield good results. – lionbigcat Nov 28 '20 at 17:15
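
The two follow-ups in the comments above (case-insensitive counting and the pandas error) can be sketched together. This is an illustration under assumptions, not code from the thread: `count_names` is a hypothetical helper, the sample list is made up, and stripping whitespace with `strip()` is an extra normalization step added here because the original data contains trailing spaces such as `'John '`. The block itself uses only the standard library; the resulting (name, count) rows are a shape that `pd.DataFrame(rows, columns=[...])` accepts, which avoids passing a dict of scalar values.

```python
from collections import Counter

def count_names(ele, counts=None):
    """Count strings in nested lists, ignoring case and surrounding whitespace."""
    if counts is None:
        counts = Counter()
    if isinstance(ele, str):
        counts[ele.strip().lower()] += 1
    elif isinstance(ele, list):
        for e in ele:
            count_names(e, counts)
    return counts

# Hypothetical sample with mixed case, similar to the situation in the comments
names = [[], [['Morgan'], ['MORGAN ']], [], [['John ', 'John ']]]
counts = count_names(names)
print(dict(counts))  # {'morgan': 2, 'john': 2}

# One (name, count) row per entry. With pandas installed, these rows could be
# passed as pd.DataFrame(rows, columns=["name", "count"]) before to_excel.
rows = sorted(counts.items())
print(rows)  # [('john', 2), ('morgan', 2)]
```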

My attempt at making the comment by @Jaideep Shekhar more explicit, and also including your original use of the Counter object:

from collections import Counter

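names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

wordcount_dict = Counter()

# names -> one entry per document -> lists of matched names
for elem in names:
    for namelist in elem:
        wordcount_dict += Counter(namelist)  # add this list's counts

print(len(wordcount_dict))  # 3 unique names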
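
The nested loops above assume the names always sit exactly two levels below the top-level list. If the depth can vary, one option is to flatten first, as suggested in the comments on the question, and then count. A minimal sketch, assuming the data contains only lists and strings; `flatten` is a hypothetical helper that uses an explicit stack of iterators instead of recursion:

```python
from collections import Counter

def flatten(nested):
    """Yield every non-list item found in arbitrarily nested lists, in order."""
    stack = [iter(nested)]
    while stack:
        for item in stack[-1]:
            if isinstance(item, list):
                # Descend into the sublist; resume this level afterwards
                stack.append(iter(item))
                break
            yield item
        else:
            # Current level is exhausted, go back up one level
            stack.pop()

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]
wordcount = Counter(flatten(names))
print(len(wordcount))            # 3
print(wordcount.most_common(1))  # [('John ', 2)]
```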
nenb