How do I count unique words in a list/set in python that is kind of nested/complicated

Question

I am trying to count unique words in a list/set (or whatever it's called) that looks something like this:

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

(In case you need to know, this list was formed as a result of a match function that looks for a list of names in Word documents in a directory on my computer. The empty lists you see are documents that matched nothing.)

So far I have tried

names1 = set(names)
len (names1)

And

Counter(names).keys() 
Counter(names).values()

but neither worked. Any help is appreciated.

Here is a tip: recursively go into each list and count the words. — Jaideep Shekhar, Nov 28 '20 at 16:08
One option would be to flatten the list of lists of lists with techniques from the old question https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists, then perform the count on the flattened list. — paisanco, Nov 28 '20 at 16:15
@JaideepShekhar there are words that repeat across lists. How do I ensure that the repetition is taken into account — Eisenheim, Nov 28 '20 at 16:16

score 3 · Accepted Answer · edited Nov 28 '20 at 16:27

This came to my mind:

from collections import defaultdict

d = defaultdict(int) # default int is 0
names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

def find(ele):
    if isinstance(ele, str):
        d[ele] += 1
    
    if isinstance(ele, list):
        for e in ele:
            find(e)
    
find(names)
print(dict(d))  # {'John ': 2, 'Peter ': 1, 'Morgan': 1}

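It's a recursive function. If the element is a string, it increments that string's count in the dictionary. If the element is a list, the function calls itself on each item; an empty list produces no iterations, so no explicit emptiness check is needed. Anything else is ignored. Once it has run, `len(d)` gives the number of unique words (3 for the list above).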
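The same recursion can also be written without the module-level dictionary. The following is only a sketch of that variation: a generator walks the nested lists and a `Counter` tallies whatever it yields. The names `iter_strings` and `count_words` are made up for this illustration and do not come from the answer.

```python
from collections import Counter

def iter_strings(obj):
    """Yield every string found in an arbitrarily nested list."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, list):
        for item in obj:
            # Recurse into sub-lists; empty lists yield nothing.
            yield from iter_strings(item)

def count_words(nested):
    """Return a Counter mapping each string to its number of occurrences."""
    return Counter(iter_strings(nested))

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

counts = count_words(names)
print(dict(counts))  # {'John ': 2, 'Peter ': 1, 'Morgan': 1}
print(len(counts))   # 3 unique words
```

Returning the counter instead of mutating a global makes the function safe to call more than once, for example once per directory.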

answered Nov 28 '20 at 16:16

lionbigcat


This is a good answer, but your `if len(ele) > 0..else` is not necessary. If the length is zero, the for loop will simply not execute. – RufusVS Nov 28 '20 at 16:21
Good point. Didn't think it through and wanted to make everything as explicit as possible for the OP - or for any question/answer discussion. – lionbigcat Nov 28 '20 at 16:24
@lionbigcat Thank you, that worked. How do I make it case-insensitive, though? It is currently listing Morgan and MORGAN as two different words. – Eisenheim Nov 28 '20 at 16:39
@Eisenheim just convert it all to lowercase. – Jaideep Shekhar Nov 28 '20 at 16:40
@Eisenheim, `d[ele.lower()] += 1` – lionbigcat Nov 28 '20 at 16:42
@lionbigcat One more thing. I am trying to write it to Excel using `pd.DataFrame(d)` and then `count.to_excel(writerrr, sheet_name='count', index=False)`, but it throws this error: "If using all scalar values, you must pass an index" – Eisenheim Nov 28 '20 at 17:06
@Eisenheim take a look at this: https://stackoverflow.com/a/17840195/8721930. Googling error messages will, more often than not, yield good results. – lionbigcat Nov 28 '20 at 17:15
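Pulling the two follow-ups from these comments together, here is a rough sketch. It assumes names should be compared after lowercasing and trimming whitespace; the `strip()` is my own assumption, since the comments only mention `lower()`. The helper `normalized_counts` and the sample list are hypothetical. The last step reshapes the counts into (name, count) rows, because a list of rows can be passed to `pd.DataFrame(rows, columns=[...])`, whereas a dict of bare scalars is what triggers the "must pass an index" error.

```python
from collections import Counter

def normalized_counts(nested):
    """Count strings in a nested list, ignoring case and surrounding spaces."""
    counts = Counter()
    stack = [nested]  # explicit stack instead of recursion
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            counts[item.strip().lower()] += 1
        elif isinstance(item, list):
            stack.extend(item)
    return counts

# Hypothetical sample with mixed case and trailing spaces
names = [[], [['John ', 'JOHN'], ['Peter ']], [], [['Morgan', 'MORGAN ']]]

counts = normalized_counts(names)
rows = sorted(counts.items())  # list of (name, count) tuples
print(rows)  # [('john', 2), ('morgan', 2), ('peter', 1)]

# With pandas installed, rows like these can be turned into a table:
#   pd.DataFrame(rows, columns=['name', 'count'])
```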

score 0 · Answer 2 · answered Nov 28 '20 at 16:38

My attempt at making the comment by @Jaideep Shekhar more explicit, and also including your original use of the Counter object:

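from collections import Counter

# Same nested list as in the question
names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

wordcount_dict = Counter()

for elem in names:           # one entry per document
    for namelist in elem:    # one list of matches inside each entry
        wordcount_dict += Counter(namelist)

print(len(wordcount_dict))   # number of unique words: 3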
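If the nesting is always exactly this deep (documents, then match lists, then strings), the two loops can be collapsed with `itertools.chain.from_iterable`. This is only a sketch under that fixed-depth assumption; it will not descend into lists nested any further.

```python
from collections import Counter
from itertools import chain

names = [[], [], [], [], [], [['John ', 'John '], ['Peter ']], [], [], [], [['Morgan']], [], [], []]

# Flatten exactly two levels: documents -> match lists -> strings
flat = chain.from_iterable(chain.from_iterable(names))
wordcount = Counter(flat)

print(len(wordcount))            # 3
print(wordcount.most_common(1))  # [('John ', 2)]
```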

nenb

This solution doesn't handle nested lists. – RufusVS Nov 28 '20 at 17:12
That's true. I guess I took the OP's example too literally. – nenb Nov 28 '20 at 17:28
Actually, considering what he calls his data source, your interpretation probably is fine! I don't think his search results would be nested any further. – RufusVS Nov 28 '20 at 17:32
