Python loop/comprehension for a nested word count

Question

I'm analyzing some user data, and I've got a list of usernames (preprocessed to lowercase), something like this:

name_list = ['joebob', 'sallycat', 'bigbenny', 'davethepirate', 'nightninja', ...(many more)]

I also have a dictionary of comparisons I'd like to run on those names, to see how often certain words show up compared to certain others. For example:

comparisons = {"Pirates vs Ninjas": ["pirate", "ninja"],
               "Cats vs Dogs": ["cat", "dog"]}

I'm trying to get a loop/comprehension with output that would look like

{"Pirates vs Ninjas": {"pirate": 224, "ninja": 342},
 "Cats vs Dogs": {"cat": 430, "dog": 391}}

(With the numbers above just being examples of end result word counts)

I know all the individual components necessary to make it work (dictionary comprehensions and dict.get). What is the right way to put it all together?

Edit for clarification: I want to see how many usernames contain the word "cat", and record that next to the number that contain the word "dog". The results will be logged in a dict under the key "Cats vs Dogs". I would then do the same with the next comparison, "Pirates vs Ninjas".

I don't really get what you want to achieve, but have a look at the Counter approach used with nltk in https://stackoverflow.com/questions/10677020/real-word-count-in-nltk/25686874. Maybe you can tokenize your comparisons and then count the word values; is that what you want? – T. Kelher, Jun 16 '21 at 20:35
I had intended the bit in the first paragraph, "to see how often certain words show up compared to certain others", to cover it, but I might not have been clear. Yes, it's how often the word "cat" appears across all usernames compared to how often "dog" does. Once that's done, do a similar comparison of "pirate" vs "ninja" in usernames. Repeat for all comparisons. – Josh, Jun 16 '21 at 20:48

Mustafa Aydın · Accepted Answer · 2021-06-16T21:38:16.847

from collections import Counter

c = Counter(user_names)

result = {category: {entry: c[entry] for entry in entries}
          for category, entries in comparisons.items()}

This first runs a Counter over the list to get a username -> count mapping, then uses a nested dict comprehension over the comparisons. The Counter gives 0 if an entry doesn't exist in it.

Above, for example:

category == "Pirates vs Ninjas"
entry == "pirate"
entries == ["pirate", "ninja"]

Sample data:

user_names = ["pirate", "dog", "this", "ninja", "that", "cat", "cat", "ninja", "other", "cat"]

c = Counter(user_names)

result = {category: {entry: c[entry] for entry in entries}
          for category, entries in comparisons.items()}

then

>>> result
{'Pirates vs Ninjas': {'pirate': 1, 'ninja': 2}, 'Cats vs Dogs': {'cat': 3, 'dog': 1}}
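
Note that Counter tallies whole list elements, so it only counts usernames that are exactly equal to an entry. As a minimal sketch, using the sample usernames from the question (the variable names exact_pirate and contains_pirate are just illustrative), this is why longer usernames come up as 0:

```python
from collections import Counter

# Sample usernames from the question: each contains a word, but none equals one.
name_list = ['joebob', 'sallycat', 'bigbenny', 'davethepirate', 'nightninja']

c = Counter(name_list)

# Counter looks up whole elements; a missing key returns 0 instead of raising KeyError.
exact_pirate = c['pirate']

# A substring test finds the word inside a longer username.
contains_pirate = sum('pirate' in name for name in name_list)

print(exact_pirate, contains_pirate)  # prints: 0 1
```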

To allow for case-insensitive and partial matches, we won't use Counter but sum:

result = {category: {entry: sum(entry in name for name in user_names)
                     for entry in map(str.lower, entries)}
          for category, entries in comparisons.items()}

Here we first map the entries to lowercase before searching (the usernames are assumed to be lowercase already, as in the question), and we count not only exact matches but also "contains"-type matches via the in operator and sum.
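
For reference, here is a self-contained sketch of the "contains" count that can be run as is. The function name count_containing and the sample names are made up for illustration. Unlike the snippet in the answer, it also lowercases the usernames, on the assumption that they may not have been preprocessed:

```python
def count_containing(names, comparisons):
    """Count, per comparison, how many names contain each entry (case-insensitive)."""
    # Lowercase the names once up front rather than on every membership test.
    lowered = [name.lower() for name in names]
    return {
        category: {
            # `entry in name` is a bool; sum() adds True as 1 and False as 0.
            entry: sum(entry.lower() in name for name in lowered)
            for entry in entries
        }
        for category, entries in comparisons.items()
    }


# Hypothetical sample data, mixed case on purpose.
names = ['JoeBob', 'sallycat', 'bigbenny', 'DaveThePirate', 'nightninja', 'catdog']
comparisons = {"Pirates vs Ninjas": ["pirate", "ninja"],
               "Cats vs Dogs": ["cat", "dog"]}

result = count_containing(names, comparisons)
print(result)
```

A username such as 'catdog' is counted once under "cat" and once under "dog", since each entry is tested independently.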

edited Jun 16 '21 at 21:38

answered Jun 16 '21 at 20:35


It's super close: I get exactly the type of output I'm looking for, but it doesn't actually seem to be counting from user_names, as it all comes up 0 (when I know for certain all the words are there multiple times). My full code is just what you have there, plus importing the list of usernames and the definition of comparisons shown above. – Josh Jun 16 '21 at 21:13
@Josh I added a sample `user_names` list, and the same `comparisons` dict is used as in the question; the counts seem to be nonzero. But two things: as I understand from the edit, you'd like a case-insensitive search, and not necessarily an exact match but a "contains" kind of match; is that correct? – Mustafa Aydın Jun 16 '21 at 21:34
@Josh For those 2 cases, I added a snippet at the very end and tried to explain it; it doesn't use `Counter` but is case-insensitive and also counts partial matches; hope it helps. – Mustafa Aydın Jun 16 '21 at 21:39
Thank you! Case wasn't an issue as I had preprocessed to lowercase assuming easier matching but yes it was the containment portion that the counter was missing. This worked exactly as intended and surprisingly fast! Thanks again – Josh Jun 17 '21 at 02:34
