finding frequencies of pair items in a list of pairs

Question

Let's say I have a long list of this type:

text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b'], ... ]

Given the first elements, I want to construct a dictionary that would show a count of the second elements. For example in the particular example above, I'd like to have something like this:

{'a': {'b':2, 'd':1},
 'w': {'a':1}
}

Here's how I unsuccessfully tried to solve it. I constructed a list of unique first elements. Let's call it words and then:

dic = {}

for word in words:
  inner_dic = {}
  for pair in text:
    if pair[0] == word:
      num = text.count(pair)
      inner_dic[pair[1]] = num
  dic[pair[0]] = inner_dic

I get an obviously erroneous result. One problem with the code is, it overcounts pairs. I am not sure how to solve this.

using .count is not efficient – Padraic Cunningham Feb 15 '15 at 16:44 — Padraic Cunningham, Feb 15 '15 at 16:44

Padraic Cunningham · Answer 1 · 2015-02-15T16:48:25.267

You can use a defaultdict combined with a Counter dict:

from collections import Counter, defaultdict
d = defaultdict(Counter)

text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b'] ]

for k, v in text:
    d[k][v] += 1 # for single value
   # d[k].update(v) for multiple values i.e list of words

from pprint import pprint as pp

pp(d)
{'a': Counter({'b': 2, 'd': 1}),
'w': Counter({'a': 1})}

The defaultdict will create a new key/value pairing where the value is a Counter dict if the key does not exist, if the key exists we just update the value using Counter.update which unlike dict.update will increment the value not overwrite.

using a normal dict without imports we can use dict.setdefault which will create a new dict as a value for each key k and set a default value of 0 for each subkey v:

d = {}
text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b'] ]

for k, v in text:
    d.setdefault(k, {}).setdefault(v,0)
    d[k][v] += 1
pp(d)
{'a': {'b': 2, 'd': 1}, 'w': {'a': 1}}

I want to save the final dictionary using `json`. Sorry if the question is too naive, but does that superfluous `Counter` method appear in the final text file if I dump it as-is? — Morteza R, Feb 15 '15 at 16:22
@schmutter, you can cast to a normal dict at the end if required, is there a particular reason you want to? — Padraic Cunningham, Feb 15 '15 at 16:23

Raymond Hettinger · Answer 2 · 2015-02-15T16:37:47.287

The collections module makes short work of tasks like this.

Use a Counter for the counting part (it is a kind of dictionary that returns 0 for missing values, making it easy to use +=1 for incrementing counts). Use defaultdict for the outer dict (it can automatically make a new counter for each "first" prefix):

>>> from collections import defaultdict, Counter
>>> d = defaultdict(Counter)
>>> text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b']]
>>> for first, second in text:
    d[first][second] += 1

Here is the equivalent using regular dictionaries:

text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b']]

d = {}
for first, second in text:
    if first not in d:
        d[first] = {}
    inner_dict = d[first]
    if second not in inner_dict:
        inner_dict[second] = 0
    inner_dict[second] += 1

Either the short way or the long way will work perfectly with the json module (both Counter and defaultdict are kinds of dicts that can be JSON encoded).

Hope this helps. Good luck with your text analysis :-)

score 5 · Accepted Answer · answered Feb 15 '15 at 16:21

5

You should do this instead:

for word in words:
  inner_dic = {}
  for pair in text:
    if pair[0] == word:
      num = text.count(pair)
      inner_dic[pair[1]] = num
  dic[word] = inner_dic

that is, you should be doing dic[word] rather than dic[pair[0]], which will assign the inner_dic to the first element in the last pair checked, even if pair[0] isn't word.

answered Feb 15 '15 at 16:21

rlms

10,650
8
44
61

2

Thanks. This is the best answer, because you actually read my code and saw where it went wrong instead of writing something from scratch using ready-made libraries. Kudos! – Morteza R Feb 15 '15 at 16:28
I believe there is a separate forum for just code-review. On SO, there should be some value placed on answers that demonstrate the most Pythonic solutions so that others may benefit. Otherwise, we're left with code that is inefficient (such as *str.count*), that uses indexing instead of unpacking (such as ``pair[0]`` and pair[1]`` versus ``for first, second in text``), and that avoids standard library solutions that were specifically designed to solve exactly this kind of problem). So, while you may be happy that someone spotted the error in your code, the answer you selected isn't that great. – Raymond Hettinger Feb 15 '15 at 23:42
@RaymondHettinger: Thanks for your input. I agree that if only efficiency is taken into account, there are better answers. But nobody initially pointed out the inefficiency in my small snippet of code. There should also be some value placed on answers that take into account the OP's attempts and constraints. Otherwise it becomes a contest of who posts the most Pythonic code sooner, without any regard to the thinking process of the person who is trying to solve their problem. At any rate, thanks for your great answer and I apologize if you think I didn't do justice. – Morteza R Feb 16 '15 at 09:30
@RaymondHettinger I would agree that your answer contains better code (it got +1 from me), but I think in good questions which include code it is usually helpful to try and answer based on that code - speaking for myself, I learn more from people correcting my code than providing complete alternatives. – rlms Feb 17 '15 at 17:12
I agree fully and gave a +1 to this answer. I just wish the selected answers represented the best practices so that future users of SO who read this will find the right way to do things and have a nice model for how to make their own variants. – Raymond Hettinger Feb 17 '15 at 20:43

dawg · Answer 4 · 2015-02-15T18:36:54.360

1

Here is a way using the .setdefault method:

text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b'] ]
result={}
for x, y in text:
    result.setdefault(x, {}).setdefault(y,0)
    result[x][y]+=1

>>> result 
{'a': {'b': 2, 'd': 1}, 'w': {'a': 1}}

No external library required.

edited Feb 15 '15 at 18:36

answered Feb 15 '15 at 18:20

dawg

98,345
23
131
206

+1 for the straight-forward use of *dict.setdefault* which was designed to solve exactly this problem. Too bad the OP is averse to using the standard library which offers even cleaner solutions. – Raymond Hettinger Feb 15 '15 at 23:49

score 0 · Answer 5 · answered Feb 15 '15 at 16:29

0

text = [ ['a', 'b'], ['a', 'd'], ['w', 'a'], ['a', 'b']]
d = {}
for i in text:
    if d.get(i[0]):
        if d[i[0]].get(i[1]):
            d[i[0]][i[1]] +=1
        else:
            d[i[0]][i[1]] = 1 
    else:
        d[i[0]] = {i[1] : 1}
print d
>>>{'a': {'b': 2, 'd': 1}, 'w': {'a': 1}}

answered Feb 15 '15 at 16:29

itzMEonTV

19,851
4
39
49

The good news is that this gets the right answer. The slightly bad news is that using *dict.get* instead of the *in*-operator is slower, less clear, and a bit risky (for applications where *None* is a valid value in a dict). The mostly bad news is that this code is far from Pythonic -- the repeated use of indexing, ``i[0]`` and ``i[1]`` makes the code almost unreadable. – Raymond Hettinger Feb 15 '15 at 23:47

finding frequencies of pair items in a list of pairs

5 Answers5

Linked