16

I have a (quite large) dictionary with numeric values, for example in the form `data = {'a': 0.2, 'b': 0.3, ...}`. What is the best way to normalize these values (EDIT: make sure the values sum to 1)?

And what I'm especially interested in: would it, for certain dataset sizes, be beneficial to use e.g. numpy instead of a dict comprehension?

I'm using Python 2.7.

jamylak

2 Answers

32

Try this to modify in place:

d = {'a': 0.2, 'b': 0.3}
factor = 1.0 / sum(d.itervalues())
for k in d:
    d[k] = d[k] * factor

result:

>>> d
{'a': 0.4, 'b': 0.6}

Alternatively, to produce a new dictionary, use a dict comprehension:

d = {'a': 0.2, 'b': 0.3}
factor = 1.0 / sum(d.itervalues())
normalised_d = {k: v * factor for k, v in d.iteritems()}

Note the use of `d.iteritems()`, which uses less memory than `d.items()` and so is better for a large dictionary.
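As for the numpy part of the question: for a large dictionary you can vectorise the division, at the cost of copying the values into an array and back. A rough sketch (the `keys`/`vals` names are just for illustration; whether this beats a plain loop depends on the size, so time it on your own data):

import numpy as np

d = {'a': 0.2, 'b': 0.3}
keys = list(d)  # fix an ordering of the keys
vals = np.fromiter((d[k] for k in keys), dtype=float, count=len(d))
vals /= vals.sum()  # one vectorised division instead of a Python loop
normalised_d = dict(zip(keys, vals))  # values come back as numpy floats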

EDIT: Since there are quite a few of them, and getting this right seems to be important, I've summarised all the ideas from the comments on this answer into the following (including borrowing something from this post):

import math
import operator

def really_safe_normalise_in_place(d):
    # math.fsum tracks partial sums to avoid float rounding error
    factor = 1.0 / math.fsum(d.itervalues())
    for k in d:
        d[k] = d[k] * factor
    # Plug any remaining rounding error into the largest value, so the
    # total is exactly 1.0 and the relative error of the fix is smallest
    key_for_max = max(d.iteritems(), key=operator.itemgetter(1))[0]
    diff = 1.0 - math.fsum(d.itervalues())
    #print "discrepancy = " + str(diff)
    d[key_for_max] += diff

d = {v: v + 1.0 / v for v in xrange(1, 1000001)}
really_safe_normalise_in_place(d)
print math.fsum(d.itervalues())

It took a couple of goes to come up with a dictionary that actually produced a non-zero error when normalising, but I hope this illustrates the point.

EDIT: For Python 3.0, see the following change from the Python 3.0 Wiki Built-in Changes:

Remove dict.iteritems(), dict.iterkeys(), and dict.itervalues().

Instead: use dict.items(), dict.keys(), and dict.values() respectively.
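
For reference, a sketch of the safe in-place version with those substitutions applied for Python 3 (behaviour otherwise unchanged):

import math
import operator

def really_safe_normalise_in_place(d):
    factor = 1.0 / math.fsum(d.values())  # values() is a view in Python 3
    for k in d:
        d[k] = d[k] * factor
    # plug the rounding error into the largest value
    key_for_max = max(d.items(), key=operator.itemgetter(1))[0]
    d[key_for_max] += 1.0 - math.fsum(d.values())

d = {v: v + 1.0 / v for v in range(1, 1000001)}
really_safe_normalise_in_place(d)
print(math.fsum(d.values()))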

anilbey
Benedict
  • better use `d.itervalues` (since you are iterating through the values twice & storing them in a list in memory if you use `d.values()`) – jamylak May 07 '13 at 11:47
  • Thanks - that's a good point. I've updated to use itervalues. – Benedict May 07 '13 at 11:58
  • To improve accuracy, use *math.fsum()* instead of *sum()*. With the OP's large dictionary, there's more potential for loss of accuracy during summation, resulting in a *factor* that is a little off. – Raymond Hettinger May 07 '13 at 11:59
  • Also, to make sure the sum is *exactly* one, there can be a final pass to "plug the difference" back into the values. For example, this is what people do when they split a dollar three ways into 33 cents, 33 cents, and 34 cents. – Raymond Hettinger May 07 '13 at 12:03
  • @RaymondHettinger do you need a final pass or can you just add the difference to a random key with a running sum? – jamylak May 07 '13 at 12:09
  • Actually you would probably need to `fsum` again so the final pass would be needed – jamylak May 07 '13 at 12:15
  • @jamylak I would make a final pass using *math.fsum()* and then plug the difference into the value with the largest magnitude. That should get you to a total of exactly 1.0 and would minimise the relative error of the adjusted value. – Raymond Hettinger May 07 '13 at 12:16
  • Might want to check if the sum of the values is 0, otherwise you might get an error. – Ryan Bavetta Oct 11 '13 at 01:46
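
Following up on that last comment, a minimal guard might look like this (a sketch; raising `ValueError` is just one reasonable way to handle it):

import math

def safe_factor(d):
    total = math.fsum(d.itervalues())
    if total == 0.0:
        raise ValueError("cannot normalise: values sum to zero")
    return 1.0 / total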
6
def normalize(d, target=1.0):
    raw = sum(d.itervalues())  # itervalues() avoids building a list for a large dict
    factor = target / raw
    return {key: value * factor for key, value in d.iteritems()}

Use it like this:

>>> data = {'a': 0.2, 'b': 0.3, 'c': 1.5}
>>> normalize(data)
{'b': 0.15, 'c': 0.75, 'a': 0.1}
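
Since the function takes a target, the same code can, for example, produce percentages (here factor = 100 / 2.0 = 50):

>>> normalize(data, target=100)
{'b': 15.0, 'c': 75.0, 'a': 10.0}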
Tim Pietzcker