16

I have a (quite large) dictionary with numeric values, for example in the form `data = {'a': 0.2, 'b': 0.3, ...}`. What is the best way to normalize these values (EDIT: make sure the values sum to 1)?

And what I'm especially interested in: would it, for certain dataset sizes, be beneficial to use e.g. numpy instead of a dict comprehension?

I'm using Python 2.7.

jamylak

2 Answers

32

Try this to modify in place:

d = {'a': 0.2, 'b': 0.3}
factor = 1.0 / sum(d.itervalues())
for k in d:
    d[k] = d[k] * factor

result:

>>> d
{'a': 0.4, 'b': 0.6}

Alternatively, to produce a new dictionary, use a dict comprehension:

d = {'a': 0.2, 'b': 0.3}
factor = 1.0 / sum(d.itervalues())
normalised_d = {k: v * factor for k, v in d.iteritems()}

Note the use of `d.iteritems()`, which uses less memory than `d.items()` and so is better for a large dictionary.
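As for the numpy part of the question: for a large dictionary you can vectorise the division, at the cost of copying the values into an array and back. A rough sketch (the `keys`/`vals` names are just for illustration; whether this beats a plain loop depends on the size, so time it on your own data):

import numpy as np

d = {'a': 0.2, 'b': 0.3}
keys = list(d)  # fix an ordering of the keys
vals = np.fromiter((d[k] for k in keys), dtype=float, count=len(d))
vals /= vals.sum()  # one vectorised division instead of a Python loop
normalised_d = dict(zip(keys, vals))  # values come back as numpy floats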

EDIT: Since there are quite a few of them, and getting this right seems to be important, I've summarised all the ideas from the comments on this answer into the following (including borrowing something from this post):

import math
import operator

def really_safe_normalise_in_place(d):
    # math.fsum tracks partial sums to avoid float rounding error
    factor = 1.0 / math.fsum(d.itervalues())
    for k in d:
        d[k] = d[k] * factor
    # Plug any remaining rounding error into the largest value, so the
    # total is exactly 1.0 and the relative error of the fix is smallest
    key_for_max = max(d.iteritems(), key=operator.itemgetter(1))[0]
    diff = 1.0 - math.fsum(d.itervalues())
    #print "discrepancy = " + str(diff)
    d[key_for_max] += diff

d = {v: v + 1.0 / v for v in xrange(1, 1000001)}
really_safe_normalise_in_place(d)
print math.fsum(d.itervalues())

It took a couple of goes to come up with a dictionary that actually produced a non-zero error when normalising, but I hope this illustrates the point.

EDIT: For Python 3.0, see the following change from the Python 3.0 Wiki Built-in Changes:

Remove dict.iteritems(), dict.iterkeys(), and dict.itervalues().

Instead: use dict.items(), dict.keys(), and dict.values() respectively.
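
For reference, a sketch of the safe in-place version with those substitutions applied for Python 3 (behaviour otherwise unchanged):

import math
import operator

def really_safe_normalise_in_place(d):
    factor = 1.0 / math.fsum(d.values())  # values() is a view in Python 3
    for k in d:
        d[k] = d[k] * factor
    # plug the rounding error into the largest value
    key_for_max = max(d.items(), key=operator.itemgetter(1))[0]
    d[key_for_max] += 1.0 - math.fsum(d.values())

d = {v: v + 1.0 / v for v in range(1, 1000001)}
really_safe_normalise_in_place(d)
print(math.fsum(d.values()))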

anilbey
Benedict
  • better use `d.itervalues` (since you are iterating through the values twice & storing them in a list in memory if you use `d.values()`) – jamylak May 07 '13 at 11:47
  • Thanks - that's a good point. I've updated to use itervalues. – Benedict May 07 '13 at 11:58
  • To improve accuracy, use *math.fsum()* instead of *sum()*. With the OP's large dictionary, there's more potential for loss of accuracy during summation, resulting in a *factor* that is a little off. – Raymond Hettinger May 07 '13 at 11:59
  • Also, to make sure the sum is *exactly* one, there can be a final pass to "plug the difference" back into the values. For example, this is what people do when they split a dollar three ways into 33 cents, 33 cents, and 34 cents. – Raymond Hettinger May 07 '13 at 12:03
  • @RaymondHettinger do you need a final pass or can you just add the difference to a random key with a running sum? – jamylak May 07 '13 at 12:09
  • Actually you would probably need to `fsum` again so the final pass would be needed – jamylak May 07 '13 at 12:15
  • @jamylak I would make a final pass using *math.fsum()* and then plug the difference into the value with the largest magnitude. That should get you to a total of exactly 1.0 and would minimise the relative error of the adjusted value. – Raymond Hettinger May 07 '13 at 12:16
  • Might want to check if the sum of the values is 0, otherwise you might get an error. – Ryan Bavetta Oct 11 '13 at 01:46
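
Following up on that last comment, a minimal guard might look like this (a sketch; raising `ValueError` is just one reasonable way to handle it):

import math

def safe_factor(d):
    total = math.fsum(d.itervalues())
    if total == 0.0:
        raise ValueError("cannot normalise: values sum to zero")
    return 1.0 / total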
6
def normalize(d, target=1.0):
    raw = sum(d.itervalues())  # itervalues() avoids building a list for a large dict
    factor = target / raw
    return {key: value * factor for key, value in d.iteritems()}

Use it like this:

>>> data = {'a': 0.2, 'b': 0.3, 'c': 1.5}
>>> normalize(data)
{'b': 0.15, 'c': 0.75, 'a': 0.1}
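
Since the function takes a target, the same code can, for example, produce percentages (here factor = 100 / 2.0 = 50):

>>> normalize(data, target=100)
{'b': 15.0, 'c': 75.0, 'a': 10.0}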
Tim Pietzcker