0

Possible Duplicate:
How to raise error if duplicates keys in dictionary

I was recently generating huge dictionaries with hundreds of thousands of keys (such that noticing a bug by looking at them wasn't feasible). They were syntactically correct, yet there was a bug somewhere. It boiled down to "duplicate keys":

{'a':1, ..., 'a':2}

this code compiles fine and I could not figure out why a key has value of 2 as I expected 1. The problem is obvious now.

The question is how I can prevent that in the future. I think this is impossible within python. I used

grep "'.*'[ ]*:" myfile.py | sort | uniq -c | grep -v 1

which is not bulletproof. Any other ideas (within python, this grep is just to illustrate what I'd tried)?

EDIT: I don't want duplicate keys, just need to spot that this occurs and edit data manually

Community
  • 1
  • 1
  • Is the problem that the duplicate keys are in your data? Meaning you want them flagged (or duplicates ignored). – Dan Goldsmith Oct 10 '12 at 09:42
  • The simplest approach would be to create and use a custom subclass of dict (see link to question above) which fails with an informative error message when you attempt to add duplicate keys. You can even modify the behaviour to ignore duplicates if that is really what you want. – Shawn Chin Oct 10 '12 at 09:43
  • 1
    Now i see that this is, in fact, a dupe, the answer linked solves my problem. I wonder why is such a behaviour (informing the user that he's duplicating keys) is not default in Python, I can't imagine a situation when doubling keys would be desirable. Should I close/remove this question? (how?) –  Oct 10 '12 at 10:26
  • 2
    the `dict` object in memory **can't** contain duplicate keys (equal objects with the same hash). How do you serialize the dictionary to a file? – jfs Oct 10 '12 at 10:51

4 Answers4

0

A dict cannot contain double keys. So all you need to do is execute the code and then dump the repr() of the dict.

Another option is creating the dict items as (key, value) tuples. By storing them in a list you can easily create a dict from them and then check if the len()s of the dict/list differ.

ThiefMaster
  • 310,957
  • 84
  • 592
  • 636
  • The problem is that a dict **can** contain double keys as my problem demonstrated but it reassigns new values silently. How is `repr()` going to help me? Remember that the dictionary is huge and I cannot infer anything by looking at MBs of dumped data. I find the second part of your answer valuable though, although it lowers readability of my code a bit. Thanks –  Oct 10 '12 at 10:21
  • It will not detect them but you can create a new dict without those dupes. – ThiefMaster Oct 10 '12 at 10:34
  • @user35186: A Python dictionary **cannot** contain duplicate keys. On the other hand a [Dictionary Display](http://docs.python.org/reference/expressions.html#dictionary-displays), which is basically a string representation that could be used to construct a dictionary, can contain them. This is what you show in your question. You need to detect duplicates in the dictionary display you're creating. – martineau Oct 10 '12 at 15:27
  • Sorry, I didn't make myself clear, of course it cannot contain duplicate keys in memory as you and others point out but it can in code, in its string representation and produce valid (but misleading) syntax. –  Oct 10 '12 at 19:25
0

If you need to have multiple values per key you can store the values in a list using defaultdict.

>>> from collections import defaultdict
>>> data_dict = defaultdict(list)
>>> data_dict['key'].append('value')
>>> data_dict
defaultdict(<type 'list'>, {'key': ['value']})
>>> data_dict['key'].append('second_value')
>>> data_dict
defaultdict(<type 'list'>, {'key': ['value', 'second_value']})
root
  • 76,608
  • 25
  • 108
  • 120
0

Are you generating a Python file containing a giant dictionary? Something like:

print "{"
for lines in file:
    key, _, value = lines.partition(" ")
    print "    '%s': '%s',"
print "}"

If so, there's not much you can do to prevent this, as you cannot easily override the construction of the builtin dict.

Instead I'd suggest you validate the data while constructing the dictionary string. You could also generate different syntax:

dict(a = '1', a = '2')

..which will generate a SyntaxError if the key is duplicated. However, these are not exactly equivalent, as dictionary keys are a lot more flexible than keyword-args (e.g {123: '...'} is valid, butdict(123 = '...')` is an error)

You could generate a function call like:

uniq_dict([('a', '...'), ('a', '...')])

Then include the function definition:

def uniq_dict(values):
    thedict = {}

    for k, v in values:
        if k in thedict:
            raise ValueError("Duplicate key %s" % k)
        thedict[k] = v

     return thedict
dbr
  • 165,801
  • 69
  • 278
  • 343
0

You don't say or show exactly how you're generating the dictionary display you have where the duplicate keys are appearing. But that is where the problem lies.

Instead of using something like {'a':1, ..., 'a':2} to construct the dictionary, I suggest that you use this form: dict([['a', 1], ..., ['a', 2]]) which will create one from a supplied list of [key, value] pairs. This approach will allow you to check the list of pairs for duplicates before passing it to dict() to do the actual construction of the dictionary.

Here's an example of one way to check the list of pairs for duplicates:

sample = [['a', 1], ['b', 2], ['c', 3], ['a', 2]]

def validate(pairs):
    # check for duplicate key names and raise an exception if any are found
    dups = []
    seen = set()
    for key_name,val in pairs:
        if key_name in seen:
            dups.append(key_name)
        else:
            seen.add(key_name)
    if dups:
        raise ValueError('Duplicate key names encountered: %r' % sorted(dups))
    else:
        return pairs

my_dict = dict(validate(sample))
martineau
  • 119,623
  • 25
  • 170
  • 301