Python remove duplicate values of one key in dict

Question

I have a dictionary like this:

Files:
{'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
'key2': ['f', 'f', 'f', 'f', 'f'], 
'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

I want to delete all the duplicate values in 'key_file', along with the corresponding values in the other keys ('key1' and 'key2').

Desired dictionary:

Files:
{'key1': ['path1', 'path2'], 
'key2': ['f', 'f'], 
'key_file': ['file1', 'file2']}

I couldn't figure out a solution which preserved the order and deleted every duplicate item and their values in the other keys.

Thanks a lot.

EDIT:

'key2': ['f', 'f', 'f', 'f', 'f']

becomes

'key2': ['f', 'f'],

because there are two distinct files.

I don't want to delete every duplicate in every key. 'path1' is related to 'file1' and 'path2' is related to 'file2' as is the 'f' in key2 for both cases. Actually in reality there are several keys more, but this is my minimal example. That is my problem. I have found several solutions to delete every duplicate.

EDIT2:

Maybe I was a bit confusing.

Every key's list has the same length, because together they describe a filename (in key_file), the corresponding path (in key1) and some other descriptive strings (in key2, etc.). The same file can be stored in different locations (paths), but I know that it is the same file if the filename is exactly the same.

Basically, what I am looking for is a function which detects the second value of key_file (file1) as a duplicate of the first value (file1) and deletes the second value from every key. The same goes for values number 4 (file1) and 5 (file2). The resulting dictionary would then look like the one I mentioned.

I hope this explains it better.

To remove the duplicates see this question: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order Beyond that, it's simply a loop through the items of the dict :) — Wolph, Jan 15 '15 at 15:16
Is it possible that there is a triplet of entries with same `key_file` but different `key1` or `key2`? — fredtantini, Jan 15 '15 at 15:58
Please reformulate the question to reflect what you actually want. It seems like you need every value in the dictionary to be a list containing two strings. It's not clear which duplicates should be discarded. What if your input was this: ```{'key1':['a'], 'key2':['a','b','c']}``` — Håken Lid, Jan 15 '15 at 16:03
@fredtantini: Yes, the same file in key_file can be stored in different paths in key1. For this case I only need the first instance of the name in key_file and their according path in key1 — Keynaan, Jan 16 '15 at 07:52
@HåkenLid: In my case your suggested input is not possible as every value in key_file has a path in key1 — Keynaan, Jan 16 '15 at 07:53

fredtantini · Accepted Answer · 2015-01-15T16:11:03.207

A naive approach: iterate over the values of 'key_file' and, for each value not seen yet, append the entries at that index from every key to a new dict:

>>> newFiles={'key1': [], 'key2':[], 'key_file':[]}
>>> for i,j in enumerate(Files['key_file']):
...   if j not in newFiles['key_file']:
...      for key in newFiles.keys():
...         newFiles[key].append(Files[key][i])
...
>>> newFiles
{'key2': ['f', 'f'], 'key1': ['path1', 'path2'], 'key_file': ['file1', 'file2']}
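
For reuse, the same loop can be wrapped in a function. The following is a minimal sketch for Python 3 and is not part of the original answer; the name `dedupe_by_key` and its parameters are made up for illustration. It tracks the filenames already seen in a set, so the membership test does not rescan the output list, and it assumes every list in the dictionary has the same length.

```python
def dedupe_by_key(columns, ref_key):
    """Keep only the rows whose value under ref_key has not been seen before.

    columns: dict mapping each key to a list (all lists assumed equally long).
    ref_key: the key whose duplicates decide which rows are dropped.
    """
    seen = set()
    result = {key: [] for key in columns}
    for i, value in enumerate(columns[ref_key]):
        if value in seen:
            continue  # duplicate: skip this index in every list
        seen.add(value)
        for key in columns:
            result[key].append(columns[key][i])
    return result


Files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

newFiles = dedupe_by_key(Files, 'key_file')
print(newFiles)
# {'key1': ['path1', 'path2'], 'key2': ['f', 'f'], 'key_file': ['file1', 'file2']}
```

Because only the first occurrence of each filename is kept, a later entry with the same filename but a different path is dropped, which matches what the question's comments ask for.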

with OrderedDict:

>>> from collections import OrderedDict
>>> newFiles={'key1': [], 'key2':[], 'key_file':[]}
>>> for j in OrderedDict.fromkeys(Files['key_file']):
...   i = Files['key_file'].index(j)
...   if j not in newFiles['key_file']:
...     for key in newFiles.keys():
...       newFiles[key].append(Files[key][i])
...
>>> newFiles
{'key2': ['f', 'f'], 'key1': ['path1', 'path2'], 'key_file': ['file1', 'file2']}
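
The index-based idea can also be written without calling `.index()` for every filename. This is a sketch for Python 3.7+ (where plain dicts keep insertion order), not the answer's own code; `first_index` and `keep` are illustrative names. It records the position of the first occurrence of each filename in one pass and then picks those positions from every list.

```python
Files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

# Map each filename to the index of its first occurrence.
# setdefault stores the index only the first time a name is seen.
first_index = {}
for i, name in enumerate(Files['key_file']):
    first_index.setdefault(name, i)

# Indices to keep, in order of first appearance (relies on dict ordering).
keep = list(first_index.values())

newFiles = {key: [values[i] for i in keep] for key, values in Files.items()}
print(newFiles)
# {'key1': ['path1', 'path2'], 'key2': ['f', 'f'], 'key_file': ['file1', 'file2']}
```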

Note: if a "file" in key_file always has the same key1 and key2, there are better ways. For instance, using zip:

>>> z=zip(*Files.values())
>>> z
[('f', 'path1', 'file1'), ('f', 'path1', 'file1'), ('f', 'path2', 'file2'), ('f', 'path1', 'file1'), ('f', 'path2', 'file2')]
>>> OrderedDict.fromkeys(z)
OrderedDict([(('f', 'path1', 'file1'), None), (('f', 'path2', 'file2'), None)])
>>> list(OrderedDict.fromkeys(z))
[('f', 'path1', 'file1'), ('f', 'path2', 'file2')]
>>> zip(*OrderedDict.fromkeys(z))
[('f', 'f'), ('path1', 'path2'), ('file1', 'file2')]
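
The transcript above stops at the transposed tuples and depends on the order in which `Files.values()` happens to come out. A possible Python 3.7+ sketch that pins the column order and rebuilds the dictionary is shown below; `keys`, `rows` and `unique_rows` are illustrative names. Note that it deduplicates whole rows, so it only gives the desired result under the assumption stated in the note, namely that a filename always comes with the same path and other values.

```python
Files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

keys = list(Files)                        # fix the column order once
rows = zip(*(Files[k] for k in keys))     # one tuple per file entry
unique_rows = list(dict.fromkeys(rows))   # drop repeated rows, keep first-seen order

# Transpose back and pair each column with its key.
newFiles = {k: list(col) for k, col in zip(keys, zip(*unique_rows))}
print(newFiles)
# {'key1': ['path1', 'path2'], 'key2': ['f', 'f'], 'key_file': ['file1', 'file2']}
```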

Thanks a lot. That looks like the solution I was searching for. — Keynaan, Jan 15 '15 at 15:45
What if there is a triplet of entries with same `key_file` but different `key1` or `key2`? — tobias_k, Jan 15 '15 at 15:54
@tobias_k haven't thought about this case. Not sure it can happen though. I have asked OP for clarification. — fredtantini, Jan 15 '15 at 16:05
@tobias_k: Justified question, but in my (probably very special) case the different other keys are irrelevant. — Keynaan, Jan 16 '15 at 08:09

score 1 · Answer 2 · answered Jan 15 '15 at 15:17

OrderedDict is the best option, as it preserves order.

Alternatively, you can convert each list to a set and then back to a list, although a set does not keep the original order.

Example

for i in d:
    d[i] = list(set(d[i]))

Bhargav Rao

Mazdak · Answer 3 · 2015-01-15T15:45:50.393

You can use collections.OrderedDict to keep your dictionary in order and set to remove the duplicates:

>>> d={'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
... 'key2': ['f', 'f', 'f', 'f', 'f'], 
... 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}
>>> from collections import OrderedDict
>>> OrderedDict(sorted([(i,list(set(j))) for i,j in d.items()], key=lambda t: t[0]))
OrderedDict([('key1', ['path2', 'path1']), ('key2', ['f']), ('key_file', ['file2', 'file1'])])

You need to apply set to the values to remove duplicates, then sort the items by key, and finally use OrderedDict to keep the dictionary in sorted order.

Edit: if you want all values to have the same length as the longest value, use the following:

>>> s=sorted([(i,list(set(j))) for i,j in d.items()], key=lambda t: t[0])
>>> M=max(map(len,[i[1] for i in s]))
>>> f_s=[(i,j) if len(j)==M else (i,[j[0] for t in range(M)]) for i,j in s]
>>> f_s
[('key1', ['path2', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file2', 'file1'])]
>>> OrderedDict(f_s)
OrderedDict([('key1', ['path2', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file2', 'file1'])])

But if you just want the first 2 elements of each value, you can use slicing:

>>> OrderedDict(sorted([(i,j[:2]) for i,j in d.items()],key=lambda x: x[0]))
OrderedDict([('key1', ['path1', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file1', 'file1'])])

Thank you for the fast answer. I edited my question above. I forgot to mention, that I want to preserve the length of each key. — Keynaan, Jan 15 '15 at 15:37
@Keynaan You're welcome. So you mean that you want all the values to have the same length as the longest value? — Mazdak, Jan 15 '15 at 15:39
In this case, I want 2 values in every key. @fredtantini had the answer, but thank you as well. — Keynaan, Jan 15 '15 at 15:44

score 0 · Answer 4 · answered Jan 15 '15 at 15:53

As I understand the question, it seems that corresponding values in the different lists in the dictionary belong together, while values within the same list are unrelated to each other. In this case, I'd suggest using a different data structure. Instead of having a dictionary with three lists of items, you can make one list holding triplets.

>>> files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
...          'key2': ['f', 'f', 'f', 'f', 'f'], 
...          'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}
>>> files2 = set(zip(files["key1"], files["key2"], files["key_file"]))
>>> print files2
set([('path2', 'f', 'file2'), ('path1', 'f', 'file1')])

Or if you want to make it more dictionary-like, you could do this, afterwards:

>>> files3 = [{"key1": k1, "key2": k2, "key_file": kf} for k1, k2, kf in files2]
>>> print files3
[{'key2': 'f', 'key1': 'path2', 'key_file': 'file2'}, 
 {'key2': 'f', 'key1': 'path1', 'key_file': 'file1'}]

Note that the order of the triplets in the top-level list may be different, but items that belong together are still together in the contained tuples or dictionaries.
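
If the order matters, as the question asks, the set can be swapped for a loop that remembers which filenames it has already seen. The following is only a sketch under that assumption, written for Python 3; `seen` and `records` are illustrative names. It keeps the first triplet for each filename and builds the list of small dictionaries directly.

```python
files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

seen = set()     # filenames already emitted
records = []     # one dict per distinct file, in original order
for path, flag, name in zip(files['key1'], files['key2'], files['key_file']):
    if name in seen:
        continue  # same filename seen earlier: drop the whole triplet
    seen.add(name)
    records.append({'key1': path, 'key2': flag, 'key_file': name})

print(records)
# [{'key1': 'path1', 'key2': 'f', 'key_file': 'file1'},
#  {'key1': 'path2', 'key2': 'f', 'key_file': 'file2'}]
```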

score 0 · Answer 5 · answered Jan 15 '15 at 16:33

Here is my implementation:

In [1]: mydict = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 'key2': ['f', 'f', 'f', 'f', 'f'], 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

In [2]: { k: sorted(list(set(v))) for (k,v) in mydict.iteritems() }
Out[2]: {'key1': ['path1', 'path2'], 'key2': ['f'], 'key_file': ['file1', 'file2']}

Test

In [6]: mydict
Out[6]:
{'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
 'key2': ['f', 'f', 'f', 'f', 'f'],
 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

In [7]: uniq = { k: sorted(list(set(v))) for (k,v) in mydict.iteritems() }

In [8]: for key in uniq:
   ...:     print 'KEY    :', key
   ...:     print 'VALUE  :', uniq[key]
   ...:     print '-------------------'
   ...: 
KEY    : key2
VALUE  : ['f']
-------------------
KEY    : key1
VALUE  : ['path1', 'path2']
-------------------
KEY    : key_file
VALUE  : ['file1', 'file2']
-------------------
