Pythonic way to group a list using a dictionary that has lists as values

Question

I'm looking for a Pythonic way or more efficient way to solve this problem. I have a dictionary which has sets as values (duplicates allowed across keys). Given a list I must create a dictionary which maps each category to the element using the key from the master dictionary. I'll give an example to illustrate.

Master Dictionary

{
    "KeyA": ['Aron', 'Ranom Value', 'Abhishek'],
    "KeyB": ['Ball', 'Foo', 'Bar', 'Badge', 'Dog'],
    "KeyZ": ['Random Value', 'Foo', 'Bar']
}

Input

['Foo', 'Bar', 'Dog', 'Aron']

Output

{
    "KeyA": ['Aron'],
    "KeyB": ['Bar', 'Foo', 'Dog'],
    "KeyZ": ['Foo', 'Bar']
}

My Current Thoughts

Invert individual items in the sets as keys and then do a lookup.

{
     'Aron'         : ['KeyA'],
     'Foo'          : ['KeyB', 'KeyZ'],
     'Bar'          : ['KeyB', 'KeyZ'],
     'Random Value' : ['KeyA', 'KeyZ']
}

I'd initialise the inverted dictionary by going through every item in every set. Approx time to create such a dictionary is O(n). Look for an item in the list in the inverted dictionary so created. Say the value Bar. Create a new dictionary using the information 'Bar': ['KeyB', 'KeyZ']. Resultant dictionary would be {'KeyB': ['Bar'], 'KeyZ': ['Bar']}. For the next item I'd have to do some bookkeeping on the existing dictionary like key exists or not, if yes then append to the existing list and so on.

Use the in operator on the set mapped (check for membership) to every key

The master dictionary and input list are going to be pretty small most of the times. (less than 500 unique items in all the sets together). So I could check for membership in the set returned by each key and create a dictionary. This is obviously less efficient but works for most cases.

I have a few more manipulations that are similar to the example given above. I don't want to do manual bookkeeping for all of them because they are error prone and slower than inbuilt functions.

What I need?

Better approaches (faster algorithm)
Inbuilt functions in itertools because these are faster
3rd Party Library
Some esoteric comprehensions that a normal Python user wouldn't think of?

@StefanPochmann because duplicates are allowed but highly unlikely that they'll occur. — Abhirath Mahipal, Dec 27 '17 at 07:59
If duplicates are allowed in value, you cannot use a set. A list is fine. — pylang, Dec 28 '17 at 23:28

score 5 · Accepted Answer · answered Dec 27 '17 at 07:00

How about you convert the list to a set before you start converting? Set-lookups are faster than linear search in lists.

input_set = set(input)

Once you have it, you can use regular dict-comprehension, in my opinion:

output = {key: [x for x in value if x in input_set] for key, value in master_dict.items()}

Result:

output == {'KeyB': ['Foo', 'Bar', 'Dog'], 'KeyA': ['Aron'], 'KeyZ': ['Foo', 'Bar']}

score 4 · Answer 2 · answered Dec 27 '17 at 07:12

4

one way is using intersection in python as follows:

x={
    "KeyA": ['Aron', 'Ranom Value', 'Abhishek'],
    "KeyB": ['Ball', 'Foo', 'Bar', 'Badge', 'Dog'],
    "KeyZ": ['Random Value', 'Foo', 'Bar']
   }
items = ['Foo', 'Bar', 'Dog', 'Aron']

{k:  set(items).intersection(set(v)) for k, v in x.items()}

answered Dec 27 '17 at 07:12

Amir Naimi

133
3
13

`set(items) & set(v)` – Stefan Pochmann Dec 27 '17 at 07:19
intersection is neat :) – Abhirath Mahipal Dec 27 '17 at 08:03

score 1 · Answer 3 · answered Dec 27 '17 at 07:25

How about with defaultdict and list comprehension.

from collections import defaultdict

result = defaultdict(list)

d = {
    "KeyA": ['Aron', 'Ranom Value', 'Abhishek'],
    "KeyB": ['Ball', 'Foo', 'Bar', 'Badge', 'Dog'],
    "KeyZ": ['Random Value', 'Foo', 'Bar']
   }
items = ['Foo', 'Bar', 'Dog', 'Aron']

[result[k].append(e) for k,v in d.items() for e in v if e in items]

print(result) # defaultdict(<type 'list'>, {'KeyB': ['Foo', 'Bar', 'Dog'], 'KeyA': ['Aron'], 'KeyZ': ['Foo', 'Bar']})

print(dict(result)) # {'KeyB': ['Foo', 'Bar', 'Dog'], 'KeyA': ['Aron'], 'KeyZ': ['Foo', 'Bar']}

RoadRunner · Answer 4 · 2017-12-28T02:27:09.383

1

Another Posssible Approach:

One way to possibly speed up the search time for checking if a value exists in input_set is to use binary search, which is O(logn).

Here is some example code which also uses the convienient collections.defaultdict:

from collections import defaultdict

master = {
          "KeyA": ['Aron', 'Ranom Value', 'Abhishek'],
          "KeyB": ['Ball', 'Foo', 'Bar', 'Badge', 'Dog'],
          "KeyZ": ['Random Value', 'Foo', 'Bar']
         }    

input_set = ['Foo', 'Bar', 'Dog', 'Aron']

sorted_list = sorted(input_set)

d = defaultdict(list)
for key, value in master.items():
    for v in value:
        if binary_search(sorted_list, v):
            d[key].append(v)

print(d)

Which Outputs:

defaultdict(<class 'list'>, {'KeyA': ['Aron'], 'KeyB': ['Foo', 'Bar', 'Dog'], 'KeyZ': ['Foo', 'Bar']})

Where binary_search() is defined below:

def binary_search(item_list,item):
    first = 0
    last = len(item_list)-1

    while first <= last:
        mid = (first + last)//2
        if item_list[mid] == item :
            return True
        elif item < item_list[mid]:
            last = mid - 1
        else:
            first = mid + 1 
    return False

The above code does seem like reinventing the wheel. You can have a look at the bisect module which provides some ways to call binary search without having to write your own function.

Note: In order to use binary search, you also need to sort the values before hand, which is O(nlogn). I'm not entirely sure how much impact this will have, you'd have to run some tests with another approach to see the difference.

Additionally, as @SuperSaiyan posted, converting input_set to a set is the most efficient approach, because set lookup O(1) in the best case, and O(n) in the worst case(rare).

edited Dec 28 '17 at 02:27

answered Dec 27 '17 at 08:29

RoadRunner

25,803
6
42
75

Thanks for your detailed answer :) – Abhirath Mahipal Dec 27 '17 at 09:53
It's a nice approach which is efficient with a little book keeping. My title has led people to think of a Pythonic way so I'll have to choose the best answer accordingly. Sorry about that :) – Abhirath Mahipal Dec 27 '17 at 10:04
1

@AbhirathMahipal No worries, I tried my best to give a good approach. Give the answer that you like the most a green tick. I thought this was an interesting problem, so I answered just for the fun of it. – RoadRunner Dec 27 '17 at 10:06
1

Will have a look at the bisect module. I am against using sets as it can contain duplicates but it's very unlikely to contain duplicates. – Abhirath Mahipal Dec 27 '17 at 10:18
1

Yeah, seems like an interesting problem you have here. Good luck with it, all these answers should provide you some good options for solving your problem. – RoadRunner Dec 27 '17 at 10:37
Come on... you can't ever seriously write `binary_search(sorted(input_set)`. Sorting for every binary search is completely self-defeating. You should sort only once, during setup. – Stefan Pochmann Dec 27 '17 at 15:10
Also, you definitely need to benchmark if you let a Python binary search compete against `in mylist` if the list is small, because likely the binary search is **slower** despite being logarithmic. See how even after pre-sorting, your binary search is 3.4 times *slower* than linear search with the given example data: https://ideone.com/FbZeEU (check the results on the bottom of the page). And even with the list enlarged to length 100, it is *still* significantly slower: https://ideone.com/rlGi86 – Stefan Pochmann Dec 27 '17 at 15:38
@StefanPochmann Yeah that was indeed silly, don't know why I did that. It also turns out that using binary search is indeed slower, which I didn't expect, even with the optimized `bisect` module. Also why be so harsh and downvote my answer? Seems a bit rude. – RoadRunner Dec 27 '17 at 16:25
Yeah, the `bisect` module isn't that much better. It's Python code, so it's Python speed. And it's for general sequences, so it accesses the list through the list object's general public interface. Asking a list whether it contains something is much lower level, it's C code and the list can access its elements more directly. That said, you *could* make your binary search faster and then it would start beating linear search earlier. – Stefan Pochmann Dec 27 '17 at 16:36
(continuing...) But my point is that this really really needs to be benchmarked with realistic data and discussed, at least when the list length is only up to hundreds. Simply stating that binary search speeds this up is misleading and potentially *harmful*, and that's why I downvoted. – Stefan Pochmann Dec 27 '17 at 16:40
Fair enough, I guess this answer was just full of mistakes. How would you optimize binary search here to be faster than `in`? I'm quite curious about this. – RoadRunner Dec 27 '17 at 16:45
Good, I see you added "possibly", that's enough for me to retract the downvote. Btw, when you "dismiss" the idea to convert `input_set` to a `set`, you make it sound like that might make it wrong. Do you think that? It wouldn't. – Stefan Pochmann Dec 27 '17 at 16:55
(continuing...) How to optimize binary search? Mostly by getting rid of the `==` test/branch. Until the range is very small, the chance for `==` to be true is small, so it's mostly wasting time. See [the bisect code](https://github.com/python/cpython/blob/32518b439b9590cce0ef0639e558dc1ce2e152bb/Lib/bisect.py#L82-L85]), which *is* faster. Also, you could embed the binary search code right into the main algorithm instead of putting it into a function that needs to be called and passed arguments. – Stefan Pochmann Dec 27 '17 at 16:56

score 1 · Answer 5 · answered Dec 28 '17 at 23:56

The OP proposed a reverse dictionary. It is arguably still pythonic, so here is how one can be implemented.

Given

import collections as ct


master_dict = {
    "KeyA": ['Aron', 'Random Value', 'Abhishek'],
    "KeyB": ['Ball', 'Foo', 'Bar', 'Badge', 'Dog'],
    "KeyZ": ['Random Value', 'Foo', 'Bar']
}

input_list = ['Foo', 'Bar', 'Dog', 'Aron']

Code

We use a collections.defaultdict to ease the creation of list values.

reverse_dict = ct.defaultdict(list)
for k, v in master_dict.items():
    for item in v:
        reverse_dict[item].append(k)
reverse_dict

Output

defaultdict(list,
            {'Abhishek': ['KeyA'],
             'Aron': ['KeyA'],
             'Badge': ['KeyB'],
             'Ball': ['KeyB'],
             'Bar': ['KeyB', 'KeyZ'],
             'Dog': ['KeyB'],
             'Foo': ['KeyB', 'KeyZ'],
             'Random Value': ['KeyA', 'KeyZ']})

Now that the inputs can be searched by key, lookups are faster than searching each list of strings. We build the final dictionary from an input list of lookup values.

final_dict = ct.defaultdict(list)
for v in input_list:
    for k in reverse_dict[v]:
        final_dict[k].append(v)

final_dict

Output

defaultdict(list,
            {'KeyA': ['Aron'],
             'KeyB': ['Foo', 'Bar', 'Dog'],
             'KeyZ': ['Foo', 'Bar']})

@SuperSaiyan proposed rebuilding lists for each key of the master dictionary by searching the set of the input list. This is a brilliant and superior approach for this particular application.

Pythonic way to group a list using a dictionary that has lists as values

My Current Thoughts

5 Answers5