4

I am trying to find an efficient solution to the following problem:

I have a list of dictionaries, each dictionary having the same set of keys as the others. The associated values may be equal across dictionaries. I am trying to find the minimum number of keys, with their associated values, that makes each dictionary unique.

For example, for a list consisting of three dictionaries:

list = [a, b, c]

where 

a = {"key1": "alpha", "key2": "beta", "key3": "gamma"}
b = {"key1": "alpha", "key2": "beta", "key3": "eta"}
c = {"key1": "alpha", "key2": "zeta", "key3": "eta"}

All three dictionaries have the same value for key1, so this key can be eliminated: its inclusion does nothing to make any dictionary unique. On the other hand, key2 and key3 must both be included, since together they make each dictionary unique.

a = {"key2": "beta", "key3": "gamma"}
b = {"key2": "beta", "key3": "eta"}
c = {"key2": "zeta", "key3": "eta"}

I assume that I will loop through the list of dictionaries, so using, for example, collections.Counter within an iteration is possible. The number of dictionaries in the list is variable, as is the number of keys. I would like to iterate through the list as few times as possible (for example once, while updating one or more Counters?). I am fairly sure that there is an appropriate algorithm for this problem but could not find it with my search keywords.

Edit: Each final dict must have the same keys as the others. Therefore keeping a different set of keys for each individual dict is not an option.

Tony.H
  • Does each `dict` have the same `keys`? – Raj Dec 20 '19 at 09:25
  • Why isn't `a` just `gamma`, and `b` just `beta` and `c` just `zeta`? – blueteeth Dec 20 '19 at 09:27
  • @Raj Yes each dict has the same keys – Tony.H Dec 20 '19 at 09:32
  • @blueteeth I will edit my question to highlight this aspect but each final dict must have the same keys as the others. Therefore keeping key3 for a and key2 for b and c is not an option. – Tony.H Dec 20 '19 at 09:33
  • 1
    So is the question really how to remove keys that are the same in all variables? – blueteeth Dec 20 '19 at 09:36
  • @blueteeth the final operation of one approach could effectively be to remove the redundant keys but you would first have to find these redundant ones – Tony.H Dec 20 '19 at 09:39
  • @Tony.H have you created an inefficient algorithm yet? If so, could you post what you've tried? – dx_over_dt Dec 20 '19 at 09:59
  • @dx_over_dt I started with an implementation where I would remove a key in sequence for each dictionary basically analysing the problem for each dictionary individually. However, this felt so inefficient that I stopped half way through (writing this implementation also just felt wrong haha). I then started looking into set theories/algorithms before making this post. – Tony.H Dec 20 '19 at 10:16
  • Could you add the code for this algorithm? It's been a long time since I used python, and it would be easier for me to see what you've got already so that I can write an answer. – dx_over_dt Dec 20 '19 at 10:21
  • @Tony.H Is there any prior information available regarding the distribution of data in each `dict`? It would be helpful in developing some ad-hoc pruning. – Raj Dec 20 '19 at 10:21
  • @dx_over_dt yes I will rewrite and post it then but feel free to write to suggest something in any other language – Tony.H Dec 20 '19 at 10:24
  • @Raj could you please elaborate on your distribution aspect? Distribution of data sounds like stats to me – Tony.H Dec 20 '19 at 10:26
  • @Tony.H do all of the values in the dictionary have the same properties defined? Or do some properties only exist on some of the items, requiring that they not be included as part of the final unique key? Also, is it possible that some items are exact duplicates? – dx_over_dt Dec 20 '19 at 11:17
  • The values are unknown beforehand so some properties might exist only on some for one case but be shared in another case. Each item (dict) is unique so duplicates cannot happen – Tony.H Dec 20 '19 at 12:51
  • What's the expected result for `a = {k1: "a", k2: "b", k3: "c"}`; `b = {k1: "a", k2: "b", k3: "c"}`; `c = {k1: "a", k2: "b", k3: "c"}`? – גלעד ברקן Dec 20 '19 at 18:30

4 Answers

2

An exact solution is NP-hard, but for a decent approximation, you can try a variant of the ID3 algorithm for creating decision trees: https://en.wikipedia.org/wiki/ID3_algorithm

The difference in your case is that you have to pick the same attribute across all branches, so it would work like this:

  1. Start with one set of all dictionaries
  2. For each attribute, calculate the sum of the entropies of its values across all sets (the formula is in the linked article), and select the attribute with the highest total, i.e. the one that best splits the sets.
  3. Partition the sets according to the selected attribute, and discard all the sets that contain only one dictionary
  4. If there are still sets to partition, go back to (2)
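A sketch of how this greedy variant might look in Python (an illustration added here, not code from the ID3 article; the function names are mine, and it assumes every dictionary has the same keys with hashable values):

from collections import Counter, defaultdict
from math import log2

def entropy(values):
    # Shannon entropy of a multiset of values (the formula from the article)
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def greedy_distinguishing_keys(dicts):
    keys = set(dicts[0])
    chosen = set()
    # groups of indices whose dictionaries are still indistinguishable
    groups = [list(range(len(dicts)))]
    while groups and chosen != keys:
        # step 2: pick the attribute whose values are most diverse
        # (highest summed entropy) across the remaining groups
        best = max(keys - chosen,
                   key=lambda k: sum(entropy([dicts[i][k] for i in g])
                                     for g in groups))
        chosen.add(best)
        # step 3: partition each group by the chosen key and
        # discard any group that is down to a single dictionary
        new_groups = []
        for g in groups:
            parts = defaultdict(list)
            for i in g:
                parts[dicts[i][best]].append(i)
            new_groups.extend(p for p in parts.values() if len(p) > 1)
        groups = new_groups  # step 4: repeat while sets remain
    return chosen

a = {"key1": "alpha", "key2": "beta", "key3": "gamma"}
b = {"key1": "alpha", "key2": "beta", "key3": "eta"}
c = {"key1": "alpha", "key2": "zeta", "key3": "eta"}
print(greedy_distinguishing_keys([a, b, c]))  # {'key2', 'key3'}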
Matt Timmermans
1

This problem is NP-complete, by reduction to and from the set cover problem. Given an instance of your problem, we can construct a polynomial-sized instance of the set cover problem in polynomial time, and vice versa.

  • To reduce your problem to set cover, take the set of all unordered pairs like (a,b), (a,c), (b,c); and for each key, construct the set of pairs which that key distinguishes. A minimal set of keys distinguishing all pairs of the original dictionaries corresponds to a minimal cover by these sets (see the sketch after this list).

  • To reduce set cover to your problem, given a set { 1, 2, ..., n } and a collection of subsets, construct 2n dictionaries named a1, b1, a2, b2, ..., an, bn. For each subset, add a key such that this key has the value 1 in each dictionary bk where k is in the subset, and the value 0 in every other dictionary. Add one more key which has the value k in each dictionary ak and bk. A minimal set of keys distinguishing all pairs necessarily includes that last key, but the remaining keys correspond with a minimal choice of sets for the original set cover instance.
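
To make the first reduction concrete, here is a small helper (an illustration added here, not part of the original answer) that builds the set cover instance from a list of dicts:

from itertools import combinations

def to_set_cover(dicts):
    # the universe: all unordered pairs of dictionary indices
    universe = set(combinations(range(len(dicts)), 2))
    # for each key, the set of pairs that key distinguishes
    sets_by_key = {
        key: {(i, j) for (i, j) in universe if dicts[i][key] != dicts[j][key]}
        for key in dicts[0]
    }
    return universe, sets_by_key

A minimum collection of these sets that covers the universe picks out exactly a minimum distinguishing set of keys.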

Therefore, there is no known algorithm which solves your problem in polynomial time. It can be solved by a backtracking search, but you are unlikely to find an approach much more efficient than backtracking.

kaya3
1

I'm glad other answers confirmed what I suspected: this is an NP-complete problem, and there is currently no known way around trying, in the worst case, every possible subset of keys.

Here's my algorithm, which runs in O(n^2 * 2^k) time and O(nk^2 + 2^k) space, where n is the number of items in your list and k is the number of properties on each item.

As long as 2^k < n^2, this runs in roughly polynomial time.

def get_unique_key_values(objs):
    key = get_unique_key(objs)
    return [{ k: obj[k] for k in key } for obj in objs ]

def get_unique_key(objs):
    return get_unique_key_set(objs, { k for obj in objs for k in obj }, [])

def get_unique_key_set(objs, keys, tested_keys):
    if len(keys) == 0 or not all_unique(objs):
        # keys is either the empty set, or this subset of keys
        # does not guarantee uniqueness
        return False

    # the smallest number of keys required for a unique key
    best_key_set = set(keys)

    # delete each key one at a time and check if the list of 
    # items are still unique
    for del_key in keys:
        tmp_keys = set(keys)
        tmp_keys.remove(del_key)

        # if we've already tested this subset, skip it and all its children
        if tmp_keys in tested_keys:
            continue

        # keep track of subsets we've tested so we don't retest them--significant trimming
        tested_keys.append(tmp_keys)

        # generate a list of objects with only the current set of keys
        tmp_objs = [{ k: obj[k] for k in tmp_keys } for obj in objs]

        # continue to delete keys from the current subset until we find a subset
        # of size 1, or the current tmp_keys is optimal
        tmp_key_set = get_unique_key_set(tmp_objs, tmp_keys, tested_keys)
        if tmp_key_set is False:
            continue

        if len(tmp_key_set) < len(best_key_set):
            best_key_set = tmp_key_set

    return best_key_set

# O(n^2) algorithm for checking that every element in the list is unique 
def all_unique(objs):
    for i in range(len(objs) - 1):
        for j in range(i + 1, len(objs)):
            if objs[i] == objs[j]:
                return False

    return True

objects = [
    { 'a': 1, 'b': 2, 'c': 2 },
    { 'a': 1, 'b': 3, 'c': 2 },
    { 'a': 1, 'b': 3, 'c': 3 }
]

print(get_unique_key(objects))
# prints {'b', 'c'} (order may vary)

objects = [
    { 'a': 1, 'b': 2, 'c': 2 },
    { 'a': 2, 'b': 3, 'c': 2 },
    { 'a': 3, 'b': 3, 'c': 3 }
]

print(get_unique_key(objects))
# prints {'a'}

I made some assumptions when writing this script, such as that all objects have the same set of properties. If some properties only exist on some objects, you may have to alter the script some.

You could speed this up a bit by changing tested_keys to a set of frozensets (frozensets are already hashable, so no custom hash function is needed), making the membership test O(1) instead of a linear scan.

Creating a hash function for the object dictionaries could turn all_unique into an O(n) algorithm, reducing the total running time to O(n * 2^k). Ironically, while significantly reducing the running time, this increases the likelihood that the algorithm will behave exponentially, since 2^k < n is harder to satisfy than 2^k < n^2.

See this answer for how to create these hashes.
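
For illustration, a rough sketch of both optimizations (my addition, not code from this answer; it assumes all values in the dictionaries are hashable):

def all_unique(objs):
    # O(n) expected-time uniqueness check: represent each dict as a
    # hashable frozenset of its (key, value) pairs and look for collisions
    seen = set()
    for obj in objs:
        signature = frozenset(obj.items())
        if signature in seen:
            return False
        seen.add(signature)
    return True

# storing tested key subsets as frozensets in a set gives O(1) lookups
tested_keys = set()
subset = frozenset({'b', 'c'})
if subset not in tested_keys:
    tested_keys.add(subset)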

dx_over_dt
0

If each final dict must have the same keys as all other dicts, the only solution can be to remove keys that are the same in all dicts.

You could do this by looping through the keys of the first dict and collecting each key whose value is the same in every dict. Then, at the end, remove the collected keys from all the dicts.

def process_lists(lsts):
    first = lsts[0]
    to_remove = []
    # collect keys whose value is identical across every dict
    for key in first:
        if all(first[key] == o[key] for o in lsts[1:]):
            to_remove.append(key)
    # rebuild each dict without the redundant keys
    return [{k: v for k, v in lst.items() if k not in to_remove} for lst in lsts]

process_lists([a, b, c])
blueteeth
  • Not necessarily. Suppose `a` is composed of unique values, like an id, and `b` and `c` together compose unique values. `a` is sufficient even though neither `b` nor `c` are the same for all elements. – dx_over_dt Dec 20 '19 at 09:55
  • @dx_over_dt In that case is it possible to do anything better than iterating over all`2^n` subset of keys? – Raj Dec 20 '19 at 10:00
  • @Raj that's the only algorithm I can think of, with some trimming. First we'd remove every key that does not exist in every element. The rest would be a fairly nasty depth-first search where each depth removes an additional key. As long as the resultant keys form a dictionary of the same number of elements, keep removing keys. Keep track of the largest depth and its non-removed keys. After you've completed the search, the list of non-removed keys is the final answer, and you then iterate through the original dictionary pulling the key/value pairs into their own dictionary. – dx_over_dt Dec 20 '19 at 10:07
  • There's also a breadth-first approach where you recursively add keys until the resulting dictionary contains the same number of elements as the original dictionary. This requires more memory as you'd need to keep track of the intermediate dictionaries. Which approach is faster depends on the number of keys in the final answer. You *might* be able to run both algorithms in parallel, where you add the current key in the bfs approach and remove the current key from the dfs approach. Once one algorithm yields an answer, you're finished. – dx_over_dt Dec 20 '19 at 10:27
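
A compact sketch of the breadth-first approach described in the last comment (illustrative code, assuming every dict has the same keys and hashable values):

from itertools import combinations

def smallest_key_set_bfs(dicts):
    # try key subsets in order of increasing size; the first subset under
    # which all projected dicts are distinct is a minimum distinguishing set
    keys = list(dicts[0])
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            projected = [tuple(d[k] for k in subset) for d in dicts]
            if len(set(projected)) == len(dicts):
                return set(subset)
    return None  # exact duplicates exist; no key set distinguishes them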