1

I have a dictionary as follows.

{'data mining': ['data', 'text mining', 'artificial intelligence'],
 'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
 'data': [ 'text mining', 'artificial intelligence', 'data']}

I want to re-arrange the dictionary in the following way. i.e. remove entries that has similar values by considering the longest key.

{'data mining': ['data', 'text mining', 'artificial intelligence'],
 'neural networks': ['cnn', 'rnn', 'artificial intelligence']}

In other words, both data mining and data have similar values. Therefore, I remove one entry and make the longest word as the key of the new enrty. i.e. 'data mining': ['data', 'text mining', 'artificial intelligence'].

My current code is as follows.

import collections
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)
myresults = {}
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': [ 'text mining', 'artificial intelligence','data']}

    for key1, value1 in mydata.items():
        for key2, value2 in mydata.items():
            if compare(value1,value2):
                mykeys = [key1, key2]
                temp = {max((mykeys), key=len): value1}
                myresults.update(temp)

    print(myresults)

However, my real dictionary dataset has around 4 million of entries. So, I am wondering if there is an efficient way of doing this in python.

I am happy to provide more details if needed :)

Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
EmJ
  • 4,398
  • 9
  • 44
  • 105
  • 2
    How do you define "similar"? The same except for order? – juanpa.arrivillaga Jan 23 '19 at 01:02
  • @juanpa.arrivillaga see compare function in my code. I am using it to see if the values are equal. :) – EmJ Jan 23 '19 at 01:03
  • @juanpa.arrivillaga the list values are about 15 -20 elements maximum :) – EmJ Jan 23 '19 at 01:05
  • So values are lists. What if a list has the same elements of another list plus some extra elements? Such as `['one', 'two']` and `['one', 'two', 'three']`? Do you consider them similar (and remove which one) or not? – Valentino Jan 23 '19 at 01:37
  • 1
    @Valentino No that is not similar. In order to be similar the lists need to have identical elements :) – EmJ Jan 23 '19 at 01:48

3 Answers3

2

You can first sort the dictionary by length, so the longer keys are guaranteed to occur first.

from itertools import groupby

d = {
    "data mining": ["data", "text mining", "artificial intelligence"],
    "neural networks": ["cnn", "rnn", "artificial intelligence"],
    "data": ["text mining", "artificial intelligence", "data"],
}

result = dict(
    g
    for k, (g, *_) in groupby(
        sorted(d.items(), key=lambda x: len(x[0]), reverse=True),
        key=lambda x: sorted(x[1]),
    )
)

It's also only one line, which is always nice! :)

Printing result yields:

{'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
 'data mining': ['data', 'text mining', 'artificial intelligence']}
Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
iz_
  • 15,923
  • 3
  • 25
  • 40
  • thanks a lot for the answer and its impressive that you did all in one line :) However, I want the key to be `data mining` not `data`. i.e the key should be the longest string :) – EmJ Jan 23 '19 at 02:44
  • 1
    @Emi Whoops, forgot the `reverse=True`. It's fine now. – iz_ Jan 23 '19 at 02:45
1

This should be faster than comparing each element to each other as in your current code.

mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data']}

compared_values = set()
referencekeys = {}
myresults = {}

comparator = lambda x : ''.join(sorted(x))

for key, value in mydata.items():
    compvalue = comparator(value)
    if not set([compvalue]).issubset(compared_values):
        compared_values.update([compvalue])
        referencekeys[compvalue] = key
        myresults[key] = value
    else:
        if len(key) > len(referencekeys[compvalue]):
            myresults[key] = myresults.pop(referencekeys[compvalue])
            referencekeys[compvalue] = key

print(myresults)

Here i define a comparator sorting the strings in the list values and joining them. Not sure if it is more efficient than yours which use Counters.

I loop over the dictionary once and store the strings generated by the comparator in a set(). Each iteration of the loop I check if the new comparator string is in the set. If not, I add it to the set for future reference and add the key - value pair to the final result dictionary. Otherwise I check the key lengths and change the key of the dict as shown here if the new key is longer. I also need another dictionary in which I switch the key - compvalue (compvalue are the keys and key are the values) in order to keep track of which is the key of each compared value.

Should be faster (I did not check the time) because I have a single loop. The equivalent of your second loop is set([compvalue]).issubset(compared_values) and set is more efficient than a for loop for this kind of jobs.

Have a try and see if it helps.

EDIT

Another similar idea which does not use set just came to my mind.

referencekeys = {}
myresults = {}

comparator = lambda x : ''.join(sorted(x))

for key, value in mydata.items():
    compvalue = comparator(value)
    try:
        if len(key) > len(referencekeys[compvalue]):
            myresults[key] = myresults.pop(referencekeys[compvalue])
            referencekeys[compvalue] = key
    except KeyError:
        referencekeys[compvalue] = key
        myresults[key] = value

print(myresults)

Here I just try the if statement. If referencekeys[compvalue] throws a KeyError means that the code has not found yet a similar value. Otherwise check for the key length.

Again I did not check the execution times, so I am not sure which is more efficient. But the result is the same.

EDIT 2

Following comment request, to keep empty lists as they are is enough to wrap the body of the loop in an if statement (here I use the first code, but the same idea can be implemented for the second).

for key, value in mydata.items():
    if len(value) > 0:
        compvalue = comparator(value)
        if not set([compvalue]).issubset(compared_values):
            compared_values.update([compvalue])
            referencekeys[compvalue] = key
            myresults[key] = value
        else:
            if len(key) > len(referencekeys[compvalue]):
                myresults[key] = myresults.pop(referencekeys[compvalue])
                referencekeys[compvalue] = key
    else:
        myresults[key] = value

No need to store the key in referencekeys if len(value) == 0. If the original data mydata is a single dictionary, keys are unique. So it's guaranteed that you are not overwriting anything.

For example, if you have mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data'], 'data bis':[], 'neural link':[]} you will get: myresults = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data bis': [], 'neural link': []}

Valentino
  • 7,291
  • 6
  • 18
  • 34
  • I am using the first solution that you proposed. Just wondering how to adjust your code for dictionary entries with empty lists (e.g.,`{'deep learning': [], 'machine learning': [], 'text mining: []'}`. When the list is `empty` I do not want to do the mapping (map all the three keys in the example for the longest key). Instead I want the empty lists as it is (i.e. `{'deep learning': [], 'machine learning': [], 'text mining: []'}`). Please let me know if my explaination is not clear. I can give you more examples if needed – EmJ Jan 23 '19 at 10:42
  • 1
    I've edited the answer to show you this. Hope I understood the problem. – Valentino Jan 23 '19 at 12:22
1

Python built-in types to the rescue!

tmp = dict()
for topic, words in data.items():
    ww = frozenset(words)
    tmp[ww] = max(tmp.get(ww, topic), topic, key=len)
result = {topic: list(ww) for ww, topic in tmp.items()}
Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120