This should be faster than comparing each element to each other as in your current code.
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data']}
compared_values = set()
referencekeys = {}
myresults = {}
comparator = lambda x : ''.join(sorted(x))
for key, value in mydata.items():
compvalue = comparator(value)
if not set([compvalue]).issubset(compared_values):
compared_values.update([compvalue])
referencekeys[compvalue] = key
myresults[key] = value
else:
if len(key) > len(referencekeys[compvalue]):
myresults[key] = myresults.pop(referencekeys[compvalue])
referencekeys[compvalue] = key
print(myresults)
Here i define a comparator sorting the strings in the list values and joining them. Not sure if it is more efficient than yours which use Counters.
I loop over the dictionary once and store the strings generated by the comparator in a set()
. Each iteration of the loop I check if the new comparator string is in the set. If not, I add it to the set for future reference and add the key - value pair to the final result dictionary. Otherwise I check the key lengths and change the key of the dict as shown here if the new key is longer. I also need another dictionary in which I switch the key - compvalue (compvalue are the keys and key are the values) in order to keep track of which is the key of each compared value.
Should be faster (I did not check the time) because I have a single loop. The equivalent of your second loop is set([compvalue]).issubset(compared_values)
and set
is more efficient than a for
loop for this kind of jobs.
Have a try and see if it helps.
EDIT
Another similar idea which does not use set
just came to my mind.
referencekeys = {}
myresults = {}
comparator = lambda x : ''.join(sorted(x))
for key, value in mydata.items():
compvalue = comparator(value)
try:
if len(key) > len(referencekeys[compvalue]):
myresults[key] = myresults.pop(referencekeys[compvalue])
referencekeys[compvalue] = key
except KeyError:
referencekeys[compvalue] = key
myresults[key] = value
print(myresults)
Here I just try the if
statement. If referencekeys[compvalue]
throws a KeyError
means that the code has not found yet a similar value. Otherwise check for the key length.
Again I did not check the execution times, so I am not sure which is more efficient. But the result is the same.
EDIT 2
Following comment request, to keep empty lists as they are is enough to wrap the body of the loop in an if
statement (here I use the first code, but the same idea can be implemented for the second).
for key, value in mydata.items():
if len(value) > 0:
compvalue = comparator(value)
if not set([compvalue]).issubset(compared_values):
compared_values.update([compvalue])
referencekeys[compvalue] = key
myresults[key] = value
else:
if len(key) > len(referencekeys[compvalue]):
myresults[key] = myresults.pop(referencekeys[compvalue])
referencekeys[compvalue] = key
else:
myresults[key] = value
No need to store the key in referencekeys
if len(value)
== 0. If the original data mydata
is a single dictionary, keys are unique. So it's guaranteed that you are not overwriting anything.
For example, if you have mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data'], 'data bis':[], 'neural link':[]}
you will get: myresults = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data bis': [], 'neural link': []}