
I am trying to build a table from a dataframe in Python that shows the number of words that each pair of categories has in common. To do this, I first built a defaultdict that has each category as the key and the list of words belonging to that category as the value.

Now, for each combination of two categories, I need to count the words they have in common, so that the final result is a table such as:

   A  B  C
A 10  2  1
B  2  5  3
C  1  3  3

The sample data that I am working with is:

Cat Item
A dog
A cat
A bear
A fish
A monkey
A tiger
A lion
A rabbit
A horse
A turtle
B dog
B cat
B flower
B plant
B bush
C dog
C flower
C plant

The working code that I am using is:

import pandas as pd
import numpy as np
from collections import defaultdict


inFile = r'\path\to\infile.csv'  # raw string, so that \t is not read as a tab

data = pd.read_csv(inFile, sep='\t')
dicts = defaultdict(list)

for i, j in zip(data['Cat'],data['Item']):
    dicts[i].append(j)


for k,v in dicts.iteritems():
    set1 = set(v)
    set2 = set(v)
    for k in set1.intersection(set2):
        print k,v

After running the above, the resulting defaultdict (before the intersection step) is the following:

{'A':['dog','cat','bear','fish','monkey','tiger','lion','rabbit','horse','turtle'],'B':['dog','cat','flower','plant','bush'],'C':['dog','flower','plant']}

While researching this problem, I came across a solution that is a step in the right direction: it counts and groups values according to keys in multiple dicts. However, it does not take into account the values shared between each combination of keys of the dict.

I have also looked at some solutions for finding matching keys or values, but the majority of them only deal with two dictionaries and not with multiple dictionaries.

Thus, I am still stuck on how to count the common elements between each combination of keys within MULTIPLE dicts.

owwoow14

1 Answer


I have built the required dictionary below; you can format its data into a table. Use the & operator for set intersection, which is exactly what you need:

>>> dicts = {'A':['dog','cat','bear','fish','monkey','tiger','lion','rabbit','horse','turtle'],'B':['dog','cat','flower','plant','bush'],'C':['dog','flower','plant']}
>>> dicts.items()
[('A', ['dog', 'cat', 'bear', 'fish', 'monkey', 'tiger', 'lion', 'rabbit', 'horse', 'turtle']), ('C', ['dog', 'flower', 'plant']), ('B', ['dog', 'cat', 'flower', 'plant', 'bush'])]
>>> items = sorted(dicts.items())
>>> res = {}
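>>> for i in range(len(items)):
...     for j in range(i, len(items)):
...         # number of words shared by the two categories
...         res[(items[i][0], items[j][0])] = len(set(items[i][1]) & set(items[j][1]))
...         # the count is symmetric, so mirror it for the reversed pair
...         res[(items[j][0], items[i][0])] = res[(items[i][0], items[j][0])]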
...
>>> res
{('B', 'C'): 3, ('A', 'A'): 10, ('B', 'B'): 5, ('B', 'A'): 2, ('C', 'A'): 1, ('C', 'B'): 3, ('C', 'C'): 3, ('A', 'B'): 2, ('A', 'C'): 1}
>>>
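The `res` dictionary above can be turned into the table the question asks for. The following is a minimal sketch, assuming Python 3 and pandas; the names `sets` and `table` are illustrative and not part of the session above:

```python
from itertools import product

import pandas as pd

# Same mapping as in the session above.
dicts = {
    'A': ['dog', 'cat', 'bear', 'fish', 'monkey', 'tiger', 'lion',
          'rabbit', 'horse', 'turtle'],
    'B': ['dog', 'cat', 'flower', 'plant', 'bush'],
    'C': ['dog', 'flower', 'plant'],
}

# Convert each list to a set once, so a word repeated within a
# category is only counted once.
sets = {cat: set(words) for cat, words in dicts.items()}

# Pairwise intersection sizes, keyed by (row category, column category).
res = {(a, b): len(sets[a] & sets[b])
       for a, b in product(sorted(sets), repeat=2)}

# A Series built from tuple keys gets a two-level index; unstack()
# pivots the second level into columns, giving the square table.
table = pd.Series(res).unstack()
print(table)
```

Single cells can then be read with `table.loc['B', 'C']`.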
Tushar Aggarwal
  • Thank you for your solution. It works, yet upon closer inspection I noticed an error that I cannot explain from your code. When using your code on a larger dataset with 22 keys and their values, the dictionary `res` returns a result where {('A','B'):x} != {('B','A'):x}. Any ideas why? – owwoow14 Sep 15 '17 at 13:27
  • Can you share the dataset and result? So that I can re-create the error. – Tushar Aggarwal Sep 15 '17 at 13:34
  • It was correct. It was a later manipulation of my `dataframe` that was incorrect. Answer accepted. – owwoow14 Sep 15 '17 at 15:47
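On the symmetry question raised in the comments: one way to rule out the counting step is to compute the table so that it is symmetric by construction. The following is a hedged sketch, assuming Python 3 and pandas, that uses `pd.crosstab` and a matrix product instead of the nested loop. The names `membership`, `common`, and `is_symmetric` are hypothetical, and the sample data is built inline rather than read from the question's file:

```python
import pandas as pd

# Sample data from the question, built inline instead of read from a file.
data = pd.DataFrame({
    'Cat': ['A'] * 10 + ['B'] * 5 + ['C'] * 3,
    'Item': ['dog', 'cat', 'bear', 'fish', 'monkey', 'tiger', 'lion',
             'rabbit', 'horse', 'turtle',
             'dog', 'cat', 'flower', 'plant', 'bush',
             'dog', 'flower', 'plant'],
})

# 0/1 membership matrix: one row per category, one column per item.
# Comparing with 0 collapses repeated (Cat, Item) rows to a single 1.
membership = (pd.crosstab(data['Cat'], data['Item']) > 0).astype(int)

# Entry (i, j) of the product with the transpose counts the items
# shared by categories i and j.
common = membership.dot(membership.T)

# A matrix multiplied by its own transpose is always symmetric.
is_symmetric = (common.values == common.values.T).all()
print(common)
print(is_symmetric)
```

If a table computed this way matches `res` but a later result does not, the discrepancy comes from a subsequent manipulation, as turned out to be the case here.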