
I have two (or more) dictionaries, each extracted and processed from a source document.

Each dictionary has the format word : count.

Let us say, from document No. 1, this is the dictionary that I extract:

dic1 = {'hello' : 1, 'able' : 3, 'of' : 9, 'advance' : 2, 'occurred' : 4, 'range' : 1}

And, from document No. 2, this is the dictionary:

dic2 = {'of' : 6, 'sold' : 4, 'several' : 3, 'able' : 2, 'advance' : 1}

I want to combine the two dictionaries in two ways:

  1. If a word appears in both, add up its counts. This seems fairly doable, judging from this question.
  2. If a word appears in both, collect the numbers of the documents it occurs in. (I would also like a count of those documents, but that is just the length of the resulting list.)

For 1. a sample output would be:

 dictop1 = {'hello' : 1, 'able' : 5, 'of' : 15, 'advance' : 3, 'occurred' : 4, 'range' : 1, 'sold' : 4, 'several' : 3}

For 2. a sample output would be:

 dictop2 = {'hello' : [1], 'able' : [1,2], 'of' : [1,2], 'advance' : [1,2], 'occurred' : [1], 'range' : [1], 'sold' : [2], 'several' : [2]}

I will be iterating through thousands of such dictionaries, and doing the operations I mentioned above.

At the end, I require a dataframe of the following format:

Word | Count | DocsOccuredIn

How would I go about doing this?

One possible solution is to build the two dictionaries described above separately, create two dataframes from them, and merge those. In that case, how can I obtain the second dictionary? Or is there a better way to approach this problem?

OlorinIstari

3 Answers


(1) Use defaultdict to build a dictionary of lists, and use Counter to sum the counts:

import pandas as pd
from collections import defaultdict, Counter

dic_list = [dic1, dic2]

df_dict = {'Count':Counter(), 'DocsOccuredIn':defaultdict(list)}

for i, dic in enumerate(dic_list, 1):
    for key, val in dic.items():
        df_dict['Count'][key] += val
        df_dict['DocsOccuredIn'][key].append(i)

pd.DataFrame(df_dict).rename_axis('Word').reset_index()
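
As a quick sanity check, the accumulation loop above can be run on the two sample dictionaries and compared with the expected outputs given in the question. This is only a sketch: the names `count` and `docs` are introduced here for illustration, and pandas is left out so the check runs on the standard library alone.

```python
from collections import Counter, defaultdict

dic1 = {'hello': 1, 'able': 3, 'of': 9, 'advance': 2, 'occurred': 4, 'range': 1}
dic2 = {'of': 6, 'sold': 4, 'several': 3, 'able': 2, 'advance': 1}

count = Counter()          # word -> summed count
docs = defaultdict(list)   # word -> list of 1-based document numbers

for i, dic in enumerate([dic1, dic2], 1):
    for word, n in dic.items():
        count[word] += n
        docs[word].append(i)

# Expected outputs copied from the question.
dictop1 = {'hello': 1, 'able': 5, 'of': 15, 'advance': 3,
           'occurred': 4, 'range': 1, 'sold': 4, 'several': 3}
dictop2 = {'hello': [1], 'able': [1, 2], 'of': [1, 2], 'advance': [1, 2],
           'occurred': [1], 'range': [1], 'sold': [2], 'several': [2]}

print(dict(count) == dictop1)  # True
print(dict(docs) == dictop2)   # True
```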

(2) Use Pandas

dic_list = [dic1, dic2]

df = pd.DataFrame(dic_list).rename(lambda x:x+1)

df_dict = {'Count': df.sum().astype(int), 
           'DocsOccuredIn': df.notna().apply(lambda x:df.index[x].tolist())}

output = (pd.DataFrame(df_dict)
            .rename_axis('Word')
            .reset_index())
Mark Wang
import pandas as pd

dic1 = {'hello' : 1, 'able' : 3, 'of' : 9, 'advance' : 2, 'occurred' : 4, 'range' : 1}
dic2 = {'of' : 6, 'sold' : 4, 'several' : 3, 'able' : 2, 'advance' : 1}

out1, out2 = {}, {}
for k in dic1.keys() | dic2.keys():
    out1[k] = dic1.get(k, 0) + dic2.get(k, 0)
    out2.setdefault(k, []).extend( ([1] if k in dic1 else []) + ([2] if k in dic2 else []) )

df = pd.DataFrame({'Word': list(out1.keys()), 'Count': list(out1.values()), 'DocsOccuredIn': list(out2.values()) })

print(df)

Prints:

       Word  Count DocsOccuredIn
0   several      3           [2]
1      sold      4           [2]
2     hello      1           [1]
3   advance      3        [1, 2]
4      able      5        [1, 2]
5        of     15        [1, 2]
6  occurred      4           [1]
7     range      1           [1]
Andrej Kesely
  • Thanks! Works perfectly. Is ```|``` binary or? What exactly are you doing with that operation? Is there a loop way of doing this? I wish to extend it to a third dictionary. In the sense that, use this output, and combine it with a third dictionary – OlorinIstari May 24 '20 at 11:28
  • @ShrutheeshRamanIyer `|` returns all keys in `dict1` and `dict2` (just like in sets) [more here](https://docs.python.org/3.8/library/stdtypes.html#dictionary-view-objects). – Andrej Kesely May 24 '20 at 11:31
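
To illustrate the two points raised in the comments, here is a minimal sketch, not taken from the answer itself. It shows that `|` on dictionary key views behaves like set union, and one way the running results might be folded together with further dictionaries one at a time. The helper `merge_in` and the third dictionary `dic3` are hypothetical and made up for this example.

```python
dic1 = {'hello': 1, 'able': 3, 'of': 9, 'advance': 2, 'occurred': 4, 'range': 1}
dic2 = {'of': 6, 'sold': 4, 'several': 3, 'able': 2, 'advance': 1}
dic3 = {'of': 2, 'range': 5}  # hypothetical third document, for illustration only

# On key views, | is set union: every word that occurs in either dictionary.
all_keys = dic1.keys() | dic2.keys()
print(sorted(all_keys))


def merge_in(out1, out2, dic, doc_no):
    """Fold one more dictionary into the running results, in place.

    out1 maps word -> summed count, out2 maps word -> list of document numbers.
    """
    for word, n in dic.items():
        out1[word] = out1.get(word, 0) + n
        out2.setdefault(word, []).append(doc_no)


out1, out2 = {}, {}
for doc_no, dic in enumerate([dic1, dic2, dic3], start=1):
    merge_in(out1, out2, dic, doc_no)

print(out1['of'], out2['of'])        # 17 [1, 2, 3]
print(out1['range'], out2['range'])  # 6 [1, 3]
```

Because `merge_in` only updates the running dictionaries, each new document costs time proportional to its own size, which may matter when looping over thousands of dictionaries.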

Use:

import pandas as pd

d = pd.concat(map(pd.Series, [dic1, dic2]), axis=1) # here you can use any number of dictionaries as required

df = pd.DataFrame({
    'Word': d.index.values,
    'Count': d.sum(axis=1).astype(int).values,
    'DocsOccuredIn':  d.agg(lambda s: (s.index[~s.isna()] + 1).values, axis=1).values})

Intermediate Steps:

# d
           0    1
hello     1.0  NaN
able      3.0  2.0
of        9.0  6.0
advance   2.0  1.0
occurred  4.0  NaN
range     1.0  NaN
sold      NaN  4.0
several   NaN  3.0

# d.sum(axis=1).astype(int)
hello        1
able         5
of          15
advance      3
occurred     4
range        1
sold         4
several      3
dtype: int64


# d.agg(lambda s: (s.index[~s.isna()] + 1).values, axis=1)
hello          [1]
able        [1, 2]
of          [1, 2]
advance     [1, 2]
occurred       [1]
range          [1]
sold           [2]
several        [2]
dtype: object

Result:

# print(df)

       Word  Count DocsOccuredIn
0     hello      1           [1]
1      able      5        [1, 2]
2        of     15        [1, 2]
3   advance      3        [1, 2]
4  occurred      4           [1]
5     range      1           [1]
6      sold      4           [2]
7   several      3           [2]
Shubham Sharma