
I have to build three dictionaries from the same CSV file. The input file looks like this:

col1   col2   value
item1  a      value1
item1  b      value2
item1  c      value3
item2  a      value4
item2  c      value5
...

And I need these three dictionaries:

1.

dict1
item1: {a, b, c}
item2: {a, c}
...

2.

dict2
(item1, a): value1
(item1, b): value2
(item1, c): value3
(item2, a): value4
(item2, c): value5

(I write these keys as tuples rather than sets, since plain sets aren't hashable and can't be dictionary keys; frozensets would also work.)

I need sets as the values of the first dictionary because I will later have to compute intersections between those values, and a set seems the most suitable type for that.

My final dictionary, resulting from these intersections, will be something like:

3.

dict3
(item1,item2):value1+value4+value3+value5

It is probably easier to understand from the examples, but let me explain: dict3 considers the pairwise intersections between the value sets of dict1, which in my example is {a, c}, and then computes dict2.get((item1,a))+dict2.get((item2,a))+dict2.get((item1,c))+dict2.get((item2,c)) and assigns the result as the value for the pair (item1,item2). If item1 and item2 had yet another element in common, say d, the value for (item1,item2) would additionally include dict2.get((item1,d))+dict2.get((item2,d)). Please note that in the real dataset the entries of col1 and col2 are strings. This calculation is repeated for every pairwise intersection of the value sets in dict1.
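
Spelled out in plain Python, just to pin down the rule (this assumes dict1 and dict2 already exist as above, with tuple keys in dict2), what I want is roughly:

from itertools import combinations

dict3 = {}
for i1, i2 in combinations(sorted(dict1), 2):
    common = dict1[i1] & dict1[i2]  # pairwise intersection of the two value sets
    if common:
        dict3[(i1, i2)] = sum(dict2[(i1, k)] + dict2[(i2, k)] for k in common)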

What's the easiest way to get these dictionaries? I am more comfortable with pandas, so I'd prefer solutions using a DataFrame, but I can accept anything that reads directly from the external file as well, since that only comes into play at the very first stage.

EDIT: I should clarify that I need a pairwise intersection, an issue that doesn't arise with the example I gave. To have a better example to work on, try:

import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2', 'value'])
df.col1 = ['item1', 'item1', 'item1', 'item2', 'item2', 'item3', 'item3']
df.col2 = ['a', 'b', 'c', 'a', 'd', 'a', 'c']
df.value = [1, 2, 3, 4, 5, 6, 7]

and try to get as a result:

dict3
(item1,item2):5
(item1,item3):17
(item2,item3):10

It seems like a rather complex problem: I found something on pairwise set intersection here, but I couldn't get to a final solution.

sato

1 Answer


The following works if you don't care about how many items match up (see below for how to do this with only pairwise matching):

In [1]: df
Out[1]:
    col1 col2  value
0  item1    a      1
1  item1    b      2
2  item1    c      4
3  item2    a      8
4  item2    d     16

In [2]: dict1 = df.groupby('col1').apply(lambda vals: set(vals.col2))

In [3]: dict1
Out[3]:
col1
item1    {b, c, a}
item2       {a, d}
dtype: object
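
As an aside, the question also asks for dict2, which the rest of this answer doesn't actually need. A one-line sketch over the same frame:

dict2 = {(row.col1, row.col2): row.value for row in df.itertuples(index=False)}
# {('item1', 'a'): 1, ('item1', 'b'): 2, ('item1', 'c'): 4, ('item2', 'a'): 8, ('item2', 'd'): 16}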

In [4]: dict3 = {tuple(sorted(vals.col1)): sum(vals.value)
                 for kind, vals in df.groupby('col2')
                 if kind in set.intersection(*[dict1[itm] for itm in vals.col1])
                 and len(vals) > 1}

In [5]: dict3
Out[5]: {('item1', 'item2'): 9}
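
For the record, on the sample frame from the question's EDIT (with dict1 rebuilt from it), this same comprehension gives

{('item1', 'item2', 'item3'): 11, ('item1', 'item3'): 10}

i.e. a key spanning three items, which is exactly the behavior the pairwise approach below avoids.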

EDIT:

To do this in just a pairwise fashion, let's use a modified join:

df.reset_index(inplace=True)  # expose the row number as an 'index' column for the self-join

merged = pd.merge(df, df, on='col2')
merged = merged[merged.index_x < merged.index_y]  # keep each unordered row pair exactly once

idxs, vals = zip(*[(tuple(sorted([row.col1_x, row.col1_y])), row.value_x + row.value_y)
                   for row in merged.itertuples(False)])
final = pd.DataFrame(list(vals), index=list(idxs))

Our values are:

In [1]: df
Out[1]:
   index   col1 col2  value
0      0  item1    a      1
1      1  item1    b      2
2      2  item1    c      3
3      3  item2    a      4
4      4  item2    d      5
5      5  item3    a      6
6      6  item3    c      7

In [2]: merged
Out[2]:
    index_x col1_x col2  value_x  index_y col1_y  value_y
1         0  item1    a        1        3  item2        4
2         0  item1    a        1        5  item3        6
5         3  item2    a        4        5  item3        6
11        2  item1    c        3        6  item3        7

In [3]: final
Out[3]:
                 0
(item1, item2)   5
(item1, item3)   7
(item2, item3)  10
(item1, item3)  10
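
Note the two (item1, item3) rows: pandas keeps both contributions. As noted in the comments below, one final grouping step collapses them (a sketch; `level=0` groups on the tuple-valued index here):

final.groupby(level=0).sum()
#                  0
# (item1, item2)   5
# (item1, item3)  17
# (item2, item3)  10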

The merge operation might soak up a fair bit of memory, depending on your data size. Pandas doesn't support a more complex merge predicate that would take our inequality into account as well, and as far as I know there's no memory-efficient one-step way to do this without coding the join yourself (which you could do with a dictionary comprehension; see below). Really, though, if data size is that much of an issue, you might want to consider Spark DataFrames instead, which should be much more scalable (I don't know if that's necessarily true on a single machine, but I suspect it is) without requiring dramatically different code.
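
For what it's worth, here is a rough PySpark sketch of the same join-filter-sum; it is untested here and just illustrates the shape of the code (it assumes a local SparkSession, and uses a generated id column in place of the pandas index):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same data as before; a generated unique id plays the role of the pandas index
sdf = spark.createDataFrame(df).withColumn('idx', F.monotonically_increasing_id())

# Self-join on col2, keeping each unordered row pair once via the id inequality
x, y = sdf.alias('x'), sdf.alias('y')
pairs = x.join(y, (F.col('x.col2') == F.col('y.col2')) & (F.col('x.idx') < F.col('y.idx')))

# least/greatest take the place of tuple(sorted(...)); then sum per item pair
result = (pairs
          .select(F.least(F.col('x.col1'), F.col('y.col1')).alias('item_a'),
                  F.greatest(F.col('x.col1'), F.col('y.col1')).alias('item_b'),
                  (F.col('x.value') + F.col('y.value')).alias('pair_value'))
          .groupBy('item_a', 'item_b')
          .agg(F.sum('pair_value').alias('total')))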

To do this with a dictionary comprehension, I think your code could look something like the following:

final_dict = {
    tuple(sorted((a.col1, b.col1))): a.value + b.value
    for a_index, a in df.iterrows()
    for b_index, b in df.iterrows()
    if a_index < b_index
    and a.col2 == b.col2
}

# {('item1', 'item2'): 5, ('item1', 'item3'): 10, ('item2', 'item3'): 10}

Note how the dictionary silently overwrites duplicate key entries, whereas pandas retains them. That's an ambiguity you'll need to decide how to resolve.
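
If you want the colliding keys to accumulate instead (which matches the desired dict3), one option touched on in the comments below is to trade the comprehension for a defaultdict; a sketch:

from collections import defaultdict

# defaultdict(int) sums the colliding contributions;
# use defaultdict(list) with .append() if you'd rather keep them all
final_dict = defaultdict(int)
for a_index, a in df.iterrows():
    for b_index, b in df.iterrows():
        if a_index < b_index and a.col2 == b.col2:
            final_dict[tuple(sorted((a.col1, b.col1)))] += a.value + b.value

# {('item1', 'item2'): 5, ('item1', 'item3'): 17, ('item2', 'item3'): 10}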

scnerd
  • Sorry, the collision between `a` and `c` was causing only `c`'s value to come out in the final dictionary. I fixed it to better reflect your sample output (by tweaking your sample input slightly) – scnerd Mar 12 '18 at 18:16
  • More or less: this way the output in dict3, considering that in the real dataset I have more items, is `(item1, item2, item3, ..., item n): ###`, because I think it takes the intersection between all these items and sums up the values. But I need exactly two items in the keys, because in the end it must be some sort of dictionary of pairwise interactions – sato Mar 13 '18 at 09:04
  • `df=pd.DataFrame(columns=['col1','col2','value']) df.col1=['item1','item1','item1','item2','item2','item3','item3'] df.col2=['a','b','c','a','d','a','c'] df.value=[1,2,3,4,5,6,7]` – sato Mar 13 '18 at 09:52
  • Thanks @scnerd! It does exactly what I need if in the end I call `final.groupby().sum`. I'll mark this as the right answer, but I actually have memory issues: the dataset is pretty big and 800 GB of RAM are not enough. I'll try Spark DataFrames even though I have never used them. – sato Mar 13 '18 at 20:35
  • Glad to hear that helps. Yeah, when you start talking hundreds of GBs of data, it's probably best to move to a distributed framework of some sort (such as Spark). Pandas is powerful for small-scale data, but it's not engineered to handle truly large-scale problems. If you can use PySpark's DataFrame API, the transition shouldn't be too hard. Best of luck. – scnerd Mar 13 '18 at 20:46
  • For reference, the dictionary implementation I gave should also be a lot easier on memory. If this is just a one-off transformation, you might give that a shot and convert the final result back into a dataframe. – scnerd Mar 13 '18 at 20:47
  • I'll try, and maybe ask again if I have some particular issue :D thanks again! – sato Mar 13 '18 at 20:47
  • Yes, I still haven't tried to run the dict comprehension. I'll let you know the outcome. – sato Mar 13 '18 at 20:48
  • How would you resolve the ambiguity if you wanted the duplicate key entries? By appending the duplicates' values to a single list using a defaultdict? I'm not finding it easy with a list comprehension – sato Mar 14 '18 at 09:49
  • Doing a list comprehension into tuples, rather than a dictionary comprehension, would be the easiest starting point. From there you can decide how to convert the list of tuples into your desired data structure. – scnerd Mar 14 '18 at 16:14