1

I have a dataframe which looks like this.

s_id  h_id   h_val  h_others
1      600     5    {700,500}
1      700     12   {600,500,400}
1      500     6    {600,700}
2     ...     ...    ...

What I want to do is, when grouped by s_id, iterate over h_others, see if each id in the dictionaries is found in h_id for this particular s_id. If it is found, I want to map its value which can be found in h_val, add them up, and create a new column with the sum of the mapped values of h_others. If it is not found, the id can just be mapped to 0, so that it doesn't impact the sum.

Expected output:

s_id  h_id   h_val  h_others       sum_h_others
1      600     5    {700,500}       18     
1      700     12   {600,500,400}   11
1      500     6    {600,700}       17     
2     ...     ...    ...
Ozil
  • 21
  • 3

2 Answers2

1

Let's borrow an unnesting function from @WeNYoBen, but modify it slightly so it works with your sets. Then the calculation can then be done with a merge.

from itertools import chain 

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: [*chain.from_iterable(df[x].to_numpy())]}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

df1 = unnesting(df, explode=['h_others'])

s = (df1.reset_index().merge(df.reset_index(), 
                             left_on=['h_others', 's_id'], 
                             right_on=['h_id', 's_id'])
         .query('index_x != index_y')
         .groupby('index_x').h_val_y.sum())

df['sum_h_others'] = s

Output:

   s_id  h_id  h_val         h_others  sum_h_others
0     1   600      5       {700, 500}            18
1     1   700     12  {600, 500, 400}            11
2     1   500      6       {600, 700}            17

A more straight forward option is to map after unnesting, but the apply makes this slower:

d = {(k1, k2): v for k1, k2, v in zip(*df[['s_id', 'h_id', 'h_val']].to_numpy().T)}
#{(1, 500): 6, (1, 600): 5, (1, 700): 12}

df['sum_h_others'] = df1[['s_id', 'h_others']].apply(tuple, 1).map(d).groupby(level=0).sum()
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • 1
    Nice answer +1. Maybe worth copy/pasting the borrowed function definition into the answer to make it easier to follow? The linked question has a lot of content, including two `unnesting` definitions, only one of which works in this case. It's an interesting read though so maybe it's a good thing that readers are forced to go look at the link :) – sjw Jul 10 '19 at 16:54
  • This works. However, could you please explain what's going on after the unnesting? The code itself is a little beyond my understanding. Thanks! – Ozil Jul 10 '19 at 18:14
1

Here is a possible way to do this:

import pandas as pd
import ast
from io import StringIO
df = pd.read_table(StringIO("""s_id  h_id   h_val  h_others
1      600     5    {700,500}
1      700     12   {600,500,400}
1      500     6    {600,700}"""), sep='\s+')

summs = []
for s_id, s in list(zip(df.s_id, df.h_others.values)):
    df['sum_h_others'] = 0
    summ = 0
    for d in ast.literal_eval(s):
        try:
            summ += sum(df.loc[df['s_id'] == s_id].loc[(df['h_id'] == d), 'h_val'].values)
        except IndexError:
            pass
    summs.append(summ)
df['sum_h_others'] = summs

Output:

   s_id  h_id  h_val       h_others  sum_h_others
0     1   600      5      {700,500}            18
1     1   700     12  {600,500,400}            11
2     1   500      6      {600,700}            17
cosmic_inquiry
  • 2,557
  • 11
  • 23