3

I have two pandas DataFrame, which called df1 and df2. I want to sum the list values in df2 which the list's values come from df1.

For example:

df1:

df1 = pd.DataFrame([['a',11],['b',13],['c',45],['d',88]],columns=['name1','data1'])
df1

    name1   data1
0      a       11
1      b       13
2      c       45
3      d       88

and df2:

df2 = pd.DataFrame([['a',['b','c','d']],['b',['a','c']]],columns=['name2','data2'])
df2

    name2         data2
0      a      [b, c, d]
1      b         [a, c]

Finally, I want this:

    name2   data2
0      a      146
1      b       56

How? Thanks a lot.

running man
  • 1,377
  • 4
  • 14
  • 21

4 Answers4

3

First create dictionary by df1 and then list comprehension with get for map value of dict, if values not matched is added 0 to sum:

d = df1.set_index('name1')['data1'].to_dict()
df2['data2'] = [sum(d.get(y, 0) for y in x) for x in df2['data2']]
print (df2)

  name2  data2
0     a    146
1     b     56

If want remove NaNs is possible use filter with condition:

df1 = pd.DataFrame([['a',11],['b',13],['c',45],['d',np.nan]],columns=['name1','data1'])
print (df1)
  name1  data1
0     a   11.0
1     b   13.0
2     c   45.0
3     d    NaN

df2 = pd.DataFrame([['a',['b','c','d']],['b',['a','c']]],columns=['name2','data2'])

d = df1.set_index('name1')['data1'].to_dict()
df2['data2'] = [sum(filter(lambda v: v==v, (d.get(y, 0) for y in x))) for x in df2['data2']]
print (df2)

  name2  data2
0     a   58.0
1     b   56.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

Can also do

d = dict(df1.values)
df2['s'] = df2.data2.transform(lambda v: pd.Series(v).map(d)).sum(1) 

0    146.0
1     56.0
dtype: float6

or

df2.data2.transform(lambda l: sum(d[i] for i in l))

0    146.0
1     56.0
dtype: float6
rafaelc
  • 57,686
  • 15
  • 58
  • 82
1

You can use pivot on df1 to get names into columns, then index into df2:

pivoted = df1.pivot(columns="name1").data1.sum()
df2.data2 = df2.data2.apply(lambda x: pivoted[x].sum())

  name2  data2
0     a  146.0
1     b   56.0
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
1

You can use collections.defaultdict with dict.__getitem__:

from collections import defaultdict

d = defaultdict(int, df1.set_index('name1')['data1'].to_dict())

df2['sum'] = [sum(map(d.__getitem__, x)) for x in df2['data2']]

print(df2)

  name2      data2  sum
0     a  [b, c, d]  146
1     b  [a, c, e]   56

This will be more efficient than generator expressions for larger dataframes:

from collections import defaultdict

def jpp(df1, df2):
    d = defaultdict(int, df1.set_index('name1')['data1'].to_dict())
    return [sum(map(d.__getitem__, x)) for x in df2['data2']]

def jez(df1, df2):
    d = df1.set_index('name1')['data1'].to_dict()
    return [sum(d.get(y, 0) for y in x) for x in df2['data2']]

df2 = pd.concat([df2]*10000)

%timeit jpp(df1, df2)  # 32.8 ms per loop
%timeit jez(df1, df2)  # 49.1 ms per loop
jpp
  • 159,742
  • 34
  • 281
  • 339