
Say I have 2 dataframes:

df1

     id       guid               name      item1        item2        item3        item4        item5         item6       item7        item8       item9
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  85558-40647  85558-41305  85558-79979  85558-33076  85558-89956  85558-12554
1  3031958127  85558-261955282  Jeff    85558-57439  85558-39280  85558-91361  85558-55987  85558-83083  85558-79979  85558-33076  85558-41872  85558-12554
2  3031958129  85558-261955282  Mike    85558-57439  85558-39280  85558-91361  85558-55987  85558-40647  85558-79979  85558-33076  85558-88297  85558-12534
...

df2 where item_lookup is the index

             item_type   cost  value  target 
item_lookup
85558-57439  item1       9500   25.1   1.9
85558-54608  item2       8000   18.7   0.0 
85558-91361  item3       7000   16.5   0.9
...

I want to add the sum of cost, value, and target for each item1 through item9 using item_lookup (df2) and store that as a column on df1.

So the result should look like: df1

     id       guid               name      item1        item2        item3        item4        item5         item6       item7        item8       item9       cost   value  target
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  85558-40647  85558-41305  85558-79979  85558-33076  85558-89956  85558-12554  58000   192.5   38.3
1  3031958127  85558-261955282  Jeff    85558-57439  85558-39280  85558-91361  85558-55987  85558-83083  85558-79979  85558-33076  85558-41872  85558-12554  59400   183.2   87.7
2  3031958129  85558-261955282  Mike    85558-57439  85558-39280  85558-91361  85558-55987  85558-40647  85558-79979  85558-33076  85558-88297  85558-12534  58000   101.5   18.1
...

I've tried following similar solutions online that use .map, however these examples are only for single columns whereas I am trying to sum values for 9 columns.

Sean Sailer
  • Please share more detail. How will you add the columns? And how will you merge them to df2? The input and output cannot be related to each other in this sample. – ali bakhtiari Jan 07 '23 at 20:26
  • @alibakhtiari I am asking the question to figure out how to add the columns. They are new columns on df1, resulting from the sum of item1 through item9 lookups in df2. For example, in the resulting df1, column "cost" for row 0 would be the sum of df2['cost'] for item1, df2['cost'] for item2, and so on for all 9 items. Likewise for the value and target columns. – Sean Sailer Jan 07 '23 at 20:46
  • So do you want to group by `cost, value, target`, sum over these groups, and add them to df1? – ali bakhtiari Jan 07 '23 at 20:57
  • The result should be added to df1, not df2. And I'm not sure if groupby is what I want. I want to sum the cost, value, and target columns of df2 for each item in each row of df1. For example, on row 0 there are 9 items with a key like 85558-57439. This key is the index of df2, and df2 has the cost, value, and target columns associated with this key. For each of the 9 items in df1, I want to look up the cost, value, and target of those 9 items, sum them, and add them as their own columns in df1. – Sean Sailer Jan 07 '23 at 21:01
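The lookup-and-sum described in these comments can be sketched with small toy data (column names taken from the question, values invented), using `Series.map` against each metric column of df2:

```python
import pandas as pd

# Toy stand-ins for the frames in the question (values invented).
df1 = pd.DataFrame({
    "id": [1, 2],
    "item1": ["a", "b"],
    "item2": ["b", "c"],
})
df2 = pd.DataFrame(
    {"cost": [10, 20, 30], "value": [1.0, 2.0, 3.0], "target": [0.1, 0.2, 0.3]},
    index=pd.Index(["a", "b", "c"], name="item_lookup"),
)

item_cols = [c for c in df1.columns if c.startswith("item")]
# For each metric, map every item column through df2's index-aligned
# Series and sum the per-item results row-wise.
for col in ["cost", "value", "target"]:
    df1[col] = sum(df1[c].map(df2[col]) for c in item_cols)
```

This is only a sketch of the operation being asked about; the answers below give complete approaches.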

3 Answers


You can do this with `df.apply`: loop through the rows, and for each row loop through the item columns, look each item up, and accumulate the sums.

Since I couldn't use your dataframes as given (they are incomplete), I made my own.

given df1:

  item1 item2 item3
0     b     e     j
1     d     a     d
2     j     b     a
3     c     j     f
4     e     f     c
5     a     d     b
6     f     c     e

and df2

             cost  value  target
item_lookup                     
a              19     20      12
b              16     14      14
c              20     18      18
d              17     12      14
e              20     15      17
f              19     20      12
j              11     17      12
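For reference, these toy frames can be reconstructed as follows (values copied from the tables above):

```python
import pandas as pd

df1 = pd.DataFrame({
    "item1": list("bdjceaf"),
    "item2": list("eabjfdc"),
    "item3": list("jdafcbe"),
})
df2 = pd.DataFrame(
    {
        "cost":   [19, 16, 20, 17, 20, 19, 11],
        "value":  [20, 14, 18, 12, 15, 20, 17],
        "target": [12, 14, 18, 14, 17, 12, 12],
    },
    index=pd.Index(list("abcdefj"), name="item_lookup"),
)
```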

you can use the following function to get what you need

def add_items(row):
    row["cost"] = row["target"] = row["value"] = 0
    # get the columns that have "item" in the name
    cols = [col for col in df1.columns if "item" in col]
    # look each item up in df2 and add its metrics to the new columns
    for col in cols:
        item_lookup = row[col]
        lookup_result = df2.loc[item_lookup]
        row["cost"] += lookup_result["cost"]
        row["target"] += lookup_result["target"]
        row["value"] += lookup_result["value"]
    return row

and then apply it

>>> df1.apply(add_items, axis=1)
  item1 item2 item3  cost  target  value
0     b     e     j    47      43     46
1     d     a     d    53      40     44
2     j     b     a    46      38     51
3     c     j     f    50      42     55
4     e     f     c    59      47     53
5     a     d     b    52      40     46
6     f     c     e    59      47     53
zaki98
  • Thank you! this is the solution. I only had to change the apply statement to re-assign the dataframe afterwards: `df1 = df1.apply(add_items, axis=1)` – Sean Sailer Jan 07 '23 at 21:18
  • Using multiple loops and `.loc` for assignments is not an efficient solution for large datasets. Check out the answer I just posted. – ali bakhtiari Jan 07 '23 at 21:30
  • 3
    @alibakhtiari i usually don't worry about performance until it becomes a problem. so for the question he asked this worked – zaki98 Jan 07 '23 at 21:34

I have a simpler solution. First, save the item_lookup values that correspond to cost, value, and target in dictionaries, then use .map() and .sum() to create the columns:

# move item_lookup out of the index so it can be zipped into dicts
df2.reset_index(drop=False, inplace=True)

map_cost = dict(zip(df2['item_lookup'], df2['cost']))
map_value = dict(zip(df2['item_lookup'], df2['value']))
map_target = dict(zip(df2['item_lookup'], df2['target']))

# map every column through the dict (non-item columns such as id, guid,
# and name map to NaN and are ignored by the row-wise sum)
df1['cost'] = df1.apply(lambda x: x.map(map_cost)).sum(axis=1)
df1['value'] = df1.apply(lambda x: x.map(map_value)).sum(axis=1)
df1['target'] = df1.apply(lambda x: x.map(map_target)).sum(axis=1)


df1

Output:

           id             guid    name        item1        item2        item3     cost  value  target
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  24500.0   60.3     2.8
1  3031958127  85558-261955282    Jeff  85558-57439  85558-39280  85558-91361  16500.0   41.6     2.8
2  3031958129  85558-261955282    Mike  85558-57439  85558-39280  85558-91361  16500.0   41.6     2.8
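As a side note, `Series.map` also accepts a Series directly, so the intermediate dictionaries can be skipped, and restricting the map to the item columns avoids mapping `id`/`guid`/`name` to NaN. A minimal sketch, assuming `item_lookup` is the index of df2 (toy values copied from the question):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"cost": [9500, 8000, 7000]},
    index=pd.Index(["85558-57439", "85558-54608", "85558-91361"],
                   name="item_lookup"),
)
df1 = pd.DataFrame({
    "name": ["Alonso"],
    "item1": ["85558-57439"],
    "item2": ["85558-54608"],
    "item3": ["85558-91361"],
})

item_cols = [c for c in df1.columns if c.startswith("item")]
# Map each item column through the cost Series (an index-aligned lookup),
# then sum across the mapped columns.
df1["cost"] = df1[item_cols].apply(lambda col: col.map(df2["cost"])).sum(axis=1)
```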
ali bakhtiari

I prefer the following solution, which has elements of the solutions proposed by @ali bakhtiari and @zaki98 but is more explicit, performant, and flexible. Use applymap, since the item lookup is the same for all item columns, assuming item_lookup in df2 uniquely identifies each row (all the solutions assume this). Unlike the others, this solution also handles the case of an item_lookup in df1 that is not present in df2.

For df1 and df2 as shown in the question (screenshots omitted):

define the columns to sum, `sum_cols`, and the item columns in df1, `item_cols`, then append each summed column to df1:

sum_cols = ['cost', 'value', 'target']
item_cols = [col for col in df1.columns if 'item' in col]

df2.set_index('item_lookup', inplace=True)
for sc in sum_cols:
    # look up each item key; keys missing from df2 contribute 0
    df1[sc] = df1[item_cols] \
                    .applymap(lambda x: df2.at[x, sc] if x in df2.index else 0) \
                    .sum(axis=1)

This problem also seems like a good use case for the more performant `at` over `loc`, as only a single scalar value is looked up each time (see this SO post). It's not necessary to set item_lookup as the index on df2, but doing so should improve performance on large datasets. Note that applymap was deprecated in pandas 2.1 and renamed `DataFrame.map`. If an item_lookup in df1 is not present in df2, you could also substitute NaN instead of 0, e.g. so that the number of missing item_lookup values can be counted, with minimal additional effort.

Output: df1 with the summed cost, value, and target columns appended (screenshot omitted).
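For large frames, the same lookup-and-sum can also be expressed without any Python-level loop by reshaping to long form, merging on the lookup key, and aggregating back per row. A sketch with invented toy data (the real frames have nine item columns and an `id` key, as in the question):

```python
import pandas as pd

# Toy stand-ins (invented values).
df2 = pd.DataFrame(
    {"cost": [10, 20], "value": [1.0, 2.0], "target": [0.5, 0.5]},
    index=pd.Index(["x", "y"], name="item_lookup"),
)
df1 = pd.DataFrame({"id": [1, 2], "item1": ["x", "y"], "item2": ["y", "y"]})

item_cols = [c for c in df1.columns if c.startswith("item")]
# Reshape to long form: one row per (df1 row, item slot).
long_form = df1.melt(id_vars="id", value_vars=item_cols,
                     value_name="item_lookup")
# Merge the per-item metrics on the lookup key, then sum back per id.
sums = (
    long_form.merge(df2, left_on="item_lookup", right_index=True, how="left")
    .groupby("id")[["cost", "value", "target"]]
    .sum()
)
df1 = df1.join(sums, on="id")
```

With `how="left"`, items missing from df2 produce NaN, which the groupby sum skips, so missing lookups contribute 0, matching the behavior of the loop above.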