python panda groupby and eliminate duplicates

Question

I want to go to as few stores as possible to get my products. How do I do this? I have a list of stores that carry a specific product.

wanted_Products = pd.DataFrame({'p':[1,2,3,4,5,6,7]})
stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                       'Product': [1,2,3,5,0,2,3,4,0,6,7,8,0,1,2,6]})
# return 1 if the Product is wanted
stores['Wanted'] = stores.Product.isin(wanted_Products.p).values.astype(int)

     Store  Product  Wanted
0       1        1       1
1       1        2       1
2       1        3       1
3       1        5       1
4       2        0       0
5       2        2       1
6       2        3       1
7       2        4       1
8       3        0       0
9       3        6       1
10      3        7       1
11      3        8       0
12      4        0       0
13      4        1       1
14      4        2       1
15      4        6       1

# Group products per store and calculate how many wanted products are in a store
w = stores.groupby('Store', as_index=False).agg(list)
w['Number_wanted'] = stores.groupby('Store', as_index=False)['Wanted'].sum().agg(list)['Wanted']

      Store  Product        Wanted         Number_wanted  ?Products_wanted?
0      1  [1, 2, 3, 5]  [1, 1, 1, 1]              4            [1,2,3,5]
1      2  [0, 2, 3, 4]  [0, 1, 1, 1]              3            [2,3,4]
2      3  [0, 6, 7, 8]  [0, 1, 1, 0]              2            [6,7]
3      4  [0, 1, 2, 6]  [0, 1, 1, 1]              3            [1,2,6]

How do I get the Products I want in a new column (Products_wanted) without the non wanted products? when I use isin() I only get true/false (1/0 if I use astype(int)) not the actual numbers.

What have you tried so far and what went wrong with your attempts? Please [edit] to include a [mcve] — G. Anderson, Jun 08 '21 at 17:47
A good place to start: [Pandas groupby list of values](https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby/) — G. Anderson, Jun 08 '21 at 17:49
I treid something that i did with whitelisting stores = (stores.assign(key=lambda x: x["Store"].sub(0)) .merge(wanted_products left_on="key", right_index=True, how="left").drop("key", axis="columns")). First I make wanted_products to a panda — Johanna Marklund, Jun 08 '21 at 17:50
@Johanna Marklund you should definitely include your above comment in the question itself. That will show people that you have made a good faith effort to solve the problem, and can also help other people build upon what you have tried to get you an answer — Derek O, Jun 09 '21 at 05:19

score 2 · Accepted Answer · answered Jun 08 '21 at 19:35

A way to do this could be to keep track of all products that are availabe in a store, get them and then mark those products as "taken" so that you do not pick the same in next stores.

So initially you have wanted_Products= [1,2,3,4,5,6,7]. Since you are getting [1, 2, 3, 5] from store 1, you select and return these as products to get from store 1, and then mark all these as "taken" simply by replacing these values in wanted_Products to something else like say -1 (or some other value you like, to denote they have been taken).

Now wanted_Products= [-1,-1,-1,4,-1,6,7]. -1 ones are taken so you only have [4,6,7] products to get from next stores. Repeating the same logic for all stores will give you the products to get from there, without any duplicates:

def get_products(possible, wanted):
    i = np.where(np.in1d(wanted, possible))
    available = wanted[i]
    wanted[i] = -1
    return available

w = stores.groupby('Store', as_index=False).agg(list)
w['Products to get'] = w.Product.apply(get_products, args=(np.array(wanted_Products),))

Output:

>>> w
   Store       Product Products to get
0      1  [1, 2, 3, 5]    [1, 2, 3, 5]
1      2  [0, 2, 3, 4]             [4]
2      3  [0, 6, 7, 8]          [6, 7]
3      4  [0, 1, 2, 6]              []

It almost works, but if most products are in store 3 or 4 it will not work. — Johanna Marklund, Jun 09 '21 at 06:23
I didn't get you. Can you give an example where it won't work? — Ank, Jun 09 '21 at 06:48
if you switch products from store 1 and 3, then the product 2,3 will be taken in store 2 even if more products are gathered from store 3( after switch) I have used your as_index for getting a nice dataFrame. thank you, this was also a problem that I ran into. — Johanna Marklund, Jun 09 '21 at 07:06

score 1 · Answer 2 · answered Jun 09 '21 at 18:19

To respect your optimization criteria (always buy from the store that has the greatest number of products from your list), the list of products per store needs to sorted again on every iteration: every time you decide that you're getting a set of products from a given store, the remaining lists need to be cleaned up (remove already-bought products) and re-ordered by length.

As a technical note, I'm converting your lists to sets, because you don't want duplicates, so it's ok to do it, and it gives us set operations : intersection (to check which wanted products are available in a given store) and difference (remove already boughtproducts from the wanted list.)

The code is not too elegant, but I included quite a few comments:

stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                   'Product': [1,2,3,5,0,2,3,4,0,6,7,8,0,1,2,6]})
# stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
#                    'Product': [0,2,7,6,0,2,4,8,1,2,7,6,1,2,3,5]})

w = stores.groupby('Store', as_index=False).agg(list)
w['Products to get'] = np.nan
w['Products to get'] = w['Products to get'].astype('object')

wanted_Products = [1,2,3,4,5,6,7]
wanted = set(wanted_Products)

tmp = w[['Store', 'Product']]
while len(wanted) > 0:
    # Removed unwanted products (set intersection)
    tmp['Product'] = tmp.Product.apply(lambda x: set(x) & wanted)
    
    # Sort on length of product sets
    tmp['lengths'] = tmp.Product.str.len()
    tmp = tmp.sort_values(by='lengths', ascending=False).drop('lengths', 1)

    # Get products from this store, remove them from wanted set
    get = tmp.loc[tmp.index[0], 'Product'] & wanted
    wanted -= get

    # Update Products to get for this store
    row = w[w['Store'] == tmp.loc[tmp.index[0], 'Store']]
    w.at[row.index[0], 'Products to get'] = get

    # Remove the largest product set, work on the remaining ones
    tmp = tmp.iloc[1:, ]

Here's the output:

In [71]: w
Out[71]: 
   Store       Product Products to get
0      1  [1, 2, 3, 5]    {1, 2, 3, 5}
1      2  [0, 2, 3, 4]             {4}
2      3  [0, 6, 7, 8]          {6, 7}
3      4  [0, 1, 2, 6]             NaN

With more products in stores 3 and 4, it still works:

stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                   'Product': [0,2,7,6,0,2,4,8,1,2,7,6,1,2,3,5]})

Output is:

In [76]: w
Out[76]: 
   Store       Product Products to get
0      1  [0, 2, 7, 6]             NaN
1      2  [0, 2, 4, 8]             {4}
2      3  [1, 2, 7, 6]    {1, 2, 6, 7}
3      4  [1, 2, 3, 5]          {3, 5}

python panda groupby and eliminate duplicates

2 Answers2