compare values in two rows of a dataframe, if they have the same value in another column

Question

A lot of questions been asked about similar topic but none seem to work for me. The closest one are In a dataframe, I want to compare column A and column B and extract value in which A >= B?. I need to find matching values in one column and if they are the same. In the case of this example, I want to get rid of all the suppliers that do not have the biggest number of orders placed. At this point nothing that i have tried seem to do the trick.

Here is a toy dataset to illustrate the problem:

 data = pd.DataFrame({'Id' : ['001', '001', '001', '001', '002', '002', '002', '003', '003', '003'],
                     'Supplier' : ['ABC', 'ABC', 'ABZ', 'ANA', 'ABA', 'ABC', 'ASS', 'ASS', 'ASS', 'ASS'],
                     'BuyPrice' : [7,7,7,8,7,5,7,7,7,7],
                     'Quantity' : [200,20,50,40,210,47,66,53,100,60]})

and what I am trying to achieve is:

dd = pd.DataFrame({'Id': ['001', '002', '003'],
                     'main_supplier' : ['ABC', 'ABA', 'ASS'],
                     'BuyPrice' : [7,7,7]})

where data_result['BuyPrice'] is the price associated with the main supplier

I have been able to do this:

dd = pd.DataFrame()
dd['Id'] = data1['Id']
dd['number_of_orders'] = pd.DataFrame(data1['Quantity']).applymap(lambda x: len(x))
dd['BuyPrice'] = pd.DataFrame(data1['BuyPrice']).applymap(lambda x: np.mean(x))
dd['Supplier'] = data1['Supplier']

for x in dd['Id']:
    if dd['Id'] == dd['Id']:

        dd['Principal_supplier'] = np.where(np.where((dd['number_of_orders'][x] > dd['number_of_order'][x-1]), 'Yes'))

    else:
        dd['Principal_supplier'] = 'No'
drop_Secondary_supplier = dd[dd['Principal_supplier' != 'Yes']].drop()

and i get the following error:

    self.__class__.__name__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am struggling a lot of with the logic of comparing the "number of orders" if and only if the "id" are the same.

Any help would be huge!

I try run your code and not working. Can you explain more your output DataFrame? `'BuyPrice' : [7,5,7]` ? — jezrael, Jan 09 '20 at 13:08
i made a typo, I adjusted it. In the code I calculate it differently, as in my real dataset, the buy price may differ although the supplier is the same — Murcielago, Jan 09 '20 at 13:47
What is your logic of data cleanning? Which observations you want to keep? I mean I am confused about your `data_result` — Travis, Jan 09 '20 at 14:18
i want to get rid of all the suppliers that do not constitute the biggest number of orders, in the goal of saving only the "principal supplier" — Murcielago, Jan 09 '20 at 14:20

score 1 · Accepted Answer · answered Jan 09 '20 at 14:49

1

You can groupby based on Id and get the index of it and display the corresponding data. It's pretty much self explanatory. Let me know if you have any questions.

 data.loc[data.groupby('Id')['Quantity'].idxmax(), ['Id', 'BuyPrice', 'Supplier']]

answered Jan 09 '20 at 14:49

PraveenB

1,270
10
11

compare values in two rows of a dataframe, if they have the same value in another column

1 Answers1