Filling columns based on other dataframe columns

Question

I have two data sets

    df1 = pd.DataFrame ({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
    df2 = pd.DataFrame ({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})

I want to fill price in df1 from salesprice and regularprice in df2, with these conditions: if the skuid in df1 matches the skuid in df2 and salesprice is not zero, use salesprice as the price value. If the skuids match and salesprice is zero, use regularprice. Otherwise, use zero as the price value.

def pric(df1,df2):
if (df1['skuid'] == df2['skuid'] and salesprice !=0): 
 price = salesprice 
elif (df1['skuid'] == df2['skuid'] and regularprice !=0):
 price = regularprice
else:
 price = 0

I made a function with similar conditions, but it's not working. The result in df1 should look like this:

skuid  price
  A      10
  B      10
  C      0
  D      30

Thanks.

It looks like the problem is with your function. Can you include it in your problem statement (and eliminate the pseudo-code, which should be deducible from the function) — Josh Purtell, Oct 15 '20 at 11:58
Yeah it looks like there are some issues with the function. Give me a second I'll write up a quick answer — Josh Purtell, Oct 15 '20 at 12:12

score 1 · Answer 1 · answered Oct 15 '20 at 13:00

So there are a number of issues with the function given above. Here are a few in no particular order:

Indentation in Python matters: https://docs.python.org/2.0/ref/indentation.html
Vectorized functions versus loops. The function you give looks vaguely like it expects to be applied on a vectorized basis, but Python doesn't work like that: comparing two columns with == yields a whole Series of booleans, which cannot be used directly as the condition of an if statement. You need to loop through the rows you want to look at (https://wiki.python.org/moin/ForLoop). While pandas does support column transformations (which work without loops), they need to be invoked specifically (here's some documentation for one instance of such functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
Relatedly, accessing dataframe elements and indexing: see "Indexing Pandas data frames: integer rows, named columns".
Return: if you want your Python function to give you a result, you should have it return the value. Not all programming languages require this (Julia, for example), but in Python you must.
Generality. This isn't strictly necessary in a one-off application, but your function is vulnerable to breaking if you change, for example, the column names in the dataframe. It is better practice to allow the user to give the relevant names in the input, for this reason and for simple flexibility.

Here is a version of your function which was more or less minimally changed to fix the specific issues above:

import pandas as pd

df1 = pd.DataFrame({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})


def pric(df1, df2, id_colname, df1_price_colname, df2_salesprice_colname, df2_regularprice_colname):
    for i in range(df1.shape[0]):
        for j in range(df2.shape[0]):
            if (df1.loc[df1.index[i],id_colname] == df2.loc[df2.index[j],id_colname] and df2.loc[df2.index[j],df2_salesprice_colname] != 0):
                df1.loc[df1.index[i],df1_price_colname] = df2.loc[df2.index[j],df2_salesprice_colname]
                break
            elif (df1.loc[df1.index[i],id_colname] == df2.loc[df2.index[j],id_colname] and df2.loc[df2.index[j],df2_regularprice_colname] != 0):
                df1.loc[df1.index[i],df1_price_colname] = df2.loc[df2.index[j],df2_regularprice_colname]
                break
    return df1

for which entering


df1_imputed=pric(df1,df2,'skuid','price','salesprice','regularprice')
print(df1_imputed['price'])

gives

0    10
1    10
2     0
3    30
Name: price, dtype: int64

Notice how the function loops through row indices before checking equality conditions on specific elements given by a row-index / column pair.
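
To make the row-index / column pair concrete, here is a small sketch of that access pattern on its own. The frame below is hypothetical (it is not the asker's data) and deliberately uses a non-default index, so that row labels and row positions differ.

```python
import pandas as pd

# Hypothetical frame with a non-default index, so labels != positions
df = pd.DataFrame(
    {"skuid": ["A", "B", "C"], "price": [5, 6, 7]},
    index=[10, 20, 30],
)

i = 1                            # integer position of the second row
label = df.index[i]              # the row label stored at that position
value = df.loc[label, "price"]   # label-based lookup of a single element

# The same element reached purely by position
same_value = df.iloc[i, df.columns.get_loc("price")]

print(label, value, same_value)  # prints: 20 6 6
```

Going through df.index[i] lets a loop over integer positions keep using .loc with column names, whatever the row labels happen to be.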

A few things to consider:

Why does the code loop through df1 "above" the loop through df2? Relatedly, what purpose does the break condition serve?
Why was the else condition omitted?
What is 'df1.loc[df1.index[i],id_colname]' all about? (hint: check one of the above links)
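
For comparison with the loop-based function, here is a minimal sketch of a loop-free approach. It is one possible way to do it, not the only one. The name fill_price and its default arguments are hypothetical, and the sketch assumes that each skuid appears at most once in df2 and that df2 has no missing prices.

```python
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "price": (0, 0, 0, 0)})
df2 = pd.DataFrame({"skuid": ("A", "B", "C", "D"),
                    "salesprice": (10, 0, 0, 30),
                    "regularprice": (9, 10, 0, 2)})


def fill_price(df1, df2, id_col="skuid", price_col="price",
               sales_col="salesprice", regular_col="regularprice"):
    """Hypothetical helper: return a copy of df1 with price_col filled from df2."""
    # A left merge keeps every row of df1 in its original order.
    # Assumption: ids are unique in df2, so the row count does not change.
    merged = df1[[id_col]].merge(
        df2[[id_col, sales_col, regular_col]], on=id_col, how="left"
    )
    # Keep salesprice where it is nonzero, otherwise fall back to regularprice.
    price = merged[sales_col].where(merged[sales_col] != 0, merged[regular_col])
    out = df1.copy()
    # Ids missing from df2 come out of the merge as NaN; use 0 for those.
    out[price_col] = price.fillna(0).to_numpy()
    return out


result = fill_price(df1, df2)
print(result)
```

With the sample frames above, the price column of result holds 10, 10, 0, 30. Unlike pric, this sketch leaves df1 itself unchanged because it works on a copy.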

Hi, when I run this, it never ends, the kernel keeps on running without showing any results. Why is that ? and how do I see results? thanks — Praveen Bushipaka, Oct 19 '20 at 15:44
@PraveenBushipaka you ran the exact code I have above and this happened? I just tried and I didn't encounter this problem — Josh Purtell, Oct 19 '20 at 17:36
