I want to write a function that iterates through a DataFrame and uses each row's values as arguments. For example:

My pandas dataframe is as follows:

category  sales  met_sales
fruit     100    False
books     200    False
fruit     300    False

I have a dictionary: required_sales = {'fruit':150, 'books':200}

The output I want is this:

category  sales  met_sales
fruit     100    False
books     200    True
fruit     300    True

Is it correct to structure my function like this?

def met_sales(df, dict):
    for row in df:
        if row.sales > required_sales[row.category]:
             #update met_sales column
             row.met_sales = True

Then, I can simply call met_sales(df, required_sales) to update my DataFrame.

Is this a good way of using self-created functions to modify my DataFrame?


1 Answer


Use Series.map with the dictionary and compare against the sales column:

df['met_sales'] = df['sales'] >= df['category'].map(required_sales)
print (df)
  category  sales  met_sales
0    fruit    100      False
1    books    200       True
2    fruit    300       True

Detail:

print (df['category'].map(required_sales))
0    150
1    200
2    150
Name: category, dtype: int64

Function:

Don't use dict as a variable name, as it shadows the built-in Python dict type.
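A quick illustration of why shadowing dict is a problem (a hypothetical snippet, not from the original answer):

# Rebinding the name `dict` hides the built-in type for the rest of the scope
dict = {'fruit': 150}
try:
    dict(fruit=150)      # the built-in constructor is no longer reachable here
except TypeError as e:
    print(e)             # 'dict' object is not callable
del dict                 # remove the shadowing name to restore the built-in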

def met_sales(df, d):
    df['met_sales'] = df['sales'] >= df['category'].map(d)
    return df

df1 = met_sales(df,required_sales)
print (df1)
  category  sales  met_sales
0    fruit    100      False
1    books    200       True
2    fruit    300       True

Notice:

All values of category need to be present in your dictionary, otherwise NaN is returned for missing keys:

required_sales = {'fruit':150}

print (df['category'].map(required_sales))
0    150.0
1      NaN
2    150.0
Name: category, dtype: float64
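
If some categories can be missing from the dictionary, one possible workaround (my own suggestion, not part of the original answer) is to fill the missing requirements before comparing, here treating a missing requirement as "never met":

import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['fruit', 'books', 'fruit'],
                   'sales': [100, 200, 300]})
required_sales = {'fruit': 150}  # 'books' is intentionally missing

# map returns NaN for 'books'; fillna(np.inf) makes the comparison always False there
df['met_sales'] = df['sales'] >= df['category'].map(required_sales).fillna(np.inf)
print(df)
#   category  sales  met_sales
# 0    fruit    100      False
# 1    books    200      False
# 2    fruit    300       True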
  • Often, my task requires me to look up other DataFrames or data structures to update my DataFrame. Is it correct to say that I should not use any self-created functions at all, and instead use a mixture of .map(), creating new columns, and boolean logic to modify my DataFrame? – Gen Tan Oct 24 '19 at 06:17
  • @GenTan - hmmm, it depends. If performance matters or the data is large (10k+ rows), then it is better to use `map`, because Python-level functions are slow and map is faster. – jezrael Oct 24 '19 at 06:20
  • If there are always only a few rows, it is up to you, but [here](https://stackoverflow.com/a/24871316/2901002) you can check a general table of priorities for pandas operations. – jezrael Oct 24 '19 at 06:23
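
To make the performance point in the comments concrete, here is a rough sketch of a comparison (my own illustration, not from the original post; %timeit requires IPython/Jupyter and the numbers vary by machine):

import numpy as np
import pandas as pd

required_sales = {'fruit': 150, 'books': 200}

# A larger frame makes the difference visible
big = pd.DataFrame({'category': np.random.choice(['fruit', 'books'], size=100_000),
                    'sales': np.random.randint(0, 400, size=100_000)})

# Vectorized: map the dictionary onto the column and compare whole Series at once
%timeit big['sales'] >= big['category'].map(required_sales)

# Row-wise: a Python-level function called once per row, typically much slower
%timeit big.apply(lambda row: row['sales'] >= required_sales[row['category']], axis=1)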