
I would like to apply a different function to each row in the dataframe, depending on which category the row is in.

def weigh_by_education(row):
    if education_income.education_level == 'College':
        return row / 1013
    if education_income.education_level == 'Doctorate':
        return row / 451
    if education_income.education_level == 'Graduate':
        return row / 3128
    if education_income.education_level == 'High School':
        return row / 2013
    if education_income.education_level == 'Post-Graduate':
        return row / 516
    if education_income.education_level == 'Uneducated':
        return row / 1487
    else:
        return row / 1519

Here's my function. I applied it to my dataframe to try to create a new column called percent: the number of users in each education_level, weighted by the total number of people in that category.

education_income['percent'] = education_income['user_id'].apply(lambda row: weigh_by_education(row))

However, each time it throws ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

My original dataframe is grouped by the columns education_level and income_category, and the values column holds the user counts. I want to weigh the user counts by the total number of people in each education_level category. What can I do?

Terry Jung
  • You apply your function to the 'user_id' column only, via education_income['user_id']. Is this what you really want? – IoaTzimas Dec 19 '20 at 12:51
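To illustrate the comment above, here is a minimal sketch of why the `if` raises. The three-row frame is made up, since the question doesn't show the real data. Inside `apply` on a single column, `row` is one scalar value, while `education_income.education_level == 'College'` compares the whole column and yields a boolean Series, which `if` can't reduce to a single True/False.

```python
import pandas as pd

# Hypothetical toy data; the real frame is not shown in the question.
education_income = pd.DataFrame({
    "education_level": ["College", "Doctorate", "College"],
    "user_id": [100, 45, 200],
})

# Comparing the whole column gives one boolean per row, not a single bool.
mask = education_income.education_level == "College"
print(mask.tolist())

# Using that Series as an `if` condition is what raises the ValueError.
try:
    if mask:
        pass
    raised = False
except ValueError as err:
    raised = True
    print(err)
```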

1 Answer


You might want to use numpy.select, which picks a value per row from the first condition that matches:

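import numpy as np

# One boolean mask per category (extend the list for the remaining levels)
conditions = [
    df.education_level == 'College',
    df.education_level == 'Doctorate'
]

# The value to use where the matching condition is True
values = [
    df.some_column / 1013,
    df.some_column / 451
]

# Rows that match no condition get the default, which is 0
df['percent'] = np.select(conditions, values)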
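Extending this to all the categories in the question could look like the sketch below. The toy frame and the column name `user_count` are assumptions, since the question doesn't show the real data. The divisors are the ones from the question's function, with 1519 used for any level not listed, mirroring its `else` branch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real grouped frame.
df = pd.DataFrame({
    "education_level": ["College", "Doctorate", "Graduate", "Unknown"],
    "user_count": [1013, 451, 1564, 1519],
})

# Category totals taken from the question's function.
totals = {
    "College": 1013,
    "Doctorate": 451,
    "Graduate": 3128,
    "High School": 2013,
    "Post-Graduate": 516,
    "Uneducated": 1487,
}
fallback_total = 1519  # the question's `else` branch

# Build one condition/value pair per category, plus a catch-all pair.
conditions = [df.education_level == level for level in totals]
values = [df.user_count / total for total in totals.values()]
conditions.append(~df.education_level.isin(list(totals)))
values.append(df.user_count / fallback_total)

df["percent"] = np.select(conditions, values)

# A different technique with the same result: look up each row's total
# with Series.map and divide once.
df["percent_map"] = df.user_count / df.education_level.map(totals).fillna(fallback_total)
```

The `map` variant is not what the answer proposes, but it avoids building a list of conditions when the only thing that changes per category is the divisor.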
Vishnudev Krishnadas