I have a dataset similar to this one:
   Mother ID  ChildID              ethnicity
0          1        1            White Other
1          2        2                 Indian
2          3        3                  Black
3          4        4                  Other
4          4        5                  Other
5          5        6  Mixed-White and Black
To simplify my dataset and make it more relevant to the classification I am performing, I want to group the ethnicities into 3 categories as follows:
- White: 'White British' and 'White Other'
- South Asian: 'Pakistani', 'Indian' and 'Bangladeshi'
- Other: 'Other', 'Black', 'Mixed-White and Black' and 'Mixed-White and South Asian'
So I want the above dataset to be transformed to:
   Mother ID  ChildID    ethnicity
0          1        1        White
1          2        2  South Asian
2          3        3        Other
3          4        4        Other
4          4        5        Other
5          5        6        Other
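For reproducibility, the two tables above can be built roughly like this. It is only a sketch: the values are copied from the sample rows, and the name `expected` is made up here for the target frame, it is not part of my real code.

```python
import pandas as pd

# Sample input, copied from the first table above.
df = pd.DataFrame({
    "Mother ID": [1, 2, 3, 4, 4, 5],
    "ChildID": [1, 2, 3, 4, 5, 6],
    "ethnicity": [
        "White Other", "Indian", "Black",
        "Other", "Other", "Mixed-White and Black",
    ],
})

# Desired result (hypothetical name): same rows, with ethnicity
# collapsed into the three categories listed earlier.
expected = df.assign(
    ethnicity=["White", "South Asian", "Other", "Other", "Other", "Other"]
)
```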
To do this, I ran the following code, which is similar to the one provided in this answer:
col = 'ethnicity'
conditions = [
    (df[col] in ('White British', 'White Other')),
    (df[col] in ('Indian', 'Pakistani', 'Bangladeshi')),
    (df[col] in ('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian')),
]
choices = ['White', 'South Asian', 'Other']
df["ethnicity"] = np.select(conditions, choices, default=np.nan)
But when I run this, I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any idea why I am getting this error? Am I not handling the string comparison correctly? I am using a similar technique to manipulate other features in my dataset and it is working fine there.
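A minimal sketch of what seems to be going on, on a three-row Series. It assumes the problem is Python's `in` operator, which asks for a single True/False for the whole Series instead of one value per row, and it swaps in `Series.isin` as one possible element-wise alternative. The names `s`, `raised` and `grouped` and the `"Unknown"` default are placeholders for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

s = pd.Series(["White Other", "Indian", "Black"], name="ethnicity")

# `s in (...)` compares the whole Series with each tuple element and then
# tries to turn the resulting boolean Series into one bool, which pandas
# refuses to do.
try:
    s in ("White British", "White Other")
    raised = False
except ValueError:
    raised = True

# Element-wise membership test: one boolean per row.
conditions = [
    s.isin(["White British", "White Other"]),
    s.isin(["Indian", "Pakistani", "Bangladeshi"]),
    s.isin(["Other", "Black", "Mixed-White and Black", "Mixed-White and South Asian"]),
]
choices = ["White", "South Asian", "Other"]

# "Unknown" is a placeholder default chosen for this sketch.
grouped = pd.Series(
    np.select(conditions, choices, default="Unknown"),
    index=s.index,
    name=s.name,
)
```

If that assumption holds, each entry of `conditions` is a boolean Series of the same length as the column, which is the shape `np.select` works with.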