
I have a dataset similar to this one:

    Mother ID ChildID    ethnicity
0     1       1          White Other
1     2       2          Indian
2     3       3          Black
3     4       4          Other
4     4       5          Other
5     5       6          Mixed-White and Black

To simplify my dataset and make it more relevant to the classification I am performing, I want to group the ethnicities into 3 categories as follows:

  1. White: within this category I will include 'White British' and 'White Other' values
  2. South Asian: the category will include 'Pakistani', 'Indian', 'Bangladeshi'
  3. Other: 'Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian' values

So I want the above dataset to be transformed to:

    Mother ID ChildID    ethnicity
0     1       1          White
1     2       2          South Asian
2     3       3          Other
3     4       4          Other
4     4       5          Other
5     5       6          Other

To do this I have run the following code, similar to the one provided in this answer:


    col         = 'ethnicity'
    conditions  = [ (df[col] in ('White British', 'White Other')),
                   (df[col] in ('Indian', 'Pakistani', 'Bangladeshi')),
                   (df[col] in ('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian'))]
    choices     = ['White', 'South Asian', 'Other']
        
    df["ethnicity"] = np.select(conditions, choices, default=np.nan)
    

But when running this, I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any idea why I am getting this error? Am I not handling the string comparison correctly? I am using a similar technique to manipulate other features in my dataset and it is working fine there.

sums22

2 Answers


I cannot work out why in fails here, but isin definitely solves the problem. Maybe someone else can explain why in does not work.

    conditions  = [ (df[col].isin(('White British', 'White Other'))),
                    (df[col].isin(('Indian', 'Pakistani', 'Bangladeshi'))),
                    (df[col].isin(('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian')))]
    print(conditions)
    choices     = ['White', 'South Asian', 'Other']

    df["ethnicity"] = np.select(conditions, choices, default=np.nan)
    print(df)

Output:

       Mother ID  ChildID    ethnicity
    0          1        1        White
    1          2        2  South Asian
    2          3        3        Other
    3          4        4        Other
    4          4        5        Other
    5          5        6          nan
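
For reference, here is a self-contained sketch of the same isin/np.select approach, with the question's six sample rows typed in by hand. Two things in it are assumptions rather than part of the original code: the DataFrame is rebuilt from the table in the question, and the default is the hypothetical placeholder string "Unknown" instead of np.nan, so that the choices and the default are all strings.

```python
import numpy as np
import pandas as pd

# Sample rows copied by hand from the question (assumed to match the real data).
df = pd.DataFrame({
    "Mother ID": [1, 2, 3, 4, 4, 5],
    "ChildID": [1, 2, 3, 4, 5, 6],
    "ethnicity": ["White Other", "Indian", "Black",
                  "Other", "Other", "Mixed-White and Black"],
})

col = "ethnicity"
# Each isin call returns one boolean per row, which is what np.select expects.
conditions = [
    df[col].isin(["White British", "White Other"]),
    df[col].isin(["Indian", "Pakistani", "Bangladeshi"]),
    df[col].isin(["Other", "Black", "Mixed-White and Black",
                  "Mixed-White and South Asian"]),
]
choices = ["White", "South Asian", "Other"]

# "Unknown" is a hypothetical placeholder for rows that match no condition.
df[col] = np.select(conditions, choices, default="Unknown")
print(df[col].tolist())
# ['White', 'South Asian', 'Other', 'Other', 'Other', 'Other']
```

With the values typed exactly as above, every row matches one of the three groups. If a row in real data falls through to the default, one possible cause is that its string does not match any listed value exactly, for example because of stray whitespace.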
Fangda Han
  • This does indeed fix the problem, similar to @jezrael's answer to this question: https://stackoverflow.com/questions/56170164/check-if-string-in-list-of-strings-is-in-pandas-dataframe-column – sums22 Nov 13 '20 at 12:28
  • Shouldn't your output have 'Other' for the 6th row here? – sums22 Jun 16 '21 at 11:03

With df[col] in some_tuple you are checking whether the Series df[col] itself is an element of some_tuple, which is not what you want. What you want is df[col].isin(some_tuple), which returns a new boolean Series of the same length as df[col].

So why do you get that error? Searching for a value in a tuple works more or less like this:

    for v in some_tuple:
        if df[col] == v:
            return True
    return False

  • df[col] == v evaluates to a Series (call it result); no problem here
  • Python then tries to evaluate if result: and you get the error, because a Series is being used as a condition, meaning that you are (implicitly) asking for the truth value of a whole Series; pandas does not allow this.
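
The failure described above can be reproduced in a few lines. This is a minimal sketch with made-up values; the names s, group, raised and mask are only for illustration.

```python
import pandas as pd

s = pd.Series(["White Other", "Indian", "Black"])
group = ("White British", "White Other")

# `s in group` compares s with each tuple element, gets a boolean Series back,
# and then needs a single True/False from it, which pandas refuses to give.
try:
    s in group
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# isin works element-wise and returns one boolean per row.
mask = s.isin(group)
print(mask.tolist())  # [True, False, False]
```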

For your problem, though, I would use Series.apply. It takes a function that maps one value to another; in your case, a function that maps each ethnicity to a category. There are many ways to define it (see the options below).


    import numpy as np
    import pandas as pd

    d = pd.DataFrame({
        'field': range(6),
        'ethnicity': list('ABCDE') + [np.nan]
    })

    # Option 1: define a dict {ethnicity: category}
    category_of = {
        'A': 'X',
        'B': 'X',
        'C': 'Y',
        'D': 'Y',
        'E': 'Y',
        np.nan: np.nan,
    }
    result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
    print(result)

    # Option 2: define categories, then "invert" the dict.
    categories = {
        'X': ['A', 'B'],
        'Y': ['C', 'D', 'E'],
        np.nan: [np.nan],
    }
    # If you do this frequently you could define a function invert_mapping(d):
    category_of = {eth: cat
                   for cat, values in categories.items()
                   for eth in values}
    result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
    print(result)

    # Option 3: define a function (a little less efficient)
    def ethnicity_to_category(ethnicity):
        if ethnicity in {'A', 'B'}:
            return 'X'
        if ethnicity in {'C', 'D', 'E'}:
            return 'Y'
        if pd.isna(ethnicity):
            return np.nan
        raise ValueError('unknown ethnicity: %s' % ethnicity)

    result = d.assign(category=d['ethnicity'].apply(ethnicity_to_category))
    print(result)
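
A possible variation on the dict-based options, sketched here as an alternative rather than as part of the original answer: Series.map accepts the dict directly, and any value without a key in the dict (including NaN) comes out as NaN, so no explicit np.nan entry is needed. The trade-off is that unknown ethnicities are silently turned into NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "field": range(6),
    "ethnicity": list("ABCDE") + [np.nan],
})

# No NaN key here: map leaves values missing from the dict as NaN.
category_of = {"A": "X", "B": "X", "C": "Y", "D": "Y", "E": "Y"}

result = d.assign(category=d["ethnicity"].map(category_of))
print(result)
```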
janluke
  • I understand what you are saying about evaluating a series as a boolean. But then why does this evaluation work for other features in my dataset? See this answer here: https://stackoverflow.com/questions/39109045/numpy-where-with-multiple-conditions/39111919#39111919 – sums22 Nov 13 '20 at 10:57
  • Also, how are NaNs handled in your code? In my code above, I handled them by passing default=np.nan in np.select. – sums22 Nov 13 '20 at 11:02
  • @sums22 You are not doing the same thing. If you use "in" you are calling a method of tuple that will result in the code I wrote above: somewhere in the code there will be an `if` that will have a series as a condition (the result of `series == value`). In that answer, you use operators like ">" that are defined in Series and will return another Series. There's no `if series:` involved. – janluke Nov 13 '20 at 16:34
  • @sums22 Handling NaNs is as easy as writing `category_of[np.nan] = np.nan`. Keep in mind that `apply` just wants a function that maps one value to another. My code is just an example. You can define that function in many ways. I'll expand my answer. – janluke Nov 13 '20 at 16:38
  • @sums22 So, to recap, `series in tuple` is not the operation you want: you don't want to know if the series is inside the tuple, you want another series that tells you if the elements of `series` are in the tuple; that's what `series.isin(tuple)` does. – janluke Nov 13 '20 at 16:41