0

I currently have a piece of code which uses np.where to check if each row in a dataframe matches certain conditions, then sets a new column value.

df['new_column'] = np.where(((df['col_1'].str.contains('a')) & (df['col_2'].str.contains('b')) | ((df['col_1'].str.contains('c')) & (df['col_2'].str.contains('d')), "Yes", "No")

As you can see, this isnt very clear, or maintainable as this definition of Yes and No for the new_column changes. I was thinking about using a dictionary to define what Yes and No would be and then use map for the new column, but not sure how to do this as Yes and No are based on multiple column values.

Any suggestions here for improving this existing code?

EDIT: @JANO - this is a slight improvement I think as its a little more structured and maintainable. Let me give an example of what I am trying to do. Basically, lets say we have the below dataframe, and we want to create a new column 'High Paying Job'.

       field       profession
0    Medical          Surgeon
1    Medical  Home Health Aid
2     Sports            Scout
3     Sports          Athlete
4  Education          Teacher
5  Education        Principal

If we make some assumptions, then I would want the new column to have the following values:

       field       profession high paying job
0    Medical          Surgeon             Yes
1    Medical  Home Health Aid              No
2     Sports            Scout              No
3     Sports          Athlete             Yes
4  Education          Teacher              No
5  Education        Principal             Yes
JSmth
  • 31
  • 1
  • 9
  • 'map definition' is unclear and buggy, because as well as just being equal to a single letter like 'a' or 'b' or 'c' or 'd', the string could be 'cad' or 'babbc' or 'a fish called wanda'. Or it could have leading and trailing whitespace or punctuation. Or be empty ''. If you only care about the case when it's guaranteed to be one single-letter value, just use [`isin(['a','b','c','d')`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) – smci Mar 13 '22 at 20:52
  • [Pandas: What is the difference between isin() and str.contains()?](https://stackoverflow.com/questions/53079487/pandas-what-is-the-difference-between-isin-and-str-contains) – smci Mar 13 '22 at 21:17

1 Answers1

0

Not completely sure if it is better but you can store the conditions as lambda functions. For example, you could store all conditions that lead to "Yes" in a list like this:

conditions = [
    lambda x: x['col_1'].str.contains('a') & x['col_2'].str.contains('b'),
    lambda x: x['col_1'].str.contains('c') & x['col_2'].str.contains('d')    
]

Afterwards, you can use a list comprehension to evaluate each condition and sum to aggregate the evaluations:

df['new_column'] = np.where(sum([x(df) for x in conditions]), "Yes", "No")

Not sure if this is what you had in mind, but it should have the same output as your code.

Edit: As OP did not add any test data I created a simple dataframe to check if solutions create the same output:

df = pd.DataFrame({'col_1': ['a', 'ab', 'b', 'a', 'bc', 'bad', '', 'abd','ab', ' x?sc'], 
                   'col_2': ['c', 'b', 'a', 'cd', 'd', 'abc', 'abcd','acd', 'bd', 'asnn.d']})


# Define the conditions that should result into true
df['new_column'] = np.where((df['col_1'].str.contains('a')) & (df['col_2'].str.contains('b')) | (df['col_1'].str.contains('c')) & (df['col_2'].str.contains('d')), "Yes", "No")

print(df.to_string())
   col_1   col_2 new_column
0      a       c         No
1     ab       b        Yes
2      b       a         No
3      a      cd         No
4     bc       d        Yes
5    bad     abc        Yes
6           abcd         No
7    abd     acd         No
8     ab      bd        Yes
9   x?sc  asnn.d        Yes
JANO
  • 2,995
  • 2
  • 14
  • 29
  • There is no need for lambdas and no need for four calls, `str.contains()` can do this all in one regex. – smci Mar 13 '22 at 21:19
  • 1
    There is certainly no need for lambdas. I only said it is a way OP can separate the conditions from the call. I am very interested in your solution that only needs a single regex for this task. I also created a sample dataframe to test it. Would you mind sharing your regex idea? – JANO Mar 13 '22 at 21:35
  • @JANO thanks for the comment here. Just wondering, what is the purpose of the sum call in the np.where? – JSmth Mar 17 '22 at 20:31
  • You're welcome. Using `sum` is a very nice trick to OR boolean conditions. In your example we evaluate the conditions (True or False), which can be represented as integers with the values zero and one (True is one). If we use `sum` on a list of booleans we effectively count the number of True values. So, three True values and two False values would result into the value `3`. Converting the value `3` back to boolean is then again True (everything except 0 is True). This True value is then used by `np.where()`. – JANO Mar 17 '22 at 23:11