Clean way to create column based on multiple conditions

Question

My question is simple - I have the following table:

+----------+-------+------------+--------+
| industry | class | occupation | value  |
+----------+-------+------------+--------+
|      170 |     4 |       1000 |  123.3 |
|      180 |     7 |       3600 | 4543.8 |
|      570 |     5 |        990 |  657.4 |
+----------+-------+------------+--------+

I'd like to create a new column called "type". The value of this column is based on these multiple conditions

Class = 7 : QWE
Class = 8 : ASD
Class = 1 or 2 : ZXC
Class = 4, 5, or 6 AND Industry = 170-490 or 570-690 AND Occupation >=1000 : IOP
Class = 4, 5, or 6 AND Industry = 170-490 or 570-690 AND Occupation between 10-3540 : JKL
Anything else : BNM

The resulting table will look like this:

+----------+-------+------------+--------+------+
| industry | class | occupation | value  | type |
+----------+-------+------------+--------+------+
|      170 |     4 |       1000 |  123.3 | IOP  |
|      180 |     7 |       3600 | 4543.8 | QWE  |
|      570 |     5 |        990 |  657.4 | JKL  |
+----------+-------+------------+--------+------+

My first approach to this was basically creating multiple dataframes of each type using the dataframe query method. However, I found out about the numpy "where" method and I am currently using a nested version of that to create the "type" column in one step. However, I feel like it's un-readable and I can imagine a situation where there are even more conditions that would make this process look really messy. Is there a cleaner way of doing this? Maybe with a dictionary or something?

does this question help? https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas — nahusznaj, Jun 26 '18 at 15:00
Try defining a function which returns the appropriate label for each condition and use `pandas.DataFrame.apply` — Gonzalo Donoso, Jun 26 '18 at 15:12

user3483203 · Accepted Answer · 2018-06-26T15:40:01.407

6

Set up your conditions and outputs and store in lists:

a = df['class'].eq(7)  
b = df['class'].eq(8)  
c = df['class'].isin([1,2])    
helper = df['class'].isin([4,5,6]) & (df.industry.isin(range(170, 491)) | df.industry.isin(range(570, 691)))
d =  helper & df.occupation.ge(1000)
e = helper & df.occupation.isin(range(10, 3541))

conds = [a, b, c, d, e]
outs = ['QWE', 'ASD', 'ZXC', 'IOP', 'JKL']

Use np.select. Just note that you have overlapping conditions, so there is a possibility for ambiguity between IOP and JKL

df['out'] = np.select(conds, outs, default='BNM')

   industry  class  occupation   value  out
0       170      4        1000   123.3  IOP
1       180      7        3600  4543.8  QWE
2       570      5         990   657.4  JKL

edited Jun 26 '18 at 15:40

answered Jun 26 '18 at 15:16

user3483203

50,081
9
65
94

1

np.select is exactly what I was looking for! Also, you're right about the overlapping conditions - the conditions are a bit more complex in my actual code but I just wanted to simplify it and get the idea across. Thank you! – jerbear Jun 26 '18 at 15:26
1

Glad to help. Just in the case you do have overlapping conditions, `np.select` will choose the first match it finds in your list of conditions. Happy programming! – user3483203 Jun 26 '18 at 15:27

Clean way to create column based on multiple conditions

1 Answers1