1

My question is simple - I have the following table:

+----------+-------+------------+--------+
| industry | class | occupation | value  |
+----------+-------+------------+--------+
|      170 |     4 |       1000 |  123.3 |
|      180 |     7 |       3600 | 4543.8 |
|      570 |     5 |        990 |  657.4 |
+----------+-------+------------+--------+

I'd like to create a new column called "type". The value of this column is based on these multiple conditions

  • Class = 7 : QWE
  • Class = 8 : ASD
  • Class = 1 or 2 : ZXC
  • Class = 4, 5, or 6 AND Industry = 170-490 or 570-690 AND Occupation >=1000 : IOP
  • Class = 4, 5, or 6 AND Industry = 170-490 or 570-690 AND Occupation between 10-3540 : JKL
  • Anything else : BNM

The resulting table will look like this:

+----------+-------+------------+--------+------+
| industry | class | occupation | value  | type |
+----------+-------+------------+--------+------+
|      170 |     4 |       1000 |  123.3 | IOP  |
|      180 |     7 |       3600 | 4543.8 | QWE  |
|      570 |     5 |        990 |  657.4 | JKL  |
+----------+-------+------------+--------+------+

My first approach to this was basically creating multiple dataframes of each type using the dataframe query method. However, I found out about the numpy "where" method and I am currently using a nested version of that to create the "type" column in one step. However, I feel like it's un-readable and I can imagine a situation where there are even more conditions that would make this process look really messy. Is there a cleaner way of doing this? Maybe with a dictionary or something?

jerbear
  • 361
  • 5
  • 14

1 Answers1

6

Set up your conditions and outputs and store in lists:

a = df['class'].eq(7)  
b = df['class'].eq(8)  
c = df['class'].isin([1,2])    
helper = df['class'].isin([4,5,6]) & (df.industry.isin(range(170, 491)) | df.industry.isin(range(570, 691)))
d =  helper & df.occupation.ge(1000)
e = helper & df.occupation.isin(range(10, 3541))

conds = [a, b, c, d, e]
outs = ['QWE', 'ASD', 'ZXC', 'IOP', 'JKL']

Use np.select. Just note that you have overlapping conditions, so there is a possibility for ambiguity between IOP and JKL

df['out'] = np.select(conds, outs, default='BNM')

   industry  class  occupation   value  out
0       170      4        1000   123.3  IOP
1       180      7        3600  4543.8  QWE
2       570      5         990   657.4  JKL
user3483203
  • 50,081
  • 9
  • 65
  • 94
  • 1
    np.select is exactly what I was looking for! Also, you're right about the overlapping conditions - the conditions are a bit more complex in my actual code but I just wanted to simplify it and get the idea across. Thank you! – jerbear Jun 26 '18 at 15:26
  • 1
    Glad to help. Just in the case you do have overlapping conditions, `np.select` will choose the first match it finds in your list of conditions. Happy programming! – user3483203 Jun 26 '18 at 15:27