1

Trying to incorporate the following case statement in my code.

d1['newcol'] = np.where((d1['business_name'] == 'HELP')&(d1['level'] == 'Component'), d1['component'], np.where((d1['business_name'] == 'HELP')&(d1['level'] == 'Package'),d1['package'], np.where(d1['business_name'] == 'PHELPS',d1['reporting']))) 

Or the case pseudocode would look like:

   newcol=case WHEN ((business_name = 'HELP' AND level = 'Component')) THEN component WHEN ((business_name = 'HELP' AND level = 'Package')) THEN package WHEN (business_name = 'PHELS') THEN reporting END

the error Im getting is:

ValueError: either both or neither of x and y should be given

any help would be appreciated.

excelguy
  • 1,574
  • 6
  • 33
  • 67
  • 1
    You may want to consider using [`numpy.select`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html) as an alternative to chaining `where` statements. – pault Nov 25 '19 at 20:54
  • To your specific error- ```np.where``` takes either 1 condition, then it returns Boolean matrix, or 3 (of the same shape). So to fix it- just add `````` part to your ```numpy.where``` (make it ```zeros()``` for instance). – Grzegorz Skibinski Nov 25 '19 at 21:27
  • which part? would you want to submit an answer? – excelguy Nov 25 '19 at 21:38
  • 1
    This is 100% `np.select`: see the [accepted answer](https://stackoverflow.com/a/19913845/9081267), the second part where `np.select` is used – Erfan Nov 26 '19 at 12:15

2 Answers2

2

You might want to apply a lambda function across all rows of the dataframe.

df['newcol'] = df.apply(lambda row: 
                        row['component'] if row['business_name'] == 'help' and row['level'] == 'Component' # 1st condition
                           else (row['package'] if row['business_name'] == 'HELP' and row['level'] == 'Package' # 2nd condition
                                 else (row['reporting'] if row['business_name'] == 'PHELPS' # 3rd condition
                                       else np.nan) # NAN if no conditions matched
                           )
                       , axis=1)
Brandon
  • 918
  • 6
  • 14
  • getting this error `ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0')`, going to keep testing. – excelguy Nov 25 '19 at 21:48
  • Edited. Value error should be fixed now – Brandon Nov 25 '19 at 22:02
  • This might be sufficient for you but using `apply` will be slower than using `np.select`, significantly so if your DataFrame is large. – pault Nov 26 '19 at 16:43
1

Here is an approach that can achieve the result. Using other sample data.

import pandas as pd
import numpy as np
from io import StringIO

# Sample data
df = pd.read_csv(StringIO(
"""a,b
1,1
1,2
2,1
2,2"""))

# Create new column using conditional indexing
df["c"] = ""
df.loc[df["a"] == 1, "c"] = "one something"
df.loc[(df["a"] == 2) & (df["b"] == 1), "c"] = "two one"
df.loc[(df["a"] == 2) & (df["b"] != 1), "c"] = "two something"
df

# Out[29]: 
#    a  b              c
# 0  1  1  one something
# 1  1  2  one something
# 2  2  1        two one
# 3  2  2  two something
Andrew Lavers
  • 4,328
  • 1
  • 12
  • 19