I am new to Python and am trying to figure out the best way to create a column based on other columns. Ideally, the code would look like this:

df['new'] = np.where(df['Country'] == 'CA', df['x'], df['y'])

I do not think this works, because it seems to treat each name as the entire column. I tried to do the same thing with `apply` but had trouble with the syntax:

df['my_col'] = df.apply(
    lambda row: 
    if row.country == 'CA':
        row.my_col == row.x
        else:
            row.my_col == row.y

I feel like there must be an easier way.

  • Are you sure the `np.where()` version doesn't work? – Barmar May 27 '22 at 00:31
  • I got the error `operands could not be broadcast together with shapes (550357,) (550357,2) (550357,2)`. Also, I was thinking that with more than one condition I would not be able to use `np.where`. – r-learning-machine May 27 '22 at 00:33
  • There is nothing wrong with your `np.where` code. Check that your actual code uses the same syntax as what you posted here. And don't use your second block of code. If you are new to Python and pandas, familiarize yourself with vectorized methods. – David Erickson May 27 '22 at 00:35
  • You can combine multiple conditions with `&` and `|`. – Barmar May 27 '22 at 00:37
  • Your lambda should be `lambda row: row.x if row.country == 'CA' else row.y`, but the `where` version should work. Remember that a lambda should have no side effects: it is just an expression that returns a value. – Tim Roberts May 27 '22 at 00:38
  • That error could not have been raised unless `df` isn't a data frame or those columns have nested objects. Please provide a [reproducible example](https://stackoverflow.com/q/20109391/1422451). – Parfait May 27 '22 at 00:39
  • Instead of writing `lambda`, it's always possible to just create an ordinary function and pass that. `lambda` is *just* a way to define simple functions (ones where the logic can be expressed as "compute the value of this expression and return it") without having to give them a name. – Karl Knechtel May 27 '22 at 00:58
  • Also, think carefully about how that function would work. The point of `.apply` is that the function you pass to it will `return` the new value, and **`.apply`** will build the result from those values. Don't put the modification logic inside the function. – Karl Knechtel May 27 '22 at 00:59
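
To illustrate the comment about `&` and `|`, here is a minimal sketch of `np.where` with combined conditions. The sample frame and the `x > 1` threshold are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the asker's real frame.
df = pd.DataFrame({
    "Country": ["CA", "US", "CA", "UK"],
    "x": [1, 2, 3, 4],
    "y": [6, 7, 8, 9],
})

# Wrap each comparison in parentheses: & and | bind tighter than == and >.
both = (df["Country"] == "CA") & (df["x"] > 1)               # element-wise AND
either = (df["Country"] == "CA") | (df["Country"] == "UK")   # element-wise OR

df["new_and"] = np.where(both, df["x"], df["y"])
df["new_or"] = np.where(either, df["x"], df["y"])
print(df)
```

Python's `and` and `or` keywords would not work here, because they try to reduce a whole Series to a single truth value.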

2 Answers

Any of these three approaches (np.where, apply, mask) seems to work:

df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']

Full test code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'country':['CA','US','CA','UK','CA'], 'x':[1,2,3,4,5], 'y':[6,7,8,9,10]})
print(df)

df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']
print(df)

Input:

  country  x   y
0      CA  1   6
1      US  2   7
2      CA  3   8
3      UK  4   9
4      CA  5  10

Output:

  country  x   y  where  apply  mask
0      CA  1   6      1      1   1.0
1      US  2   7      7      7   7.0
2      CA  3   8      3      3   3.0
3      UK  4   9      9      9   9.0
4      CA  5  10      5      5   5.0
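
The `mask` column prints as floats (`1.0`, `7.0`, ...), most likely because the first `.loc` assignment creates the column with NaN in the non-matching rows, which forces a float dtype. If that matters, one possible alternative is `Series.where`, sketched below on the same test frame; the column name `series_where` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["CA", "US", "CA", "UK", "CA"],
    "x": [1, 2, 3, 4, 5],
    "y": [6, 7, 8, 9, 10],
})

mask = df["country"] == "CA"

# Series.where keeps x where mask is True and takes y elsewhere,
# in a single step, so no NaN placeholder is ever created.
df["series_where"] = df["x"].where(mask, df["y"])
print(df["series_where"].dtype)
print(df)
```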
constantstranger

This might also work for you:

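import numpy as np
import pandas as pd

data = {
    'Country': ['CA', 'NY', 'NC', 'CA'],
    'x': ['x_column', 'x_column', 'x_column', 'x_column'],
    'y': ['y_column', 'y_column', 'y_column', 'y_column'],
}
df = pd.DataFrame(data)
# Each condition is paired with the choice at the same position;
# rows that match no condition get the default (df['y']).
condition_list = [df['Country'] == 'CA']
choice_list = [df['x']]
df['new'] = np.select(condition_list, choice_list, df['y'])
print(df)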
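
`np.select` is mainly useful once there is more than one branch. A small sketch, assuming a hypothetical second rule for `'US'` and a made-up fallback of `x + y` (neither is part of the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "Country": ["CA", "US", "MX", "CA"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [df["Country"] == "CA", df["Country"] == "US"]
choices = [df["x"], df["y"]]

# Rows matching neither condition fall back to the default.
df["new"] = np.select(conditions, choices, default=df["x"] + df["y"])
print(df)
```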

Your np.where() call looked fine, though, so I would double-check that your columns are labeled correctly.
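
One labeling problem that would be consistent with the shapes in the reported error, `(550357,) (550357,2) (550357,2)`, is duplicate column names: when a label appears twice, `df['x']` returns a two-column DataFrame instead of a Series. This is only a guess at the cause; the frame below is a made-up reproduction, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the labels 'x' and 'y' each appear twice.
df = pd.DataFrame(
    [["CA", 1, 6, 10, 60], ["US", 2, 7, 20, 70], ["CA", 3, 8, 30, 80]],
    columns=["Country", "x", "y", "x", "y"],
)

# With duplicate labels, df['x'] is 2-D while the condition is 1-D.
x_shape = df["x"].shape
cond_shape = (df["Country"] == "CA").shape

try:
    np.where(df["Country"] == "CA", df["x"], df["y"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# List the labels that occur more than once.
duplicated_labels = df.columns[df.columns.duplicated()].tolist()
print(x_shape, cond_shape)
print(error_message)
print(duplicated_labels)
```

Running `df.columns[df.columns.duplicated()]` on the real frame is a quick way to check for this.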

ArchAngelPwn