I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.

    import pandas as pd

    df = pd.DataFrame(
        [['a', 1], ['a', 1], ['a', 1], ['a', 2], ['b', 2], ['b', 2], ['c', 3]],
        columns=['letter', 'number']
    )

      letter  number
    0      a       1
    1      a       1
    2      a       1
    3      a       2
    4      b       2
    5      b       2
    6      c       3

I need a third column that is 1 if the letter is 'a' and the number is 2 in that row, and 0 otherwise. So it would be `[0,0,0,1,0,0,0]`.

How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
tawab_shakeel
max
  • If it's just that simple condition, you don't need `apply` here. `df['new_column'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)` – pault Nov 26 '18 at 20:10
  • This makes sense, but for even 3 or 4 columns with a condition, this would get unwieldy. Are there any alternatives? – max Nov 26 '18 at 20:16
  • There are alternatives, have you looked through the documentation? Your best bet is to try something and see if it fits your needs. – wwii Nov 26 '18 at 20:17
  • @max whether it be via `apply` or using boolean conditions, it will be about equally unwieldy (code-wise) but the latter will be much faster. – pault Nov 26 '18 at 20:24

2 Answers

You can use `apply` with `axis=1`. Suppose you want to call your new column `c`:

    # Assumes df matches the table in the question (every row has a number).
    df['c'] = df.apply(
        lambda row: (row['letter'] == 'a') and (row['number'] == 2),
        axis=1
    ).astype(int)

    print(df)
    #  letter  number  c
    #0      a       1  0
    #1      a       1  0
    #2      a       1  0
    #3      a       2  1
    #4      b       2  0
    #5      b       2  0
    #6      c       3  0

However, `apply` with `axis=1` calls a Python function once per row, so it is slow and should be avoided when possible. In this case, it is much better to use boolean logic operations, which are vectorized.

    df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)

This has the same result as using apply above.
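
If the number of conditions grows, one way to keep the vectorized version readable is to collect the boolean Series in a list and AND them together in one step. The following is only a sketch: the `flag` column is a hypothetical third column made up for illustration, and `np.logical_and.reduce` is just one of several ways to combine the masks.

```python
import numpy as np
import pandas as pd

# Same data as the table in the question.
df = pd.DataFrame(
    [['a', 1], ['a', 1], ['a', 1], ['a', 2], ['b', 2], ['b', 2], ['c', 3]],
    columns=['letter', 'number'],
)

# Hypothetical extra column, added only to show a longer condition list.
df['flag'] = [True, True, False, True, True, False, True]

# One boolean Series per condition; add or remove entries as needed.
conditions = [
    df['letter'] == 'a',
    df['number'] == 2,
    df['flag'],
]

# AND all conditions element-wise, then convert True/False to 1/0.
df['c'] = np.logical_and.reduce(conditions).astype(int)

print(df)
```

Each condition stays on its own line, so adding a fourth or fifth one does not make the expression harder to read.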

pault

You can also use `pd.Series.where()` or `np.where()`. If you are only interested in the int representation of the boolean values, the other answer is enough. If you want more freedom in choosing the if/else values, use `np.where()`:

    import pandas as pd
    import numpy as np

    # Build a random example frame.
    values = ['a', 'b', 'c']
    df = pd.DataFrame()
    df['letter'] = np.random.choice(values, size=10)
    # The upper bound of randint is exclusive, so the numbers are 1 or 2.
    df['number'] = np.random.randint(1, 3, size=10)

    # 1 where both conditions hold, 0 otherwise.
    df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
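
To make the "more freedom" point concrete, here is a small sketch on a fixed frame rather than random data. The column names `label`, `number_or_zero` and `tier` are made up for illustration, and `np.select` is shown as one possible extension for cases with more than two outcomes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'letter': ['a', 'a', 'b', 'c', 'a'],
    'number': [2, 1, 2, 1, 2],
})

mask = (df['letter'] == 'a') & (df['number'] == 2)

# np.where: any pair of values can be used for the true/false branches.
df['label'] = np.where(mask, 'match', 'other')

# Series.where keeps values where the mask is True and replaces the rest.
df['number_or_zero'] = df['number'].where(mask, 0)

# np.select: conditions are checked in order and the first match wins.
df['tier'] = np.select(
    [mask, df['letter'] == 'a'],
    ['a and 2', 'a only'],
    default='neither',
)

print(df)
```

Note that `Series.where` replaces values where the condition is False, which is the opposite orientation from the three-argument `np.where`.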
MisterMonk