18

Consider this Dataframe:

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 15, 20, 25, 30,35],
                   'C': [100, 150, 200, 250, 300, 350]})

This is the code to get values of column C, where it is the first row of each group (Column A):

firsts = df.groupby('A').first()['C']

So first will be: (100, 200, 300).

Now I want to add new column which it will be 1 if value of column C for row is in firsts otherwise it will be 0.

A B C D
1 10 100 1
1 15 150 0
2 20 200 1
2 25 250 0
3 30 300 1
3 35 350 0

I used this:

df['D'] = df['C'].apply(lambda x: 1 if x in firsts else 0)

But the output is:

A B C D
1 10 100 0
1 15 150 0
2 20 200 0
2 25 250 0
3 30 300 0
3 35 350 0

I appreciate if anyone explain why my solution is wrong and what is actual solution to this problem?

Kaveh
  • 4,618
  • 2
  • 20
  • 33
Mehrdad Salimi
  • 1,400
  • 4
  • 16
  • 31

4 Answers4

25

You can use isin method:

df['D'] = df.C.isin(firsts).astype(int)

df
#   A   B   C   D
#0  1   10  100 1
#1  1   15  150 0
#2  2   20  200 1
#3  2   25  250 0
#4  3   30  300 1
#5  3   35  350 0

The reason your approach fails is that python in operator check the index of a Series instead of the values, the same as how a dictionary works:

firsts
#A
#1    100
#2    200
#3    300
#Name: C, dtype: int64

1 in firsts
# True

100 in firsts
# False

2 in firsts
# True

200 in firsts
# False

Modifying your method as follows works:

firstSet = set(firsts)
df['C'].apply(lambda x: 1 if x in firstSet else 0)

#0    1
#1    0
#2    1
#3    0
#4    1
#5    0
#Name: C, dtype: int64
Psidom
  • 209,562
  • 33
  • 339
  • 356
5

TL;DR:

df['newColumn'] = np.where((df.compareColumn.isin(yourlist)), TrueValue, FalseValue)

Another one-step method would be to use np.where() and isin.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 15, 20, 25, 30,35],
                   'C': [100, 150, 200, 250, 300, 350]})

df['D'] = np.where((df.B.isin(firsts)), 1, 0)

We use the return from isin as the condition in np.where() to return either

  • 1 when True
  • 0 when False

and assign them to a new column in the same dataframe df['D'].

Note: np.where allows more complex conditions with bitwise operators and replacement cases, i.e. 'bypass' on False

df['col1'] = np.where(((df['col1'] == df['col2']) &
                       (~df['col1'].str.startswith('r'))),
                       'replace', df['col1'])

fielc92
  • 815
  • 10
  • 12
1

You can also do it in one step using .transform('first'):

In [280]: df['D'] = df.groupby('A')['C'].transform('first').eq(df['C']).astype(np.int8)

In [281]: df
Out[281]:
   A   B    C  D
0  1  10  100  1
1  1  15  150  0
2  2  20  200  1
3  2  25  250  0
4  3  30  300  1
5  3  35  350  0

Explanation: GroupBy.transform('func') returns us a vector with the same length as the original DF with applied func

In [14]: df.groupby('A')['C'].transform('first')
Out[14]:
0    100
1    100
2    200
3    200
4    300
5    300
Name: C, dtype: int64

In [15]: df.groupby('A')['C'].transform('max')
Out[15]:
0    150
1    150
2    250
3    250
4    350
5    350
Name: C, dtype: int64

In [16]: df.groupby('A')['C'].transform('min')
Out[16]:
0    100
1    100
2    200
3    200
4    300
5    300
Name: C, dtype: int64

In [17]: df.groupby('A')['C'].transform('mean')
Out[17]:
0    125
1    125
2    225
3    225
4    325
5    325
Name: C, dtype: int64

In [18]: df.groupby('A')['C'].transform('sum')
Out[18]:
0    250
1    250
2    450
3    450
4    650
5    650
Name: C, dtype: int64
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
-1

firsts is pandas series , so when we use in to search for value then it will search that value in index list to solve this we can convert firsts to list or array

%timeit df['D'] = df['C'].apply(lambda x: 1 if x in firsts.values else 0)

314 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

or

%timeit df['D'] = df['C'].apply(lambda x: 1 if x in list(firsts) else 0)

301 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

or

%timeit df['D'] = list(map(lambda x: 1 if x in list(firsts) else 0,list(df['C'])))

27.6 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Pujit
  • 11
  • 2