
I assumed this would be fairly straightforward, but apparently I'm missing something here.

I want to use np.where with df.groupby('Name').apply() to create a new column in df (call it 'New'), whose values are 1 if the row's index (the index in the original df) is greater than or equal to (>=) a particular value for that group, else 0.

For background, I am grouping df by the 'Name' column and I have a dict() that contains the corresponding value to use for each name from the groupby(). I hope that is clear, I can provide further clarification if necessary.

Here is what I have so far, given sample df:

df = pd.DataFrame([['William', 1, 0, 0, 0, 1],['James', 0, 1, 1, 1, 1],['James', 1, 0, 0, 0, 0],
                ['James', 1, 0, 1, 1, 0],['William', 0, 1, 1, 0, 1],['William', 0, 0, 0, 0, 0],
                ['William', 1, 0, 1, 1, 0],['James', 0, 1, 1, 0, 1],['James', 0, 0, 0, 0, 0]],
                columns=['Name','x1','x2','x3','x4','Interest'])

       Name  x1  x2  x3  x4  Interest
0  William   1   0   0   0         1
1    James   0   1   1   1         1
2    James   1   0   0   0         0
3    James   1   0   1   1         0
4  William   0   1   1   0         1
5  William   0   0   0   0         0
6  William   1   0   1   1         0
7    James   0   1   1   0         1
8    James   0   0   0   0         0

Then I am finding the last row in df for each group where the 'Interest' column has a 1, using:

mydict = df[df['Interest']==1].groupby('Name').apply(lambda x: x.index[-1]).to_dict()

{'James': 7, 'William': 4}

Note: This is a simplified example. In my actual application I am pulling the index of the 3rd-to-last row (i.e., .apply(lambda x: x.index[-3]).to_dict()); however, the next part is where the root of my question lies.

Now, I want to create a new column 'New', where the value is 1 if the row index is >= the value in mydict for that group, else 0. I've tried a few things:

for key, val in mydict.items():
    df['New'] = np.where((df['Name']==key) & (df.index>=val), 1, 0)

This obviously overwrites anything done for 'James' and only returns the correct column for 'William'. How can I do this efficiently?
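(For reference, one way to keep the loop structure is to accumulate the masks with |= instead of reassigning the column on each iteration; a minimal sketch using the sample df above:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['William', 1, 0, 0, 0, 1], ['James', 0, 1, 1, 1, 1],
                   ['James', 1, 0, 0, 0, 0], ['James', 1, 0, 1, 1, 0],
                   ['William', 0, 1, 1, 0, 1], ['William', 0, 0, 0, 0, 0],
                   ['William', 1, 0, 1, 1, 0], ['James', 0, 1, 1, 0, 1],
                   ['James', 0, 0, 0, 0, 0]],
                  columns=['Name', 'x1', 'x2', 'x3', 'x4', 'Interest'])
mydict = {'James': 7, 'William': 4}

# Start from an all-False mask and OR in each group's condition,
# so later iterations do not clobber earlier ones.
mask = pd.Series(False, index=df.index)
for key, val in mydict.items():
    mask |= (df['Name'] == key) & (df.index >= val)
df['New'] = mask.astype(int)
print(df['New'].tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```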

To be thorough, here is my expected output:

      Name  x1  x2  x3  x4  Interest  New
0  William   1   0   0   0         1    0
1    James   0   1   1   1         1    0
2    James   1   0   0   0         0    0
3    James   1   0   1   1         0    0
4  William   0   1   1   0         1    1
5  William   0   0   0   0         0    1
6  William   1   0   1   1         0    1
7    James   0   1   1   0         1    1
8    James   0   0   0   0         0    1

2 Answers


Use map:

df.assign(New=(df.index >= df.Name.map(mydict)).astype(int))

      Name  x1  x2  x3  x4  Interest  New
0  William   1   0   0   0         1    0
1    James   0   1   1   1         1    0
2    James   1   0   0   0         0    0
3    James   1   0   1   1         0    0
4  William   0   1   1   0         1    1
5  William   0   0   0   0         0    1
6  William   1   0   1   1         0    1
7    James   0   1   1   0         1    1
8    James   0   0   0   0         0    1
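(One caveat worth noting: map returns NaN for any name that is missing from mydict, and the >= comparison with NaN evaluates to False, so such rows get New == 0. A small sketch, using a hypothetical name 'Kate' that has no dict entry:)

```python
import pandas as pd

# 'Kate' is a hypothetical name with no entry in mydict
df = pd.DataFrame({'Name': ['William', 'James', 'Kate']})
mydict = {'James': 1, 'William': 0}

mapped = df['Name'].map(mydict)          # Kate maps to NaN
new = (df.index >= mapped).astype(int)   # NaN comparison is False -> 0
print(list(new))  # [1, 1, 0]
```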

Use a list comprehension to build a mask per dict entry, reduce the masks to one with logical OR, and finally convert to integer - Trues become 1s:

m = [((df['Name']==key) & (df.index>=val)) for key, val in mydict.items()]
print (m)
[0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True
Name: Name, dtype: bool, 0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8    False
Name: Name, dtype: bool]

df['New'] = np.logical_or.reduce(m).astype(int)
print (df)
      Name  x1  x2  x3  x4  Interest  New
0  William   1   0   0   0         1    0
1    James   0   1   1   1         1    0
2    James   1   0   0   0         0    0
3    James   1   0   1   1         0    0
4  William   0   1   1   0         1    1
5  William   0   0   0   0         0    1
6  William   1   0   1   1         0    1
7    James   0   1   1   0         1    1
8    James   0   0   0   0         0    1

EDIT:

Another solution for this problem. First double the sample data, so each group has enough Interest rows to demonstrate taking the 3rd value from the back:

df = pd.concat([df] * 2, ignore_index=True)

Get the index of the first True value per condition - here the value counted 3rd from the back:

idx = df[df['Interest']==1].groupby('Name').cumcount(ascending=False).eq(2).idxmax()

Set values from idx to end to 1:

df['New'] = 0
df.loc[idx:, 'New'] = 1
print (df)
       Name  x1  x2  x3  x4  Interest  New
0   William   1   0   0   0         1    0
1     James   0   1   1   1         1    0
2     James   1   0   0   0         0    0
3     James   1   0   1   1         0    0
4   William   0   1   1   0         1    1
5   William   0   0   0   0         0    1
6   William   1   0   1   1         0    1
7     James   0   1   1   0         1    1
8     James   0   0   0   0         0    1
9   William   1   0   0   0         1    1
10    James   0   1   1   1         1    1
11    James   1   0   0   0         0    1
12    James   1   0   1   1         0    1
13  William   0   1   1   0         1    1
14  William   0   0   0   0         0    1
15  William   1   0   1   1         0    1
16    James   0   1   1   0         1    1
17    James   0   0   0   0         0    1

Detail:

print (df[df['Interest']==1].groupby('Name').cumcount(ascending=False))
0     3
1     3
4     2
7     2
9     1
10    1
13    0
16    0
dtype: int64
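(For what it's worth, the idxmax step relies on a pandas detail: on a boolean Series, idxmax returns the label of the first True, since True is the maximum value and idxmax reports its first occurrence. A minimal sketch:)

```python
import pandas as pd

s = pd.Series([False, False, True, True], index=[10, 11, 12, 13])
first_true = s.idxmax()
print(first_true)  # 12

# Caveat: if no value is True, idxmax returns the first label,
# so an all-False mask would silently point at the first row.
```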
  • This solution worked just as I needed, though the solution from @piRSquared avoids the need for the list comprehension. I truly appreciate your answer, thanks! – rahlf23 Jun 29 '18 at 05:18
  • @rahlf23 - I found a better solution, please check the edited answer. You only need the index of the first True value. – jezrael Jun 29 '18 at 05:37