
I have the following DataFrame in pandas.

+-------+-------+
| Col_A | Col_B |
+-------+-------+
|  1234 |       |
|  6267 |       |
|  6364 |       |
|   573 |       |
|     0 |       |
|   838 |       |
|    92 |       |
|  3221 |       |
+-------+-------+

Col_B should be filled with True or False values: False by default, but once the first 0 has been "seen" in Col_A, the rest of the DataFrame should be True. The DataFrame has over 100,000 rows.

What is the fastest way to set Col_B to "True" from the first "0" value in Col_A onward?

+-------+--------+
| Col_A | Col_B  |
+-------+--------+
|  1234 | False  |
|  6267 | False  |
|  6364 | False  |
|   573 | False  |
|     0 | True   |
|   838 | True   |
|    92 | True   |
|  3221 | True   |
+-------+--------+
Pinky the mouse

6 Answers


Using `idxmax` with `loc` for assignment:

idx = df.Col_A.eq(0).idxmax()
df['Col_B'] = False
df.loc[idx:, 'Col_B'] = True

   Col_A  Col_B
0   1234  False
1   6267  False
2   6364  False
3    573  False
4      0   True
5    838   True
6     92   True
7   3221   True

Using `assign`:

This approach avoids modifying the original DataFrame.

df.assign(Col_B=(df.index >= idx))
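One caveat not covered in the answer itself: `idxmax` also returns the first index label when the mask contains no True at all, so a Col_A with no zero would wrongly flag every row True. A hedged sketch of a guard:

```python
# Hypothetical guard (not in the original answer): eq(0).idxmax() returns
# the FIRST index label even when no zero exists, which would flag every
# row True. Checking any() first avoids that edge case.
import pandas as pd

df = pd.DataFrame({'Col_A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

mask = df.Col_A.eq(0)
if mask.any():
    df['Col_B'] = df.index >= mask.idxmax()
else:
    df['Col_B'] = False
```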
user3483203

Using `eq` with `cummax`:

df.Col_A.eq(0).cummax()
Out[5]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
Name: Col_A, dtype: bool
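To produce the requested Col_B column, the boolean Series can simply be assigned back to the frame (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'Col_A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

# cummax on a boolean Series latches to True at the first True and stays True
df['Col_B'] = df.Col_A.eq(0).cummax()
```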
BENY

You can use NumPy's `accumulate` method of the ufunc `logical_or`:

import numpy as np

df.assign(Col_B=np.logical_or.accumulate(df.Col_A.values == 0))

   Col_A  Col_B
0   1234  False
1   6267  False
2   6364  False
3    573  False
4      0   True
5    838   True
6     92   True
7   3221   True
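For what it's worth, the same ufunc trick works on a bare NumPy array, independently of pandas (a small sketch):

```python
import numpy as np

a = np.array([1234, 6267, 6364, 573, 0, 838, 92, 3221])

# accumulate ORs each element into a running result, so the flag flips
# to True at the first zero and stays True for the rest of the array
flags = np.logical_or.accumulate(a == 0)
```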
piRSquared

You can use next with a generator expression. This will be more efficient in the case of a large series where 0 appears near the beginning.

@user3483203's NumPy-based solution should be fine for general use.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

idx = next((i for i, j in enumerate(df['A']) if j == 0), len(df['A']))

df['B'] = ~(df.index < idx)

# more verbose alternative:
# df['B'] = np.where(df.index < idx, False, True)

print(df)

      A      B
0  1234  False
1  6267  False
2  6364  False
3   573  False
4     0   True
5   838   True
6    92   True
7  3221   True
jpp

Many methods have been presented here, and I couldn't resist performing a little performance comparison between them:

%timeit vivek_kumar()
16.6 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit numbered_user()
6.69 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit warpri()
14 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit jpp()
2.21 ms ± 96.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit wen()
991 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pirsquared()
938 µs ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This comparison was performed on an 80,000-row dataframe in order to test scalability. It appears that Wen's and piRSquared's solutions are the most efficient. Do not hesitate to upvote those answers.
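The exact test data isn't shown in the post; a plausible reconstruction (my assumption: strictly positive random integers with a single zero planted halfway through) would be:

```python
import numpy as np
import pandas as pd

n = 80_000
rng = np.random.default_rng(0)
col_a = rng.integers(1, 10_000, size=n)  # values in [1, 9999]: no accidental zeros
col_a[n // 2] = 0                        # plant a single zero halfway through
df = pd.DataFrame({'Col_A': col_a})
```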

EDIT: for transparency, here are the functions used to perform the test:

def vivek_kumar():
    data = df.copy()
    first_index = data.loc[data['Col_A'] == 0, 'Col_A'].index[0]
    data.loc[:first_index, 'Col_B'] = False
    data.loc[first_index:, 'Col_B'] = True

def numbered_user():
    data = df.copy()
    idx = data.Col_A.eq(0).idxmax()
    data['Col_B'] = False
    data.loc[idx:, 'Col_B'] = True

def warpri():
    data = df.copy()
    def update_col_b(col_a):
        return col_a == 0
    data['Col_B'] = data.Col_A.apply(update_col_b)

def jpp():
    data = df.copy()
    idx = next((i for i, j in enumerate(data['Col_A']) if j == 0), len(data['Col_A']))
    data['Col_B'] = ~(data.index < idx)

def wen():
    data = df.copy()
    data['Col_B'] = data.Col_A.eq(0).cummax()

def pirsquared():
    data = df.copy()
    # This would return a copy.  My preferred approach
    # return data.assign(Col_B=np.logical_or.accumulate(data.Col_A.values == 0))
    # This edits the dataframe in place but properly compares against the other proposals
    data['Col_B'] = np.logical_or.accumulate(data.Col_A.values == 0)

EDIT2: Following piRSquared's suggestions, here is also a comparison between using `assign` to produce a copy of the dataframe and using `=` to modify the existing dataframe:

def pirsquared1():
    data = df.copy()
    data = data.assign(Col_B=np.logical_or.accumulate(data.Col_A.values == 0))

%timeit pirsquared1()
923 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

def pirsquared2():
    data = df.copy()
    data['Col_B'] = np.logical_or.accumulate(data.Col_A.values == 0)

%timeit pirsquared2()
598 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
ysearka
    Nice contribution. Double check that you are comparing apples to apples. What I presented was an assignment to the dataframe and @Wen presented a Series. What I'm suggesting is that the assignment of the series to the dataframe has some overhead that may not be accounted for in your test. – piRSquared Aug 27 '18 at 15:43
  • Yeah, that makes sense. When some answers choose to produce copies of dataframes versus adding (or overwriting) a column in the existing dataframe, that is a stylistic choice (most of the time). In order to compare, you'll want to put all of them on even footing. You can either change mine to edit in place or change Wen's (and the rest) to produce copies. – piRSquared Aug 27 '18 at 15:49
  • I added the functions I used for reproducibility. Actually I forgot to perform your assignment and corrected it. Note that on a dataframe this 'small', it doesn't impact the performance (computationally speaking, it might have an impact on space complexity I guess). – ysearka Aug 27 '18 at 15:51
  • I'll edit your post real quick to show what I mean. By all means, re-edit how you see fit.... Done. – piRSquared Aug 27 '18 at 15:52
  • Thanks for the editing, your point is somehow confusing me though. I think we don't have the same version of `pandas`, in mine the assignment isn't performed inplace for some reason (I am using pandas 0.22.0). – ysearka Aug 27 '18 at 16:00
  • My preferred approach is to produce a copy of a dataframe with the new or updated column. This is what `assign` does. It doesn't alter the calling dataframe. Or rather, it doesn't operate in-place. I choose to do this because I try to avoid side effects (i'm not perfect at this). The alternative choice would've been to set the column with the assignment operator `=`. Which is what is done `df['Col_B'] = some_values`. That is in-place and is quicker. It is quicker because you avoid the overhead of creating a new dataframe object. – piRSquared Aug 27 '18 at 16:04
  • OK, I think I see what you meant now. I tried it, made an edit, and it is indeed even faster than before, so you are now officially `the most efficient`! (that should be worth a badge). – ysearka Aug 27 '18 at 16:12

Find the index of the first 0 in col_A:

first_index = df['col_A'][df['col_A'] == 0].index[0]

Recommended way (Thanks @jpp):

first_index = df.loc[df['col_A'] == 0, 'col_A'].index[0]

Then, use it to fill other column:

df.loc[:first_index, 'col_B'] = False
df.loc[first_index:, 'col_B'] = True
Vivek Kumar
    Chained indexing is explicitly discouraged in the docs. Use `loc` instead. – jpp Aug 27 '18 at 15:18
  • @jpp Oh. Thanks. I was using [this answer as reference](https://stackoverflow.com/a/18327852/3374996). Can you please suggest the appropriate way? – Vivek Kumar Aug 27 '18 at 15:25
  • 1
    Sure, `df.loc[df['col_A'] == 0, 'col_A'].index[0]`. – jpp Aug 27 '18 at 15:25