
I have the following DataFrame in pandas.

+-------+-------+
| Col_A | Col_B |
+-------+-------+
|  1234 |       |
|  6267 |       |
|  6364 |       |
|   573 |       |
|     0 |       |
|   838 |       |
|    92 |       |
|  3221 |       |
+-------+-------+

Col_B should be filled with True or False values: False by default, but once the first 0 has been "seen" in Col_A, the rest of the DataFrame should be True. The DataFrame has over 100,000 rows.

What is the fastest way to set Col_B to "True" from the first "0" value in Col_A onward?

+-------+--------+
| Col_A | Col_B  |
+-------+--------+
|  1234 | False  |
|  6267 | False  |
|  6364 | False  |
|   573 | False  |
|     0 | True   |
|   838 | True   |
|    92 | True   |
|  3221 | True   |
+-------+--------+
Pinky the mouse

6 Answers


Using `idxmax` with `loc` for assignment:

idx = df.Col_A.eq(0).idxmax()
df['Col_B'] = False
df.loc[idx:, 'Col_B'] = True

   Col_A  Col_B
0   1234  False
1   6267  False
2   6364  False
3    573  False
4      0   True
5    838   True
6     92   True
7   3221   True

Using `assign`:

This approach avoids modifying the original DataFrame.

df.assign(Col_B=(df.index >= idx))
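One caveat not covered in the answer itself: `idxmax` also returns the first index label when the mask contains no True at all, so a Col_A with no zero would wrongly flag every row True. A hedged sketch of a guard:

```python
# Hypothetical guard (not in the original answer): eq(0).idxmax() returns
# the FIRST index label even when no zero exists, which would flag every
# row True. Checking any() first avoids that edge case.
import pandas as pd

df = pd.DataFrame({'Col_A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

mask = df.Col_A.eq(0)
if mask.any():
    df['Col_B'] = df.index >= mask.idxmax()
else:
    df['Col_B'] = False
```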
user3483203

Using `eq` with `cummax`:

df.Col_A.eq(0).cummax()
Out[5]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
Name: Col_A, dtype: bool
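To produce the requested Col_B column, the boolean Series can simply be assigned back to the frame (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'Col_A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

# cummax on a boolean Series latches to True at the first True and stays True
df['Col_B'] = df.Col_A.eq(0).cummax()
```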
BENY

You can use NumPy's `accumulate` method of the ufunc `logical_or`:

import numpy as np

df.assign(Col_B=np.logical_or.accumulate(df.Col_A.values == 0))

   Col_A  Col_B
0   1234  False
1   6267  False
2   6364  False
3    573  False
4      0   True
5    838   True
6     92   True
7   3221   True
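For what it's worth, the same ufunc trick works on a bare NumPy array, independently of pandas (a small sketch):

```python
import numpy as np

a = np.array([1234, 6267, 6364, 573, 0, 838, 92, 3221])

# accumulate ORs each element into a running result, so the flag flips
# to True at the first zero and stays True for the rest of the array
flags = np.logical_or.accumulate(a == 0)
```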
piRSquared

You can use next with a generator expression. This will be more efficient in the case of a large series where 0 appears near the beginning.

@user3483203's NumPy-based solution should be fine for general use.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1234, 6267, 6364, 573, 0, 838, 92, 3221]})

idx = next((i for i, j in enumerate(df['A']) if j == 0), len(df['A']))

df['B'] = ~(df.index < idx)

# more verbose alternative:
# df['B'] = np.where(df.index < idx, False, True)

print(df)

      A      B
0  1234  False
1  6267  False
2  6364  False
3   573  False
4     0   True
5   838   True
6    92   True
7  3221   True
jpp

Many methods have been presented here, and I couldn't resist performing a little performance comparison between them:

%timeit vivek_kumar()
16.6 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit numbered_user()
6.69 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit warpri()
14 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit jpp()
2.21 ms ± 96.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit wen()
991 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pirsquared()
938 µs ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This comparison was performed on an 80,000-row dataframe in order to test scalability. It appears that Wen's and piRSquared's solutions are the most efficient. Do not hesitate to upvote those answers.
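The exact test data isn't shown in the post; a plausible reconstruction (my assumption: strictly positive random integers with a single zero planted halfway through) would be:

```python
import numpy as np
import pandas as pd

n = 80_000
rng = np.random.default_rng(0)
col_a = rng.integers(1, 10_000, size=n)  # values in [1, 9999]: no accidental zeros
col_a[n // 2] = 0                        # plant a single zero halfway through
df = pd.DataFrame({'Col_A': col_a})
```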

EDIT: for transparency, here are the functions used to perform the test:

def vivek_kumar():
    data = df.copy()
    first_index = data.loc[data['Col_A'] == 0, 'Col_A'].index[0]
    data.loc[:first_index, 'Col_B'] = False
    data.loc[first_index:, 'Col_B'] = True

def numbered_user():
    data = df.copy()
    idx = data.Col_A.eq(0).idxmax()
    data['Col_B'] = False
    data.loc[idx:, 'Col_B'] = True

def warpri():
    data = df.copy()
    def update_col_b(col_a):
        return col_a == 0
    data['Col_B'] = data.Col_A.apply(update_col_b)

def jpp():
    data = df.copy()
    idx = next((i for i, j in enumerate(data['Col_A']) if j == 0), len(data['Col_A']))
    data['Col_B'] = ~(data.index < idx)

def wen():
    data = df.copy()
    data['Col_B'] = data.Col_A.eq(0).cummax()

def pirsquared():
    data = df.copy()
    # This would return a copy.  My preferred approach
    # return data.assign(Col_B=np.logical_or.accumulate(data.Col_A.values == 0))
    # This edits the dataframe in place but properly compares against the other proposals
    data['Col_B'] = np.logical_or.accumulate(data.Col_A.values == 0)

EDIT2: Following piRSquared's suggestions, here is also a comparison between using `assign` to produce a copy of the dataframe and using `=` to modify the existing dataframe:

def pirsquared1():
    data = df.copy()
    data = data.assign(Col_B=np.logical_or.accumulate(data.Col_A.values == 0))

%timeit pirsquared1()
923 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

def pirsquared2():
    data = df.copy()
    data['Col_B'] = np.logical_or.accumulate(data.Col_A.values == 0)

%timeit pirsquared2()
598 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
ysearka
    Nice contribution. Double check that you are comparing apples to apples. What I presented was an assignment to the dataframe and @Wen presented a Series. What I'm suggesting is that the assignment of the series to the dataframe has some overhead that may not be accounted for in your test. – piRSquared Aug 27 '18 at 15:43
  • Yeah, that makes sense. When some answers choose to produce copies of dataframes versus adding (or overwriting) a column in the existing dataframe, that is a stylistic choice (most of the time). In order to compare, you'll want to put all of them on even footing. You can either change mine to edit in place or change Wen's (and the rest) to produce copies. – piRSquared Aug 27 '18 at 15:49
  • I added the functions I used for reproducibility. Actually I forgot to perform your assignment and corrected it. Note that on a dataframe this 'small', it doesn't impact the performance (computationally speaking, it might have an impact on space complexity I guess). – ysearka Aug 27 '18 at 15:51
  • I'll edit your post real quick to show what I mean. By all means, re-edit how you see fit.... Done. – piRSquared Aug 27 '18 at 15:52
  • Thanks for the editing, your point is somehow confusing me though. I think we don't have the same version of `pandas`, in mine the assignment isn't performed inplace for some reason (I am using pandas 0.22.0). – ysearka Aug 27 '18 at 16:00
  • My preferred approach is to produce a copy of a dataframe with the new or updated column. This is what `assign` does. It doesn't alter the calling dataframe. Or rather, it doesn't operate in-place. I choose to do this because I try to avoid side effects (i'm not perfect at this). The alternative choice would've been to set the column with the assignment operator `=`. Which is what is done `df['Col_B'] = some_values`. That is in-place and is quicker. It is quicker because you avoid the overhead of creating a new dataframe object. – piRSquared Aug 27 '18 at 16:04
  • OK, I think I see what you meant now. I tried it, made an edit, and it is indeed even faster than before, so you are now officially `the most efficient`! (that should be worth a badge). – ysearka Aug 27 '18 at 16:12

Find the index of the first 0 in col_A:

first_index = df['col_A'][df['col_A'] == 0].index[0]

Recommended way (Thanks @jpp):

first_index = df.loc[df['col_A'] == 0, 'col_A'].index[0]

Then, use it to fill other column:

df.loc[:first_index, 'col_B'] = False
df.loc[first_index:, 'col_B'] = True
Vivek Kumar
    Chained indexing is explicitly discouraged in the docs. Use `loc` instead. – jpp Aug 27 '18 at 15:18
  • @jpp Oh. Thanks. I was using [this answer as reference](https://stackoverflow.com/a/18327852/3374996). Can you please suggest the appropriate way? – Vivek Kumar Aug 27 '18 at 15:25
  • 1
    Sure, `df.loc[df['col_A'] == 0, 'col_A'].index[0]`. – jpp Aug 27 '18 at 15:25