Pandas "diff()" with string

Question

How can I flag a row in a dataframe every time a column changes its string value?

Ex:

Input

ColumnA   ColumnB
1            Blue
2            Blue
3            Red
4            Red
5            Yellow


# diff() won't work here with strings; it only works on numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()

Desired output


ColumnA   ColumnB      changed
1            Blue         0
2            Blue         0
3            Red          1
4            Red          0
5            Yellow       1

Performance note: It might be better to simply use `np.bool` type instead of integers. `np.bool` takes up a single byte. I suppose you could use `np.int8` but by default `np.int32` or `np.int64` (whatever a C long is on your system) is used, I believe... – juanpa.arrivillaga Oct 31 '16 at 18:58
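
To make the dtype point in the comment above concrete, here is a minimal sketch that compares the per-element size of a boolean flag column with `int8` and `int64` versions. The dataframe is rebuilt from the question's example; the plain `bool` dtype is used instead of `np.bool` because the latter is not available under that name in every NumPy version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Boolean flag: True where the value differs from the previous row.
changed = df["ColumnB"].ne(df["ColumnB"].shift())
changed.iloc[0] = False  # the first row has no predecessor

# Bytes used per element for a few candidate dtypes.
itemsizes = {
    "bool": changed.dtype.itemsize,
    "int8": changed.astype(np.int8).dtype.itemsize,
    "int64": changed.astype(np.int64).dtype.itemsize,
}
print(itemsizes)  # {'bool': 1, 'int8': 1, 'int64': 8}
```

A boolean column can still be summed or cast later, so keeping it as `bool` loses nothing if the 0/1 integers are not strictly required.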

root · Accepted Answer · 2016-10-31T19:18:22.027

36

I get better performance with `ne` instead of using the actual `!=` comparison:

df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
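
As a self-contained sketch of the line above, assuming the dataframe from the question: `shift()` moves every value down one row, `bfill()` fills the resulting gap in the first row with the next value, and `ne` then compares each row with its predecessor.

```python
import pandas as pd

df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Previous row's value; the first row is back-filled with its own value,
# so it is never flagged as a change.
previous = df["ColumnB"].shift().bfill()

# 1 where the value differs from the previous row, 0 otherwise.
df["changed"] = df["ColumnB"].ne(previous).astype(int)
print(df["changed"].tolist())  # [0, 0, 1, 0, 1]
```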

Timings

Using the following setup to produce a larger dataframe:

df = pd.concat([df]*10**5, ignore_index=True)

I get the following timings:

%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop

%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop

%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop

%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
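
`%timeit` is an IPython magic, so the timings above cannot be reproduced in a plain script as written. A rough sketch using the standard `timeit` module follows; the reduced repetition count and the `candidates` dictionary are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})
# Smaller than the 10**5 repetitions used above, to keep the sketch quick.
df = pd.concat([base] * 10**3, ignore_index=True)

candidates = {
    "ne + shift": lambda: df["ColumnB"].ne(df["ColumnB"].shift()).astype(int),
    "!= + shift": lambda: (df["ColumnB"] != df["ColumnB"].shift()).astype(int),
    "ne + shift + bfill": lambda: df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int),
}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    results[name] = best * 1e3
    print(f"{name}: {results[name]:.3f} ms per call")
```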

edited Oct 31 '16 at 19:18

answered Oct 31 '16 at 19:06

root


Please can you add timings for `(df.ColumnB.ne(df.ColumnB.shift())).astype(int)` ? – jezrael Oct 31 '16 at 19:14
@jezrael: Added the timing. Using `ix` to make the first row 0 adds ~1 ms to the timing, so it looks to be fastest that way. – root Oct 31 '16 at 19:19
Hi, I am using this answer in my script but it raises a 'SettingWithCopyWarning'. Do you see that too? dff['changed'] = dff.col1.ne(dff.col1.shift(1)) – user466130 Aug 11 '18 at 04:00
@root How do I get the state changes, that is `Blue -> Red`, `Red -> Yellow`, in the same sequence as they were detected? – Santhosh Dhaipule Chandrakanth Aug 19 '19 at 14:18
@root Can I directly know the change in state from `Blue` to `Yellow` in spite of having `Red` in the middle? – Santhosh Dhaipule Chandrakanth Aug 19 '19 at 14:25
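
The last two comments ask for the transitions themselves rather than a flag. One possible sketch, assuming the question's dataframe: keep the shifted column next to the original and select only the rows where they differ. The names `transitions`, `pairs`, and `states` are illustrative and not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

previous = df["ColumnB"].shift()
# A change needs a predecessor, so the first row is excluded explicitly.
is_change = previous.notna() & df["ColumnB"].ne(previous)

# One row per detected change, in the order the changes occur.
transitions = pd.DataFrame({
    "from": previous[is_change],
    "to": df.loc[is_change, "ColumnB"],
})
pairs = list(zip(transitions["from"], transitions["to"]))
print(pairs)  # [('Blue', 'Red'), ('Red', 'Yellow')]

# Sequence of states, one entry per run of equal values.
run_starts = is_change.copy()
run_starts.iloc[0] = True  # the first row always starts a run
states = df.loc[run_starts, "ColumnB"].tolist()
print(states)  # ['Blue', 'Red', 'Yellow']
```

With `states` in hand, the overall change from the first to the last state (`Blue` to `Yellow` here) is simply `states[0]` and `states[-1]`, regardless of what lies in between.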

Kartik · Answer 2 · 2016-10-31T18:50:55.430

10

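Use .shift and compare each value with the one in the previous row. The fillna call fills the first row's missing predecessor with the row's own value, so the first row is never flagged; `!=` is used so that True marks a change:

dataframe['changed'] = dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])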
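
The direction of the comparison matters here: `==` against the shifted column marks rows that stayed the same, while `!=` marks rows that changed. A small sketch showing both on the question's data, assuming the goal is the `changed` column from the question:

```python
import pandas as pd

dataframe = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Shift down one row; the first row becomes NaN, which fillna replaces with
# the row's own value so that it never compares as different from itself.
previous = dataframe["ColumnB"].shift(1).fillna(dataframe["ColumnB"])

unchanged = dataframe["ColumnB"] == previous  # True where the value stayed the same
changed = dataframe["ColumnB"] != previous    # True where the value changed

print(unchanged.tolist())  # [True, True, False, True, False]
print(changed.tolist())    # [False, False, True, False, True]

# Integer flags, matching the desired output in the question.
dataframe["changed"] = changed.astype(int)
```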

edited Oct 31 '16 at 18:50

answered Oct 31 '16 at 18:47

Kartik


very clean answer – guilhermecgs Oct 31 '16 at 18:50

score 7 · Answer 3 · edited May 23 '17 at 12:02

7

Comparing with shift works for me. The flag in the first row is then set to 0, because shift leaves NaN there and the row has no previous value to compare with:

df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
df.loc[0, 'diff'] = 0  # .ix was removed in pandas 1.0; .loc works with the default index
print(df)
   ColumnA ColumnB  diff
0        1    Blue     0
1        2    Blue     0
2        3     Red     1
3        4     Red     0
4        5  Yellow     1

Edit: based on the timings in another answer, the fastest option is to use ne:

df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
df.loc[0, 'diff'] = 0  # .ix was removed in pandas 1.0; .loc works with the default index
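
Resetting the first row by the label 0 assumes the default integer index. Below is a minimal sketch that resets it by position with `.iloc` instead, which also avoids the `.ix` indexer (deprecated in pandas 0.20 and removed in 1.0). The letter index is a hypothetical example, not part of the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ColumnA": [1, 2, 3, 4, 5],
        "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
    },
    index=list("abcde"),  # hypothetical non-default index
)

# The first row compares against NaN and is therefore flagged as 1 at first.
df["diff"] = df["ColumnB"].ne(df["ColumnB"].shift()).astype(int)

# Reset the first row by position, independent of the index labels.
df.iloc[0, df.columns.get_loc("diff")] = 0
print(df["diff"].tolist())  # [0, 0, 1, 0, 1]
```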

edited May 23 '17 at 12:02

Community


answered Oct 31 '16 at 18:49

jezrael


I wonder, is there a performance difference between this approach and simply using `!=`? – juanpa.arrivillaga Oct 31 '16 at 18:50
@jezrael How can I do the same thing based on two columns? – User7777 Oct 09 '19 at 23:23
@Navroop - do you think `df[['ColumnA','ColumnB']].ne(df[['ColumnA','ColumnB']].shift()).any(axis=1).astype(int)` ? – jezrael Oct 10 '19 at 05:08
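
A runnable sketch of the two-column suggestion from the last comment. The data below is hypothetical and chosen so that each column changes at a different row; a row is flagged when either column differs from the previous row.

```python
import pandas as pd

# Hypothetical data: ColumnA changes at row 2, ColumnB at row 3.
df = pd.DataFrame({
    "ColumnA": [1, 1, 2, 2, 2],
    "ColumnB": ["Blue", "Blue", "Blue", "Red", "Red"],
})

cols = ["ColumnA", "ColumnB"]
# Compare both columns with the previous row; any(axis=1) flags a row
# when at least one of the columns changed.
changed = df[cols].ne(df[cols].shift()).any(axis=1).astype(int)
changed.iloc[0] = 0  # the first row has no predecessor

print(changed.tolist())  # [0, 0, 1, 1, 0]
```

Replacing `any(axis=1)` with `all(axis=1)` would flag only rows where both columns changed at once.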
