
I want to return unique rows across multiple columns in a df. The catch is that a repeated set of values should still be kept as long as it does not match the immediately preceding row. This is a little hard to explain, so I'll show it:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2', '2019-08-02 09:50:10.3',
             '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.5', '2019-08-02 09:50:10.6',
             '2019-08-02 09:50:10.7', '2019-08-02 09:50:10.8', '2019-08-02 09:50:10.9',
             '2019-08-02 09:50:11.0'],
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B'],
    'Code2': ['B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
    'Code3': [np.nan, np.nan, 'C', 'C', 'A', 'A', 'C', 'C', 'C', 'C'],
})


df = df[df.iloc[:, 1:].shift().ne(df.iloc[:, 1:]).any(axis=1)].reset_index(drop=True)
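Running this keeps a duplicate of the first row, because pandas' `ne` treats NaN as unequal to NaN, so the Code3 column makes the second row look like a change. A trimmed-down reproduction of the problem:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2',
             '2019-08-02 09:50:10.3', '2019-08-02 09:50:10.4'],
    'Code1': ['A', 'A', 'B', 'B'],
    'Code2': ['B', 'B', 'A', 'A'],
    'Code3': [np.nan, np.nan, 'C', 'C'],
})

out = df[df.iloc[:, 1:].shift().ne(df.iloc[:, 1:]).any(axis=1)].reset_index(drop=True)
# rows 0, 1 and 2 are kept -- row 1 survives only because NaN != NaN,
# even though its codes are identical to row 0's
```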

Intended Output:

                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B   NaN
1  2019-08-02 09:50:10.3     B     A     C
2  2019-08-02 09:50:10.5     C     B     A
3  2019-08-02 09:50:10.7     A     B     C
4  2019-08-02 09:50:10.9     B     A     C
jonboy

2 Answers


First we use `iloc` to select the code columns, then we use `shift` to compare each row against the previous one. Finally we use `any` over `axis=1` (columns), so a row is kept whenever at least one column changed. Note that the comparison is element-wise per column, so `A B C` and `B A C` count as different rows even though they share the same values:

df[df.iloc[:, 1:].shift().ne(df.iloc[:, 1:]).any(axis=1)]

Or the same, but written a little less concisely:

mask = df.iloc[:, 1:].shift().ne(df.iloc[:, 1:])
df[mask.any(axis=1)]
                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B   NaN
1  2019-08-02 09:50:10.2     A     B   NaN
2  2019-08-02 09:50:10.3     B     A     C
4  2019-08-02 09:50:10.5     C     B     A
6  2019-08-02 09:50:10.7     A     B     C
8  2019-08-02 09:50:10.9     B     A     C

One caveat: row 1 slips through, because `ne` reports True even when both sides are NaN. To match the intended output exactly you also need NaN handling, e.g. the sentinel trick in the other answer.
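To make the mask itself visible, here is a small sketch on just the code columns (note the Code3 column in row 1, where `ne` reports True for a pair of NaNs):

```python
import numpy as np
import pandas as pd

codes = pd.DataFrame({
    'Code1': ['A', 'A', 'B'],
    'Code2': ['B', 'B', 'A'],
    'Code3': [np.nan, np.nan, 'C'],
})

mask = codes.shift().ne(codes)
print(mask)
# row 0 is all True (shift pushes NaN into the comparison),
# row 1 is True only in Code3 (NaN != NaN),
# row 2 is all True because every code changed
```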
Erfan

You can use a sentinel value for the nulls (so that two NaNs compare equal), then group on the result.

The groupby is a variant of the shift-cumsum pattern.

cols = ['Code1', 'Code2', 'Code3']
sentinel = 99999999
mask = df[cols].fillna(sentinel).ne(df[cols].fillna(sentinel).shift())

df.groupby(
    mask.any(axis=1).cumsum(),
    sort=False,
    as_index=False
).first()
                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B   NaN
1  2019-08-02 09:50:10.3     B     A     C
2  2019-08-02 09:50:10.5     C     B     A
3  2019-08-02 09:50:10.7     A     B     C
4  2019-08-02 09:50:10.9     B     A     C
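If picking a sentinel feels fragile (it must never collide with a real value in the data), a sketch of an alternative is to build the same change mask but explicitly treat a pair of NaNs as equal; the `cols`/`changed` names below are my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2', '2019-08-02 09:50:10.3',
             '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.5', '2019-08-02 09:50:10.6',
             '2019-08-02 09:50:10.7', '2019-08-02 09:50:10.8', '2019-08-02 09:50:10.9',
             '2019-08-02 09:50:11.0'],
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B'],
    'Code2': ['B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
    'Code3': [np.nan, np.nan, 'C', 'C', 'A', 'A', 'C', 'C', 'C', 'C'],
})

cols = ['Code1', 'Code2', 'Code3']
cur, prev = df[cols], df[cols].shift()
# a cell counts as "changed" unless the values are equal or both are NaN
changed = cur.ne(prev) & ~(cur.isna() & prev.isna())
out = df[changed.any(axis=1)].reset_index(drop=True)
```

This keeps exactly the five rows from the intended output, NaN in the first row included.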
Alexander