There are a couple of ways to find out the difference between two rows in a dataframe.
Option 1:
for col in df.columns[2:]: #check column by column from 3rd column thru the end
if df[col].nunique() > 1:
print (col, df[col].unique())
This will print all the columns that have different values. If nunique()
is more than 1, then there is more than one value in that column. However, this will NOT tell you which row has the difference.
Option 2:
An alternate way is to do a df[col].shift()
and compare against the previous row. If there is a difference, then keep tab of that. Do the same comparison for each row. Consolidate all the differences and you will get a list of all rows that have at least one value different between the next row.
To do this, you can do as follows:
import pandas as pd
df = pd.DataFrame({'col1':[1,1,1,1,1,1,1,1],
'col2':[2,2,2,2,2,2,2,2],
'col3':[3,3,3,3,3,3,3,4],
'col4':[4,4,4,4,4,4,5,4],
'col5':[5,5,5,5,5,1,5,5]})
print (df)
df['Differs'] = False #set all rows to matched
for col in df.columns[2:]: #check column by column from 3rd column thru the end
#if df[col].nunique() > 1:
#print (col, df[col].unique())
df['newcol1'] = df[col].shift() != df[col] #check against next row. True if differs
df.loc[:0,'newcol1'] = False # Tweak the first row as it should be ignored
df.loc[df['newcol1'] == True,'Differs'] = True #if any row matched, set Differs to True
print (df[df['Differs']]) #print all rows that has a different value in at least one column
In the above example, rows 3, 4, 5 have at least one value different from the previous row.
For the below given dataframe:
col1 col2 col3 col4 col5
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 1
6 1 2 3 5 5
7 1 2 4 4 5
The output will be:
col1 col2 col3 col4 col5 Differs newcol1
5 1 2 3 4 1 True True
6 1 2 3 5 5 True False
7 1 2 4 4 5 True False
Using the two techniques, I ran the comparison.
Option 1 result:
All these columns have more than one value. The values are in the list next to column name.
AFreezeSetP [-39.666669 -39.333334]
AFreezeUserT_1K [23 19]
UserPmp1_On [0 1]
RunTempRegKp [3.111111 3.444445]
RunTempRegTi [399 398]
RunTempRegTd [99 96]
RegSetP [-27.333334 -26.888891]
CoolSetP [-27.333334 -26.888891]
AFreezeUserDiff [29.833334 29.777778]
AFreezeDiff [2. 1.944445]
W_OutTempUserPrb.Val [-24.488002 -24.478 ]
DscgP_Prb_Circ1.Val [-4.287679 -4.291988]
W_OutTempUser [-24.488002 -24.478 ]
DscgP_Circ1 [-4.287679 -4.291988]
SuctTempCirc1 [-174.8 -174.7]
RegTypStartup [1 2]
RegTypRun [1 0]
SuctSH_Circ1 [225.2 225.3]
UserPmp1_Dout.Val [0 1]
UserPmp1_Aout.Val [ 0. 100.]
UserPmp1HrsThrsh [4377 4378]
HiW_TempStartupDT [59 55]
HiW_TempRunDT [181 186]
HiW_TempOfs [11.166667 11.444445]
DscgP_Circ2 [-4.287679 -4.291988]
SuctTempCirc2 [-174.8 -174.7]
DscgP_Prb_Circ2.Val [-4.287679 -4.291988]
SuctSH_Circ2 [225.2 225.3]
WorkingHours.UserPmp1Starts [0 1]
W_UserTempReg [-24.488002 -24.478 -78.805 ]
At_SP_Dout.Val [0 1]
SonicDensitySensor.SonicDensity_1.EnSensor [0 1]
Option 2 result is all rows except row #1. That tells me there are at least one value different between each row (1 thru 15). You can tweak my code to find out specific rows and columns that differ for each column.
TIME EVENT ... Differs newcol1
1 2020-12-15T17:23:55+01:00 NaN ... True True
2 2020-12-15T17:24:13+01:00 NaN ... True False
3 2020-12-15T17:24:24+01:00 NaN ... True False
4 2020-12-15T17:24:26+01:00 NaN ... True False
5 2020-12-15T17:24:29+01:00 NaN ... True False
6 2020-12-15T17:24:32+01:00 NaN ... True False
7 2020-12-15T17:24:35+01:00 NaN ... True False
8 2020-12-15T17:24:40+01:00 NaN ... True False
9 2020-12-15T17:24:42+01:00 NaN ... True False
10 2020-12-15T17:24:43+01:00 NaN ... True False
11 2020-12-15T17:24:53+01:00 NaN ... True False
12 2020-12-15T17:24:55+01:00 NaN ... True False
13 2020-12-15T17:25:01+01:00 NaN ... True False
14 2020-12-15T17:25:02+01:00 NaN ... True False