2

I have a large csv(15 rows and 2500 columns) and am trying to compare values in each row with the row above. To do this I split up each row into it's own dataframe with iloc from pandas. So now I have 15 dataframes that I can trying to compare to one another using compare(). The issue is I keep getting the error Can only compare identically-labeled DataFrame objects but when putting each dataframe into a csv they are all labeled the same thing so I figured maybe if I got rid of the labels that would work and tried this and that didn't work so I tried sorting the indexes as show here and I still get the same error. The CSV is filled mostly with floats and occasional NaN values for what that's worth.

I split it up with iloc using df_i = df.iloc[[i]] where i is replaced with 1-14 to get each of the rows as a dataframe.

Printing the dataframes gives me this output:

                        TIME  EVENT  Unld1Comp1Circ2_Dout.Val  ...  WorkingHours.Start_Count_14.Cnt  WorkingHours.Start_Count_15.Cnt  WorkingHours.Start_Count_16.Cnt
1  2020-12-15T17:23:55+01:00    NaN                         1  ...                                0                                0                                0

[1 rows x 2463 columns] 
                         TIME  EVENT  Unld1Comp1Circ2_Dout.Val  ...  WorkingHours.Start_Count_14.Cnt  WorkingHours.Start_Count_15.Cnt  WorkingHours.Start_Count_16.Cnt
2  2020-12-15T17:24:13+01:00    NaN                         1  ...                                0                                0                                0

[1 rows x 2463 columns]

Hercislife
  • 159
  • 2
  • 10
  • Can you show how did you split up each row with iloc? Maybe the problem is there – The Niv Dec 16 '20 at 21:52
  • @TheNiv I added that to the post. – Hercislife Dec 16 '20 at 22:06
  • Have you considered this approach? https://stackoverflow.com/questions/41399538/comparing-previous-row-values-in-pandas-dataframe It's likely easier. – Nick ODell Dec 16 '20 at 22:07
  • @NickODell That is definitely a good idea but unfortunately with there being 2500 columns I can't think of an easy way to do it for all of them. – Hercislife Dec 16 '20 at 22:11
  • Is it possible for you to share the raw data before you split the columns using iloc? There has to be a simpler way. Are you trying to compare rows for columns 3 thru 2463 ? – Joe Ferndz Dec 16 '20 at 22:23
  • maybe instead of asking for labels you should rather ask how to efficiently compare rows. Maybe there is better method then splitting to dataframes. and then your question is only [XY Problem](https://en.wikipedia.org/wiki/XY_problem) – furas Dec 16 '20 at 22:33
  • @JoeFerndz By share the raw data you mean upload a csv? If so sure, here is a link to an empty gitlab page that just has the [CSV](https://gitlab.com/greenbeast/carel_pandas/-/blob/master/RetainLog.csv). I am trying to going down each row and comparing every column to the row before it. So basically if column 1200 in row 3 has a different value than that of column 1200 in row 2 or row 4 it'll tell me. I am entirely open to new ideas but most of what I saw online had hardcoding the columns which isn't feasible in this case. – Hercislife Dec 16 '20 at 22:36
  • Thanks for sharing the link. I will download and work on this. – Joe Ferndz Dec 16 '20 at 23:29
  • should the comparison include column 2 (EVENT) ? – Joe Ferndz Dec 17 '20 at 00:36
  • I ran the script and it looks like there is a difference in many columns. So you will get all rows as output. For ex: df['AFreezeSetP'] has two values [-39.666669, -39.333334] – Joe Ferndz Dec 17 '20 at 01:48

1 Answers1

1

There are a couple of ways to find out the difference between two rows in a dataframe.

Option 1:

for col in df.columns[2:]: #check column by column from 3rd column thru the end

    if df[col].nunique() > 1:
        print (col, df[col].unique())

This will print all the columns that have different values. If nunique() is more than 1, then there is more than one value in that column. However, this will NOT tell you which row has the difference.

Option 2:

An alternate way is to do a df[col].shift() and compare against the previous row. If there is a difference, then keep tab of that. Do the same comparison for each row. Consolidate all the differences and you will get a list of all rows that have at least one value different between the next row.

To do this, you can do as follows:

import pandas as pd
df = pd.DataFrame({'col1':[1,1,1,1,1,1,1,1],
                   'col2':[2,2,2,2,2,2,2,2],
                   'col3':[3,3,3,3,3,3,3,4],
                   'col4':[4,4,4,4,4,4,5,4],
                   'col5':[5,5,5,5,5,1,5,5]})
print (df)

df['Differs'] = False #set all rows to matched 

for col in df.columns[2:]: #check column by column from 3rd column thru the end

    #if df[col].nunique() > 1:
        #print (col, df[col].unique())

    df['newcol1'] = df[col].shift() != df[col] #check against next row. True if differs
    df.loc[:0,'newcol1'] = False # Tweak the first row as it should be ignored
    
    df.loc[df['newcol1'] == True,'Differs'] = True #if any row matched, set Differs to True

print (df[df['Differs']]) #print all rows that has a different value in at least one column 

In the above example, rows 3, 4, 5 have at least one value different from the previous row.

For the below given dataframe:

   col1  col2  col3  col4  col5
0     1     2     3     4     5
1     1     2     3     4     5
2     1     2     3     4     5
3     1     2     3     4     5
4     1     2     3     4     5
5     1     2     3     4     1
6     1     2     3     5     5
7     1     2     4     4     5

The output will be:

   col1  col2  col3  col4  col5  Differs  newcol1
5     1     2     3     4     1     True     True
6     1     2     3     5     5     True    False
7     1     2     4     4     5     True    False

Using the two techniques, I ran the comparison.

Option 1 result:

All these columns have more than one value. The values are in the list next to column name.

AFreezeSetP [-39.666669 -39.333334]
AFreezeUserT_1K [23 19]
UserPmp1_On [0 1]
RunTempRegKp [3.111111 3.444445]
RunTempRegTi [399 398]
RunTempRegTd [99 96]
RegSetP [-27.333334 -26.888891]
CoolSetP [-27.333334 -26.888891]
AFreezeUserDiff [29.833334 29.777778]
AFreezeDiff [2.       1.944445]
W_OutTempUserPrb.Val [-24.488002 -24.478   ]
DscgP_Prb_Circ1.Val [-4.287679 -4.291988]
W_OutTempUser [-24.488002 -24.478   ]
DscgP_Circ1 [-4.287679 -4.291988]
SuctTempCirc1 [-174.8 -174.7]
RegTypStartup [1 2]
RegTypRun [1 0]
SuctSH_Circ1 [225.2 225.3]
UserPmp1_Dout.Val [0 1]
UserPmp1_Aout.Val [  0. 100.]
UserPmp1HrsThrsh [4377 4378]
HiW_TempStartupDT [59 55]
HiW_TempRunDT [181 186]
HiW_TempOfs [11.166667 11.444445]
DscgP_Circ2 [-4.287679 -4.291988]
SuctTempCirc2 [-174.8 -174.7]
DscgP_Prb_Circ2.Val [-4.287679 -4.291988]
SuctSH_Circ2 [225.2 225.3]
WorkingHours.UserPmp1Starts [0 1]
W_UserTempReg [-24.488002 -24.478    -78.805   ]
At_SP_Dout.Val [0 1]
SonicDensitySensor.SonicDensity_1.EnSensor [0 1]

Option 2 result is all rows except row #1. That tells me there are at least one value different between each row (1 thru 15). You can tweak my code to find out specific rows and columns that differ for each column.

                         TIME  EVENT  ...  Differs  newcol1
1   2020-12-15T17:23:55+01:00    NaN  ...     True     True
2   2020-12-15T17:24:13+01:00    NaN  ...     True    False
3   2020-12-15T17:24:24+01:00    NaN  ...     True    False
4   2020-12-15T17:24:26+01:00    NaN  ...     True    False
5   2020-12-15T17:24:29+01:00    NaN  ...     True    False
6   2020-12-15T17:24:32+01:00    NaN  ...     True    False
7   2020-12-15T17:24:35+01:00    NaN  ...     True    False
8   2020-12-15T17:24:40+01:00    NaN  ...     True    False
9   2020-12-15T17:24:42+01:00    NaN  ...     True    False
10  2020-12-15T17:24:43+01:00    NaN  ...     True    False
11  2020-12-15T17:24:53+01:00    NaN  ...     True    False
12  2020-12-15T17:24:55+01:00    NaN  ...     True    False
13  2020-12-15T17:25:01+01:00    NaN  ...     True    False
14  2020-12-15T17:25:02+01:00    NaN  ...     True    False
Dharman
  • 30,962
  • 25
  • 85
  • 135
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33
  • Thanks for this. I think I will have to modify the first option to see if I can find a way to have it give me the row and column that is changed because my boss doesn't want there to just be a true false because that means that for each instance you would have to go through the whole process of looking through the 2500 columns to find the difference. – Hercislife Dec 17 '20 at 20:54