Using Julia, is there a way to compare two DataFrames cell by cell and output the differences?

E.g.: [image]

The expected result would be a DataFrame of True/False values: [image]

Thanks in advance for help

NorthPole

2 Answers

AbstractDataFrame objects support broadcasting so you can just write:

julia> df1 .== df2
3×2 DataFrame
│ Row │ Col1 │ Col2 │
│     │ Bool │ Bool │
├─────┼──────┼──────┤
│ 1   │ 1    │ 1    │
│ 2   │ 1    │ 1    │
│ 3   │ 0    │ 1    │

or

julia> isequal.(df1, df2)
3×2 DataFrame
│ Row │ Col1 │ Col2 │
│     │ Bool │ Bool │
├─────┼──────┼──────┤
│ 1   │ 1    │ 1    │
│ 2   │ 1    │ 1    │
│ 3   │ 0    │ 1    │

The difference between == and isequal is how they treat missing values: if either cell is missing, == returns missing, whereas isequal always returns true or false (two missing cells compare as true).
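As a minimal sketch of that difference, assuming a recent DataFrames.jl release; the data frames `a` and `b` below are made up for illustration and are not the ones from the question:

```julia
using DataFrames

# Hypothetical data with a missing cell in the second row
a = DataFrame(Col1 = ["A", missing, "C"])
b = DataFrame(Col1 = ["A", missing, "D"])

eq = a .== b          # three-valued: the missing cell stays missing
ie = isequal.(a, b)   # always Bool: missing vs. missing counts as equal

eq.Col1               # true, missing, false
ie.Col1               # true, true, false
```

If the result is to be used as a filter mask, the `isequal` form is usually the more convenient one, since it never contains `missing`.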

The Matrix approach that Przemyslaw proposes ignores column names (and is in general expensive, because it copies the data). The second option proposed by Przemyslaw ignores column order in the data frames (in some cases you might actually want that) and does not check that both data frames have the same set of column names.
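If column names matter, one option is to check them explicitly before comparing cells. A minimal sketch, assuming a recent DataFrames.jl release; `df2` is deliberately built with its columns swapped, and `same_names_in_order` and `same_name_set` are hypothetical variable names used only for illustration:

```julia
using DataFrames

df1 = DataFrame(Col1=["A","B","C"], Col2=["X","Y","Z"])
df2 = DataFrame(Col2=["X","Y","Z"], Col1=["A","B","D"])  # same columns, different order

# Same names in the same positions?
same_names_in_order = names(df1) == names(df2)       # false

# Same names regardless of position?
same_name_set = Set(names(df1)) == Set(names(df2))   # true
```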

Bogumił Kamiński
  • Thank you Bogumil for the explanation. I did not know it's so easy in Julia. I was struggling with Python, especially when the column names differ between the two DataFrames; I was able to implement it in Python, but it's very slow – NorthPole Jun 27 '20 at 07:14
  • I see option 2 works only when the column names are the same in both data frames, but the Matrix approach works even when the column names are different – NorthPole Jun 27 '20 at 07:26
  • Yes - however, in your specification of desired output you provided a plot with column names, so I assumed you want to retain them. – Bogumił Kamiński Jun 27 '20 at 07:34

Basically you need to use .== in one of many ways.

using DataFrames
df1 = DataFrame(Col1=["A","B","C"],Col2=["X","Y","Z"])
df2 = DataFrame(Col1=["A","B","D"],Col2=["X","Y","Z"])

This is the shortest version:

julia> Matrix(df1) .== Matrix(df2)
3×2 BitArray{2}:
 1  1
 1  1
 0  1

In this approach you can use dimension dropping [:] to get the list of unmatched values:

julia> Matrix(df2)[:][(.!(Matrix(df1) .== Matrix(df2))[:])]
1-element Array{String,1}:
 "D"

If you want a DataFrame:

julia> DataFrame((n => df1[!,n] .== df2[!,n] for n in names(df2))...)
3×2 DataFrame
│ Row │ Col1 │ Col2 │
│     │ Bool │ Bool │
├─────┼──────┼──────┤
│ 1   │ 1    │ 1    │
│ 2   │ 1    │ 1    │
│ 3   │ 0    │ 1    │
Przemyslaw Szufel
  • Thank you Przemyslaw for the solution, I am using option 2. Is there a way to output at least one mismatched value, i.e. something like: DataFrame1 has Col1 = C but DataFrame2 has Col1 = D at row number 3? – NorthPole Jun 27 '20 at 07:22
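
One possible way to report mismatches like the one asked about in the comment above, as a sketch rather than a definitive solution. It assumes a recent DataFrames.jl release, that both data frames have the same column names and number of rows, and that at least one cell differs; `mismatches` and `report` are hypothetical names introduced for illustration:

```julia
using DataFrames

df1 = DataFrame(Col1=["A","B","C"], Col2=["X","Y","Z"])
df2 = DataFrame(Col1=["A","B","D"], Col2=["X","Y","Z"])

# One record per cell whose values differ (isequal treats missing == missing as equal)
mismatches = [(row = i, column = n, left = df1[i, n], right = df2[i, n])
              for n in names(df1) for i in 1:nrow(df1)
              if !isequal(df1[i, n], df2[i, n])]

# A vector of NamedTuples can be turned into a data frame directly
report = DataFrame(mismatches)
```

For the example data, `report` has a single row: row 3, column Col1, with "C" on the left and "D" on the right.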