I want to compare two data frames and check whether both have an identical set of columns. Is there a built-in function or a library for this in R? The values in these data frames might differ, but both will have columns with the same names and types.

I tried running identical and all_equal on mtcars and a replica dataframe:

duplicate <- mtcars

identical(mtcars, duplicate)    
[1] TRUE

all_equal(mtcars, duplicate)
[1] TRUE

Then I updated the mpg column of data.frame duplicate to have different values than mtcars:

duplicate$mpg <- as.numeric(scale(duplicate$mpg))

Then I ran the same commands again:

identical(mtcars, duplicate)

[1] FALSE

all_equal(mtcars, duplicate)
[1] "Rows in x but not y: 23, 1, 6, 14, 10, 12, 13, 17, 28, 32, 7[...]. Rows in y but not x: 12, 25, 1, 20, 30, 5, 14, 7, 11, 29, 21[...]. "

Now they are reported as non-identical dataframes.

I want a comparison that passes in this second case, where the values differ but the column names and their types are the same. Basically, I want to check whether both have the same schema.

Sam Firke
  • You can check with `identical(names(dat1), names(dat2))` and `identical(sapply(dat1, class), sapply(dat2, class))` – akrun Mar 24 '18 at 17:17
  • Based on your data, combining both logical conditions should work: `identical(names(mtcars), names(duplicate)) & identical(sapply(mtcars, class), sapply(duplicate, class)) # [1] TRUE`. You may also check `library(diffobj); diffObj(mtcars, duplicate)` – akrun Mar 24 '18 at 17:30
  • `all.equal(mtcars, duplicate, tolerance = Inf)` – rawr Mar 24 '18 at 17:30
  • @rawr: My understanding of the question (quite possibly incorrect) is that neither the interior values nor the number of rows were at issue. In any case, the questioner now has several choices for his purposes. – IRTFM Mar 24 '18 at 17:55
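
The two checks suggested in the comments above can be folded into a single helper. This is a minimal sketch, assuming a hypothetical function name `same_schema()` that is not part of base R or any package. It uses `lapply()` instead of `sapply()` so that columns with more than one class (for example `POSIXct`) are still compared element by element:

```r
# Hypothetical helper: TRUE when two data frames share column names
# (in the same order) and per-column classes. Values and row counts are ignored.
same_schema <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y))
  identical(names(x), names(y)) &&
    identical(lapply(x, class), lapply(y, class))
}

duplicate <- mtcars
duplicate$mpg <- as.numeric(scale(duplicate$mpg))
same_schema(mtcars, duplicate)             # TRUE: same names, same classes

duplicate$cyl <- as.character(duplicate$cyl)
same_schema(mtcars, duplicate)             # FALSE: cyl is now character
```

Using `&&` rather than `&` short-circuits the class comparison when the names already differ.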

2 Answers

I think the answer to the question of whether R has a "same-schema" function for dataframes is "probably not". R dataframes don't really have a database structure. @akrun gave you a two-part solution for testing the equality of names and classes. Another approach is to empty out the dataframes while preserving their column names and classes:

identical(duplicate[NA,][1,], mtcars[NA,][1,])
[1] TRUE

This checks not only names but also classes of the overall object and the classes of the underlying columns, as can be tested with:

my.schema <- mtcars[NA,][1,]
my.schema[['mpg']] <- NA_integer_

identical(duplicate[NA,][1,], my.schema)
[1] FALSE

Merely changing the class from double to integer caused identical to report non-identity. The identical function can be rather picky, and people have asked a fair number of SO questions about why FALSE is reported. Even attribute differences (which are often not "visible" in the print output of objects) will be reported as "different".
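
As a small illustration of that last point, here is a sketch in which the only difference between two objects is an extra attribute. The attribute name `note` is made up for the example:

```r
schema <- mtcars[0, ]
tagged <- schema
# Attach an arbitrary attribute; it is not shown when the data frame is printed
attr(tagged, "note") <- "an extra, normally invisible attribute"

identical(schema, tagged)                                  # FALSE
identical(sapply(schema, class), sapply(tagged, class))    # TRUE
```

The names-and-classes comparison ignores the stray attribute, while `identical` on the whole object does not.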

Another way (probably more elegant and intuitive) to create a "schema" for a dataframe would be to index the rows with 0:

mtcars[0,]
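
Zero-row slices like this can also be compared directly. The following is a sketch rather than a general-purpose schema test, since `identical` also compares the (empty) row names and any other attributes, so data frames whose row names are of different types may not compare as equal:

```r
duplicate <- mtcars
duplicate$mpg <- as.numeric(scale(duplicate$mpg))

# Zero-row slices keep column names and classes but drop all values
identical(mtcars[0, ], duplicate[0, ])   # TRUE
```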

sapply(mtcars[0,], class)
      mpg       cyl      disp        hp      drat        wt      qsec        vs 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       am      gear      carb 
"numeric" "numeric" "numeric" 
IRTFM
  • Yes. I think this is something often misunderstood by new R users: "R dataframes don't really have a database structure" – De Novo Mar 24 '18 at 19:02

compare_df_cols_same() from the janitor package checks whether data.frames have the same column names and that the classes of those columns match:

library(janitor)
duplicate <- mtcars
duplicate$mpg <- as.numeric(scale(duplicate$mpg))
compare_df_cols_same(mtcars, duplicate)
#> [1] TRUE

The related compare_df_cols(mtcars, duplicate) allows for more detailed comparison to see which columns do or don't match.
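
For instance, here is a sketch of how a mismatch might be inspected, assuming janitor is installed and using a deliberately altered copy of mtcars (the name `changed` is only for illustration):

```r
library(janitor)

changed <- mtcars
changed$cyl <- as.character(changed$cyl)   # deliberately change one column's class

# Overall yes/no answer; the classes of cyl no longer match
compare_df_cols_same(mtcars, changed)

# Restrict the detailed comparison to the columns that differ
compare_df_cols(mtcars, changed, return = "mismatch")
```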

Full disclosure: I maintain this package and am providing this answer since you asked if there's a library that contains exactly this function - and now there is.

Sam Firke
  • Great function. Worth adding that people coming here are probably really looking for this option (as I was): `compare_df_cols_same(mtcars, duplicate, bind_method = "rbind")` – user63230 Mar 01 '22 at 11:57