
Recently I learned that I can use identical or all.equal to check whether 2 data sets are identical.

Can I also use them to check whether 2 R programs are identical? Is there a better or more appropriate way than the one below?

program.1 <- readLines("c:/r stuff/test program 1.r")
program.2 <- readLines("c:/r stuff/test program 2.r")

identical(program.1, program.2)
all.equal(program.1, program.2)
isTRUE(all.equal(program.1, program.2))

Thank you for any thoughts or advice.

Here are the contents of the 2 test programs being compared:

a <- matrix(2, nrow=3, ncol=4)

b <- c(1,2,3,4,5,6,7,8,6,5,4,3,2)

table(b)

c <- runif(2,0,1)

a * b

# March 2012 Edit begins here #

Here is a small example program for which Josh's function below returns FALSE while identical and all.equal return TRUE. I name the two program files 'testa.r' and 'testb.r'.

set.seed(123)

y <- rep(NA, 10)

s <- matrix(ceiling(runif(10,0,100)), nrow=10, byrow=T)

a   <- 25
ab  <- 50
abc <- 75

for(i in 1:10) {
     if(s[i] >  a  & s[i] <= ab ) y[i] = 1
     if(s[i] >  ab & s[i] <= abc) y[i] = 2
}

s
y

Here is the R program I use to read the two files containing the above code.

program.1 <- readLines("c:/users/Mark W Miller/simple R programs/testa.r")

program.2 <- readLines("c:/users/Mark W Miller/simple R programs/testb.r")


identical(program.1, program.2)
all.equal(program.1, program.2)
isTRUE(all.equal(program.1, program.2))


parseToSame <- function(file1, file2) {
    a <- parse(file = file1)
    b <- parse(file = file2)
    attributes(a) <- NULL
    attributes(b) <- NULL
    identical(a,b)
}

parseToSame(

     "c:/users/Mark W Miller/simple R programs/testa.r",
     "c:/users/Mark W Miller/simple R programs/testb.r"

)
Mark Miller
  • What do you mean by "identical"? If you mean that the source code is literally the same, then you can just use `diff`. –  Feb 27 '12 at 23:09
  • I guess I do not know how to use 'diff'. When I try it with the above example I get an error message. By 'identical' I guess I mean that the two programs are exactly the same with different names. – Mark Miller Feb 27 '12 at 23:18
  • What error message do you get? –  Feb 27 '12 at 23:19
  • Error in diff.default(program.1, program.3, 0, 0) : 'lag' and 'differences' must be integers >= 1 – Mark Miller Feb 27 '12 at 23:21
  • 0_o Er...no, not the R function `diff`, the command line utility `diff`. If you're using Linux/Unix, type `man diff` on the command line (*not* in R). If you're using Windows, you can find `diff` as part of [GnuWin32](http://gnuwin32.sourceforge.net/) –  Feb 27 '12 at 23:23
  • I am using Windows 7. I see. I can try the cmd line outside R. – Mark Miller Feb 27 '12 at 23:25
  • I have to admit I'm befuddled that this got so many upvotes. Practically any decent text editor will let you run a "diff" on two files to show you _all_ the differences between the two. This really has nothing to do with the R language. Which differences matter is a much deeper question -- e.g. the "4" vs "4.00" noted below. – Carl Witthoft Feb 28 '12 at 01:21
  • you could `source()` each file, save all variables in two distinct environments, and compare each object in the two environments. Of course that won't work for the objects you simply print in the program; it's complementary to the answers you got so far. – baptiste Feb 28 '12 at 06:32
  • @CarlWitthoft The Question is whether, and if so how, one can do this in R. Which in and of itself is a reasonable Question; why move to a different tool if R can do a good job on the problem? Calling `diff` from within R would be the simplest approach but Josh's Answer is an interesting solution. – Gavin Simpson Feb 28 '12 at 09:34
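
baptiste's suggestion above (run each file, keep its variables in a separate environment, and compare the objects) could be sketched roughly as follows. The helper name `sameObjects` and the temporary demo files are hypothetical, not from the thread. As baptiste notes, values that a script only prints are not compared. This sketch also assumes the scripts are deterministic (no unseeded random numbers) and safe to execute.

```r
# Hypothetical helper (not from the thread): run each script in its own
# environment, then compare the objects each script leaves behind.
sameObjects <- function(file1, file2) {
    env1 <- new.env()
    env2 <- new.env()
    sys.source(file1, envir = env1)
    sys.source(file2, envir = env2)

    # Both scripts must create the same set of object names
    names1 <- sort(ls(env1, all.names = TRUE))
    names2 <- sort(ls(env2, all.names = TRUE))
    if (!identical(names1, names2)) return(FALSE)

    # Every pair of like-named objects must be identical.
    # Caveat: functions defined in the scripts compare as different,
    # because their enclosing environments differ.
    all(vapply(names1, function(nm) {
        identical(get(nm, envir = env1), get(nm, envir = env2))
    }, logical(1)))
}

# Demo files (assumed content, for illustration only)
f1 <- tempfile(fileext = ".r")
f2 <- tempfile(fileext = ".r")
f3 <- tempfile(fileext = ".r")
cat("x <- 1:3; y <- sum(x)\n", file = f1)
cat("x <- 1:3\n\ny   <-   sum(x)\n", file = f2)
cat("x <- 1:3\ny <- sum(x) + 1\n", file = f3)

sameObjects(f1, f2)  # same objects, different formatting
sameObjects(f1, f3)  # y differs
```

This compares what the programs compute rather than how they are written, so it complements a text or parse-tree comparison instead of replacing it.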

2 Answers


Here is a function that might be slightly more useful, in that it tests whether the two files parse to the same expression tree. (It will thus find the code in two files to be equivalent even if they have different formatting, additional blank lines and spaces, etc., as long as they parse to the same object.)

parseToSame <- function(file1, file2) {
    a <- parse(file = file1)
    b <- parse(file = file2)
    attributes(a) <- NULL
    attributes(b) <- NULL
    identical(a,b)
}

Here's a demo of the function in action:

# Create two files with same code but different formatting
tmp1 <- tempfile()
tmp2 <- tempfile()
cat("a <- 4; b <- 11; a*b \n", file = tmp1)
cat("a<-4

     b    <-    11 
     a*b \n", file = tmp2)

# Test out the two approaches
identical(readLines(tmp1), readLines(tmp2))
# [1] FALSE
parseToSame(tmp1, tmp2)
# [1] TRUE
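
A possible explanation for the FALSE result reported in the comments for code containing `for() { ... }`: when source references are kept (the default in interactive sessions), calls to `{` appear to carry `srcref` attributes that record where the code came from, and those can differ even when the code itself is the same. Stripping the top-level attributes does not remove them. A minimal variant, assuming that is the cause, asks `parse()` not to keep source references at all. The name `parseToSameNoSrc` and the demo files are hypothetical.

```r
# Hypothetical variant of parseToSame(): parse without source references,
# so that location information cannot influence the comparison.
parseToSameNoSrc <- function(file1, file2) {
    a <- parse(file = file1, keep.source = FALSE)
    b <- parse(file = file2, keep.source = FALSE)
    identical(a, b)
}

# Two files with the same loop, formatted differently
tmp3 <- tempfile()
tmp4 <- tempfile()
cat("for (i in 1:3) {\n    print(i)\n}\n", file = tmp3)
cat("for(i in 1:3){print(i)}\n", file = tmp4)

parseToSameNoSrc(tmp3, tmp4)
# [1] TRUE
```

Depending on the R version, `identical()` may already ignore source references by default, in which case the original function would behave the same way.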
Josh O'Brien
  • Thank you. I got your function to run with `parseToSame('c:/r stuff/test program 1.r', 'c:/r stuff/test program 2.r')`. I am also attempting to learn how to use the command-line utility diff, or 'fc.exe'. – Mark Miller Feb 27 '12 at 23:52
  • Cool beans. It's worth noting that the function will still be thrown off by inconsequential (to us) differences like `1:3` vs. `c(1,2,3)`, or `x<-4` vs. `x<-4L` (but not `x<-4` vs `x<-4.000`), so use it with care! – Josh O'Brien Feb 28 '12 at 00:01
  • I do not know what I am doing wrong, but when I try your function on an actual R program I am using in my research the three approaches in my original question all say TRUE and your function says FALSE. I simply opened a file and saved it with a new (albeit longer) name but made no other keystrokes and repeated the process three times. Sorry for my confusion. Thanks for the suggestion. – Mark Miller Feb 28 '12 at 00:42
  • Interesting. I'd suggest trying to: (1) read in my function; (2) do `debug(parseToSame)`; (3) call `parseToSame(f1, f2)` with your two files; (4) then step through the evaluation in my function to the last line, at which point you can type `a` and `b` (and/or `str(a)` and `str(b)`) to examine the parse trees of the two files, to see where they might differ. Hard for me to help much more than that without seeing the files themselves. Please let me know what you find. – Josh O'Brien Feb 28 '12 at 05:05
  • @JoshO'Brien I have finally taken a close look at my code that was returning FALSE with your function and TRUE with identical or all.equal. Then I modified and distilled my program down to a small, functional, reproducible example and have added that example to my post. I do not know why your function returns FALSE with this example, while identical and all.equal return TRUE. – Mark Miller Mar 31 '12 at 19:41
  • @MarkMiller -- Hi Mark. I did `debug(parseToSame)`, and then stepped through the computations, and have isolated the problem. It's the `for()` loop that's for some reason causing the problem. To see how, try this: `identical(expression(for(i in 1) {i}), expression(for(i in 1) {i}))`. It returns `FALSE`. Interestingly, the following, very similar call returns `TRUE`: `identical(expression(for(i in 1) i), expression(for(i in 1) i))`. If you learn anything about why those two give different results, do let me know! Cheers. – Josh O'Brien Mar 31 '12 at 22:14
  • +1. Just ran across this, and wanted to add that it is a brilliant example of a great answer. It addresses what it means to be "the same" relative to `R`, as opposed to a text editor. – Ricardo Saporta Mar 16 '13 at 23:34
  • @RicardoSaporta -- Thanks! It was nice to be brought back here after my recent exchange in comments on [this question](http://stackoverflow.com/questions/15368168/does-white-space-slow-down-processing/15368377#comment21717578_15368377). There I hit on a different approach to testing sameness, but the take-home message (about how R stores already-parsed function objects) is the same. – Josh O'Brien Mar 20 '13 at 02:17
  • Funny, I had that _same_ question in mind when I read this answer! – Ricardo Saporta Mar 20 '13 at 22:05

Yes, you can, but they might not be flexible enough for your needs. program.1 and program.2 would have to be exactly equal, with the same code on the same lines; no offsets would be allowed. @Jack Maney mentioned diff in the comments above, which allows more flexibility when identical lines are offset by one or more lines. Note that he means the standard diff utility, not the R function diff().

The reason the two would need to be exactly equal is that readLines() reads the file in as a character vector, with one string per line:

> con <- textConnection("foo bar foo\nbar foo bar")
> foo <- readLines(con)
> close(con)
> str(foo)
 chr [1:2] "foo bar foo" "bar foo bar"

identical() and all.equal() compare element 1 of program.1 with element 1 of program.2, and so on for all elements (lines). Even if the code were otherwise identical, a single extra blank line would shift every following line by one element. identical() would then return FALSE, and all.equal() would return a description of the mismatches rather than TRUE (so isTRUE(all.equal()) is FALSE), because the two character vectors no longer match element by element.
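
A small illustration of this, using text connections in place of files. The sample code lines and the helper `normalizeLines` are hypothetical. The helper shows one possible way to make a line-based comparison tolerate blank lines and leading or trailing whitespace, on the assumption that such differences do not matter for your purposes (it would, for example, also change whitespace at the edges of lines inside multi-line strings).

```r
# Same code, but the second version has an extra blank line
con1 <- textConnection("a <- 4\nb <- 11\na * b")
con2 <- textConnection("a <- 4\n\nb <- 11\na * b")
lines1 <- readLines(con1)
lines2 <- readLines(con2)
close(con1)
close(con2)

identical(lines1, lines2)          # FALSE: the vectors differ in length
all.equal(lines1, lines2)          # a description of the differences, not TRUE
isTRUE(all.equal(lines1, lines2))  # FALSE

# Hypothetical helper: trim whitespace and drop empty lines before comparing
normalizeLines <- function(x) {
    x <- trimws(x)
    x[nzchar(x)]
}

identical(normalizeLines(lines1), normalizeLines(lines2))  # TRUE
```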

Gavin Simpson