R: Determining whether data frames in a list are identical

Question

With a function f

f <- function(x) { data.frame(a=c(x, 2*x), b=c(2*x, 4*x)) }

we can construct two data frames

df1 <- f(5)
df2 <- f(5)

and want to confirm that they are equal. Because we ultimately want to obtain a Boolean, we use identical, and indeed

identical(df1, df2)

evaluates to TRUE.

Now we compute three terms

terms <- lapply(rep(5, 3), f)

and want to determine whether the three data frames are equal. We choose to compare with the first term

first.term <- terms[1]

and evaluate

lapply(terms,
       function(x) identical(x, first.term))

but we get three FALSEs, not three TRUEs. What am I missing?

@markus Please add your comment as an answer, then I'll roll back the update and accept that answer. As it is there are two questions, and the present answer satisfies neither. — Vrokipal, Jul 24 '18 at 20:10

score 1 · Answer 1 · answered Jul 17 '18 at 13:45

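To check every pairwise combination of data.frames in terms:

# all.equal returns TRUE or a character description of the differences,
# so wrap it in isTRUE to always get a single logical value
apply(combn(length(terms), 2), 2, function(x)
    isTRUE(all.equal(terms[[x[1]]], terms[[x[2]]])))
#[1] TRUE TRUE TRUE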

To return whether all data.frames in terms are equal:

# isTRUE guards against all.equal returning a character vector,
# which all() cannot handle
all(apply(combn(length(terms), 2), 2, function(x)
    isTRUE(all.equal(terms[[x[1]]], terms[[x[2]]]))))
#[1] TRUE

answered Jul 17 '18 at 13:45

Maurits Evers


Cool, but doesn't the use of `combn` provide a solution that takes O(n^2) for a problem that is O(n)? – Vrokipal Jul 17 '18 at 13:57
@Vrokipal, the problem is not O(n).... You need to compare the first data.frame with `n - 1` data.frames, the second with `n - 2`, etc. giving `n * (n - 1) / 2` comparisons – Emil Jul 17 '18 at 13:59
@Emil Can you provide any input in R for which the result of the expression ((a == b) & (a == c)) is not the same as ((a == b) & (a == c) & (b == c))? Here a, b, and c can be anything you want, not just data frames. For each type we'd use the appropriate comparison operator; hence for data frames we'd use `identical`, not `==`. – Vrokipal Jul 17 '18 at 14:02
@Vrokipal Yes the problem is `O(n^2)` if you check for all combinations. If you only compare to the first (like you mention in your post) it's `O(n)`. You already got the solution for the latter approach: `all(sapply(terms, function(x) all.equal(x, terms[[1]])))` – Maurits Evers Jul 17 '18 at 14:11
@Emil I'll take a stab answering my own question (the one in the comment). If we do a fuzzy comparison of data frames using `all.equal`, then the two terms do not yield the same result. It may well be that a is close to both b and c, but b and c are not within an acceptable threshold of each other. – Vrokipal Jul 17 '18 at 14:13
@MauritsEvers, I think you misunderstood why the OP says the problem is O(n). Because equality is transitive, it is only necessary to compare one element to the rest of the list. Comparing every possible combination creates many unnecessary checks. For example, with a collection of five, say `{a, b, c, d, e}`, one only needs 4 comparisons to determine if all elements are the same, not `choose(5, 2) = 10`. – Emil Jul 17 '18 at 14:17
@Emil Yes thanks for the clarification; I'm clear on the `O(n)` complexity, and that comparing to the first element suffices;-) OP already has an answer for that which I reiterated in my previous comment. In a wider context, performing all pairwise comparisons (which as you showed will be `O(n^2)`) might be useful to explore similarities between subsets of elements. I agree that this might go beyond what OP is after. That's why my first sentence of my answers starts with "To check for every combination ...". – Maurits Evers Jul 17 '18 at 14:31
@MauritsEvers, my mistake... I see where you are getting at now. – Emil Jul 17 '18 at 14:33
@Emil absolutely no need to apologise. I appreciate your clarification! – Maurits Evers Jul 17 '18 at 14:36
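
The point raised in the comments, that a fuzzy comparison with `all.equal` is not transitive, can be illustrated with a small sketch. The values `lo`, `mid`, `hi`, the tolerance `tol`, and the helper `near` are hypothetical names chosen for this example. Plain numbers stand in for data frames to keep it short.

```r
# Three numbers: mid is close to both lo and hi,
# but lo and hi are twice as far from each other.
mid <- 1
hi  <- 1 + 1e-8
lo  <- 1 - 1e-8

# Explicit tolerance (an assumption; close to all.equal's default).
tol <- 1.5e-8

# all.equal returns TRUE or a character description of the difference,
# so isTRUE turns the result into a plain logical.
near <- function(x, y) isTRUE(all.equal(x, y, tolerance = tol))

mid_hi <- near(mid, hi)  # TRUE: relative difference is about 1e-8
mid_lo <- near(mid, lo)  # TRUE: relative difference is about 1e-8
lo_hi  <- near(lo, hi)   # FALSE: relative difference is about 2e-8
```

With `identical`, which is an exact comparison, this situation cannot arise, so comparing each element to the first one is enough.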

score 0 · Accepted Answer · answered Jul 24 '18 at 21:18

The problem in OP's code was the use of `[` instead of `[[`. The former returns a list containing a data.frame while the latter returns that data.frame.

first.term <- terms[[1]]
lapply(terms, function(x) identical(x, first.term))
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] TRUE
#
#[[3]]
#[1] TRUE
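
As a minimal sketch of the difference between the two operators, the following rebuilds `terms` from the question and inspects what each form of indexing returns. The name `all_same` is introduced here for illustration only. It also shows one possible way to collapse the comparison into a single Boolean using `vapply`.

```r
# Rebuild the data from the question.
f <- function(x) data.frame(a = c(x, 2 * x), b = c(2 * x, 4 * x))
terms <- lapply(rep(5, 3), f)

# `[` keeps the list wrapper; `[[` extracts the element itself.
class(terms[1])    # "list"
class(terms[[1]])  # "data.frame"

# Compare every remaining element to the first one (n - 1 comparisons)
# and reduce the results to a single logical value.
all_same <- all(vapply(terms[-1], identical, logical(1), terms[[1]]))
```

Here `vapply` is used instead of `lapply` so that the result is a logical vector that `all` can reduce directly.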
