Why do "subset" and "[" on a dataframe give slightly different results?

Question

Could someone explain to me why I get different results in the last two apply() calls below? identical() reports the two objects as identical, but when I use them in an apply function, they behave differently:

df <- data.frame(a = 1:5, b = 6:2, c = rep(7,5))
df_ab <- df[,c(1,2)]
df_AB <- subset(df, select = c(1,2))
identical(df_ab,df_AB)
[1] TRUE

apply(df_ab,2,function(x) identical(1:5,x))
    a     b 
TRUE FALSE

apply(df_AB,2,function(x) identical(1:5,x))
    a     b 
FALSE FALSE

Joshua Ulrich · Accepted Answer · 2018-05-31T15:08:19.167

The apply() function coerces its first argument to a matrix before calling the function on each column. So your data frames are coerced to matrix objects. A consequence of that conversion is that as.matrix(df_AB) has non-null rownames, while as.matrix(df_ab) does not:

> str(as.matrix(df_ab))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df_AB))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"

So when apply() extracts a column from the matrix built from df_AB, you get a named vector, which is not identical to an unnamed vector.

apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
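To make that concrete, here is a minimal sketch (the names col_a and the unname() workaround are mine, not part of the original question). It pulls one column out of the coerced matrix by hand and shows that the names are the only thing standing between it and 1:5:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# apply() works on as.matrix(df_AB); extracting a column keeps the row names
col_a <- as.matrix(df_AB)[, "a"]
print(names(col_a))

# Dropping the names leaves a plain integer vector
print(identical(unname(col_a), 1:5))

# The same idea inside apply(): strip the names before comparing
res <- apply(df_AB, 2, function(x) identical(1:5, unname(x)))
print(res)
```

Stripping the names inside the function is only a workaround; it does not change how the data frame stores its row names.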

The difference comes from how the two data frames were created. df[, c(1,2)] leaves i missing, whereas subset() selects rows using a logical vector for the value of i. It looks like subsetting a data.frame with a non-missing value for i is what causes this difference in the row.names attribute:

> str(as.matrix(df[1:5, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df[, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
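The same effect should show up with a logical i, which is closer to what subset() does as far as I understand it. This is a small sketch under that assumption; keep, m_missing and m_logical are illustrative names:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))

# i missing: the automatic row names of df are carried over unchanged
m_missing <- as.matrix(df[, 1:2])

# i given as an all-TRUE logical vector, similar to a subset() call
# without a row condition
keep <- rep(TRUE, nrow(df))
m_logical <- as.matrix(df[keep, 1:2])

print(rownames(m_missing))
print(rownames(m_logical))
```

Every row is kept in both cases, so the only thing that changes is whether i was supplied at all.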

You can see all the gory details of the difference between the two data.frames using the .Internal(inspect(x)) function, if you're interested.

As Roland pointed out in his comments, you can use the .row_names_info() function to see the differences in only the row names.

Notice that when i is missing, the result of .row_names_info() is negative, but it is positive if you subset with a non-missing i.

> .row_names_info(df_ab, type=1)
[1] -5
> .row_names_info(df_AB, type=1)
[1] 5

What these values mean is explained in ?.row_names_info:

type: integer.  Currently ‘type = 0’ returns the internal
      ‘"row.names"’ attribute (possibly ‘NULL’), ‘type = 2’ the
      number of rows implied by the attribute, and ‘type = 1’ the
      latter with a negative sign for ‘automatic’ row names.
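As a quick check of that description, the sketch below compares type = 1 and type = 2 for both objects, and then resets the row names of a copy (df_reset is a name I made up) to show that they become "automatic" again:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_ab <- df[, c(1, 2)]
df_AB <- subset(df, select = c(1, 2))

# type = 1: signed row count, negative means "automatic" row names
info_ab <- .row_names_info(df_ab, type = 1L)
info_AB <- .row_names_info(df_AB, type = 1L)

# type = 2: number of rows, regardless of sign
n_ab <- .row_names_info(df_ab, type = 2L)
n_AB <- .row_names_info(df_AB, type = 2L)

# Assigning NULL resets the row names to the automatic form
df_reset <- df_AB
rownames(df_reset) <- NULL
info_reset <- .row_names_info(df_reset, type = 1L)

print(c(info_ab, info_AB, n_ab, n_AB, info_reset))

# With automatic row names, apply() sees unnamed columns again
res_reset <- apply(df_reset, 2, function(x) identical(1:5, x))
print(res_reset)
```

Resetting the row names is a reasonable fix when you do not care about them and want apply() to behave the same way for both objects.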

The reason is that `[` creates "automatic" rownames (see `.row_names_info(df_ab, type=1)`) and `subset` creates explicit rownames (see `.row_names_info(df_AB, type=1)`). `as.matrix` just propagates this (a matrix doesn't have to have rownames). — Roland, Oct 20 '14 at 15:19
+1. It seems that `identical(df_ab, df_AB)` should return false? — Señor O, Oct 20 '14 at 15:27
@SeñorO -- `identical(df_AB, df_ab, attrib.as.set=FALSE)` does return FALSE. — Josh O'Brien, Oct 20 '14 at 15:30
The `row.names` values are different but in a manner that should not have produced automatic naming by `as.matrix`. It's not `subset` that does the naming. — IRTFM, Oct 20 '14 at 15:35
See my comment on BondedDust's answer for more precise info on where in `as.matrix.data.frame` the processing of the two objects diverges (i.e. where `df_ab` gets NULL for its row.names). — Josh O'Brien, Oct 20 '14 at 15:36
Can you write a much shorter clearer answer, as per @Roland's comment? There's no need for the .Internal(inspect(x)) dump and it obfuscates things. — smci, May 26 '18 at 03:01

Sven Hohenstein · Answer 2 · 2014-10-20T18:10:01.757

If you want to compare the values 1:5 with the values in the columns, you should not use apply, since apply transforms the data frames to matrices before the function is applied. Because of the explicit row names in the data frame created with subset() (see @Joshua Ulrich's answer), apply passes a named vector, and 1:5 is not identical to a named vector containing the same values.

You should instead use sapply to apply the identical function to the columns. This avoids transforming the data frames to matrices:

> sapply(df_ab, identical, 1:5)
    a     b 
 TRUE FALSE 
> sapply(df_AB, identical, 1:5)
    a     b 
 TRUE FALSE

As you can see, in both data frames the values in the first column are identical to 1:5.
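If you prefer a guaranteed return type, vapply is a possible alternative to sapply here. This is only a sketch of the same idea; res_ab and res_AB are illustrative names:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_ab <- df[, c(1, 2)]
df_AB <- subset(df, select = c(1, 2))

# vapply iterates over the columns of the data frame as a list,
# so no matrix coercion (and no row names) is involved
res_ab <- vapply(df_ab, identical, logical(1), y = 1:5)
res_AB <- vapply(df_AB, identical, logical(1), y = 1:5)

print(res_ab)
print(res_AB)
```

Both calls see the bare integer columns, so the two data frames give the same answer.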

score 5 · Answer 3 · answered Oct 20 '14 at 15:16

In one version (using [) your columns are integers, while in the other version (using subset) your columns are named integers.

apply(df_ab, 2, str)

 int [1:5] 1 2 3 4 5
 int [1:5] 6 5 4 3 2
NULL


apply(df_AB, 2, str)

 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL

answered Oct 20 '14 at 15:16

Andrie

That's not exactly correct. `as.matrix` creates this difference if you use `apply`. See `lapply(df_AB, str)`. – Roland Oct 20 '14 at 15:21

score 3 · Answer 4 · answered Oct 20 '14 at 15:29

Looking at the structure of those two objects before they are passed to apply shows only one difference: the row names. That is not a difference I would have expected to produce the result you are seeing. I do not see Joshua's current explanation of 'subset' as logical indexing as accounting for it. Why row.names = c(NA, 5L) produces a named result when extracting with "[" from the coerced matrix is as yet unexplained.

> dput(df_AB)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), row.names = c(NA, 5L), class = "data.frame")
> dput(df_ab)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), class = "data.frame", row.names = c(NA, -5L))

I do agree that it is the as.matrix coercion which needs further investigation:

> attributes(df_AB[,1])
NULL
> attributes(df_ab[,1])
NULL
> attributes(as.matrix(df_AB)[,1])
$names
[1] "1" "2" "3" "4" "5"
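One way to poke at the coercion directly is the rownames.force argument of as.matrix for data frames. The sketch below is my own illustration (m_off and m_on are made-up names); it forces the row names off for df_AB and on for df_ab:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_ab <- df[, c(1, 2)]
df_AB <- subset(df, select = c(1, 2))

# Force row names off for the subset() result
m_off <- as.matrix(df_AB, rownames.force = FALSE)
print(attributes(m_off[, 1]))

# Force row names on for the "[" result
m_on <- as.matrix(df_ab, rownames.force = TRUE)
print(names(m_on[, 1]))

# apply() on the matrix without row names compares unnamed columns
res_off <- apply(m_off, 2, function(x) identical(1:5, x))
print(res_off)
```

So the naming is controlled entirely by what as.matrix decides to do with the row names of each data frame.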

answered Oct 20 '14 at 15:29

IRTFM

The problem arises in `as.matrix.data.frame()`, in the line that calls `.row_names_info` (from `namespace:base`). It in turn calls `.Internal(shortRowNames())`, which gives a different result for the OP's two data.frame objects. Try `.Internal(shortRowNames(df_ab, 1L))` and `.Internal(shortRowNames(df_AB, 1L))` to see where the conversion of the two data.frames diverges... – Josh O'Brien Oct 20 '14 at 15:35
I don't see this as a problem with `as.matrix.data.frame`. It shouldn't have to determine whether `[.data.frame` caused rownames to be explicit instead of implicit. – Joshua Ulrich Oct 20 '14 at 15:38
That delivers the 5 and the -5, but it doesn't explain why a data.frame without rownames gets named. Both of those objects had what I would have expected to be considered "automatic" rownames. – IRTFM Oct 20 '14 at 15:39
@BondedDust: I interpret `c(NA, -5L)` as "completely implicit" and `c(NA, 5L)` as "explicit, standard `1:nrow(x)`". – Joshua Ulrich Oct 20 '14 at 15:41
Is that a distinction that has support in the documentation? It's the first time I have seen such a distinction. – IRTFM Oct 20 '14 at 15:49
@BondedDust See `help(".row_names_info")`. "type: integer. Currently type = 0 returns the internal "row.names" attribute (possibly NULL), type = 2 the number of rows implied by the attribute, and type = 1 the latter with a negative sign for ‘automatic’ row names." – Roland Oct 20 '14 at 15:54
@BondedDust: Somewhat. See the `type` arg definition in `?.row_names_info`. It says that the sign will be negative for "automatic" rownames... and I believe the `NA` is there for compactness when the rownames are `1:nrow(x)` (this is vaguely mentioned in the Note section of `?rownames`). – Joshua Ulrich Oct 20 '14 at 15:55
The `type` argument is being given to `.row_names_info`. It's not saying anything about the internal representation of rownames. The `.set_row_names` function has no `type` argument. – IRTFM Oct 20 '14 at 16:21
@JoshuaUlrich -- Nevertheless (and whether or not it should be doing it) `as.matrix.data.frame` *is* what's dropping the row names. To see exactly where it happens, just do `debug(as.matrix.data.frame)` then run `as.matrix(df_ab)` and step through (about 5 steps) to the line reading `if (.row_names_info(x) <= 0L) NULL else row.names(x)`, which assigns its value to `rn`, which eventually provides the rownames for the returned matrix. – Josh O'Brien Oct 20 '14 at 17:59
@JoshO'Brien: it's only dropping "automatic" rownames (which are required because data.frames must have rownames). So you could argue that they're not there in the first place, since the user didn't put them there, and they didn't do anything that might rely on them being there. – Joshua Ulrich Oct 20 '14 at 18:15
@JoshuaUlrich -- Aha, I see now. Thanks for explaining that (again). – Josh O'Brien Oct 20 '14 at 18:18
