
I wonder why two data frames, a and b, give different results when I look up a row name that does not exist. For example,

a <- as.data.frame(matrix(1:3, ncol = 1, nrow = 3, dimnames = list(c("A1", "A10", "B"), "V1")))
a
    V1
A1   1
A10  2
B    3

b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2, dimnames = list(c("A10", "B"), "V1")))
b
    V1
A10  4
B    5

Let's try to get "A10", "A1", "A" from data frame a:

> a["A10", 1]
[1] 2
> a["A1", 1]
[1] 1                    # expected
> a["A", 1]
[1] NA                   # expected
> a["B", 1]
[1] 3                    # expected
> a["C", 1]
[1] NA                   # expected

Let's do the same for data frame b:

> b["A10", 1]
[1] 4
> b["A1", 1]
[1] 4                    # unexpected, should be NA
> b["A", 1]              
[1] 4                    # unexpected, should be NA
> b["B", 1]
[1] 5                    # expected
> b["C", 1]
[1] NA                   # expected

Given that a["A", 1] returns NA, why don't b["A", 1] and b["A1", 1] return NA as well?

PS. R version 3.5.2

starball
foehn
  • Probably due to partial matching of row names. – Ahmed Ali Jan 14 '22 at 21:54
  • Thanks @AhmedAli, I kind of heard about it, such as https://stackoverflow.com/questions/14153904/why-does-r-use-partial-matching, but shouldn't it be limited to lists/colnames only? – foehn Jan 14 '22 at 22:00
  • No, it seems to be present in data.frame as well. For example, see https://stackoverflow.com/questions/34233235/r-returning-partial-matching-of-row-names You can also check that data.frame subsetting uses pmatch `View(\`[.data.frame\`)` – Ahmed Ali Jan 14 '22 at 22:06
  • Hmmm. `?"["` says "Unlike S (Becker _et al_ p. 358), R **never uses partial matching when extracting by ‘[’**" - is this a documentation bug (or at least a doc/code mismatch), or have I misunderstood something?? – Ben Bolker Jan 15 '22 at 01:35
  • @Ben Bolker I read that the same way you do. It appears that there is an undocumented exception. This has to be partial matching as Ahmed Ali said. I tried this with various combinations of letters and numbers, letters only, and numbers only (I guess numbers vs letters is a moot point since they are all read as characters). No matter what, if an exact match is unavailable, R accepts the call based on the first characters in the row name matching the index you use. – Tanner33 Jan 15 '22 at 02:25

1 Answer

Synthesizing some of the comments here...


?`[` says:

Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).

But ?`[.data.frame` says:

Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to exact matching on row names use match, as in the examples.

The example given there is:

sw <- swiss[1:5, 1:4]
sw["C", ]
##            Fertility Agriculture Examination Education
## Courtelary      80.2          17          15        12

sw[match("C", row.names(sw)), ]
##    Fertility Agriculture Examination Education
## NA        NA          NA          NA        NA

Meanwhile:

as.matrix(sw)["C", ]
## Error in as.matrix(sw)["C", ] : subscript out of bounds

So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.
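The match workaround can be applied to the data frame b from the question. This is only a minimal sketch: the helper name exact_rows is hypothetical (it is not part of base R), and b is rebuilt here with data.frame() so the snippet runs on its own.

```r
# Data frame b from the question, rebuilt so the example is self-contained.
b <- data.frame(V1 = 4:5, row.names = c("A10", "B"))

# Hypothetical helper: select rows by exact row name only.
# match() returns NA for names that are absent, so those rows come back as NA
# instead of being partially matched.
exact_rows <- function(df, keys) {
  df[match(keys, row.names(df)), , drop = FALSE]
}

exact_rows(b, c("A10", "A1", "A", "B"))$V1
## [1]  4 NA NA  5
```

With this lookup, "A1" and "A" give NA for b, which is what the question expected.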

[.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:

    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
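Calling pmatch directly on the row names from the question reproduces both sets of results. This is a sketch of the lookup step only, not of everything [.data.frame does; the variable names rows_a, rows_b and keys are just for illustration.

```r
rows_a <- c("A1", "A10", "B")  # row names of a
rows_b <- c("A10", "B")        # row names of b
keys   <- c("A10", "A1", "A", "B", "C")

# In a, "A" is a prefix of both "A1" and "A10": the partial match is
# ambiguous, so pmatch() returns NA for it.
idx_a <- pmatch(keys, rows_a, duplicates.ok = TRUE)
idx_a
## [1]  2  1 NA  3 NA

# In b, "A1" and "A" each have exactly one partial match ("A10"),
# so both resolve to row 1.
idx_b <- pmatch(keys, rows_b, duplicates.ok = TRUE)
idx_b
## [1]  1  1  1  2 NA
```

So a["A", 1] is NA only because the partial match is ambiguous there, while in b the same prefix matches a single row name.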

There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)

It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.

Mikael Jagan