1

I am creating two data frames and then merging them into a third:

dat <- data.frame(code = c("A11", "B22", "C33"),
                  age = c(NA, NA, 12),
                  sex = c(NA, NA, 2),
                  more = c(7, 4, 9),
                  stringsAsFactors = FALSE)

age.and.sex <- read.table(textConnection("
code age sex
A11 15 2
B22 10 1
"), header = TRUE, stringsAsFactors = FALSE)

joined <- merge(dat, age.and.sex, by="code", all.x=TRUE)

joined
  code age.x sex.x more age.y sex.y
1  A11    NA    NA    7    15     2
2  B22    NA    NA    4    10     1
3  C33    12     2    9    NA    NA

Now, when I try to copy values from the two new columns ("age.y", "sex.y") into the two old ones ("age.x", "sex.y"), this works for one column, but for the other I get a curious warning:

joined[is.na(joined$age.x)]$age.x <- joined$age.y
joined[is.na(joined$sex.x)]$sex.x <- joined$sex.y
Warning message:
In `[<-.data.frame`(`*tmp*`, is.na(joined$sex.x), value = list(code = c("A11",  :
  provided 5 variables to replace 4 variables

What is going on here?

2 Answers2

4

Both of those column "replacements" are completely wrong, but let's focus on the second one, since the first failed in a silent way more or less by accident.

So let's walk through the line

joined[is.na(joined$sex.x)]$sex.x <- joined$sex.y

step by step.

So start with what the is.na() piece is going to return:

is.na(joined$sex.x)
[1]  TRUE  TRUE FALSE

A boolean vector of length three. Ok, I guess. Now what happens when we essentially do joined[c(T,T,F)]?

> joined[is.na(joined$sex.x)]
  code age.x more age.y
1  A11    15    7    15
2  B22    10    4    10
3  C33    NA    9    NA

I bet you weren't expecting that! R thinks you are selecting columns, recycles the boolean vector and selects columns 1, 2, 4 and 5. Notice that sex.x isn't present at all:

> joined[is.na(joined$sex.x)]$sex.x
NULL

At this point it should be totally clear why you got a warning. You were attempting to assign to something that did not exist.

The first attempt failed as well (it copied the NA from the .y column, which is probably not what you wanted). But you were just lucky that the column you wanted was present at all.

More generally, replacement and subsetting is probably a fragile way to do this, you are probably looking for a coalesce function.

Community
  • 1
  • 1
joran
  • 169,992
  • 32
  • 429
  • 468
1

You indexed the columns of your dataframe, and, by accident, your index was a multiple of the number of columns. look at what you indexed:

> joined[is.na(joined$age.x)]
  code age.x more age.y
1  A11    NA    7    15
2  B22    NA    4    10
3  C33    12    9    NA

I guess what you want to do is the following:

joined[is.na(joined$age.x),]$age.x <- joined$age.y[is.na(joined$age.x)]
joined[is.na(joined$sex.x),]$sex.x <- joined$sex.y[is.na(joined$sex.x)]
David Heckmann
  • 2,899
  • 2
  • 20
  • 29
  • 2
    I guess `joined$age.x[is.na(joined$age.x)]` might be more efficient. – nicola Mar 18 '16 at 15:39
  • 1
    I tried to stay close to the OPs solution to keep it more instructive. I agree that your way or the one joran suggested would be preferred. – David Heckmann Mar 18 '16 at 15:53
  • @nicola Why is that more efficient? –  Mar 18 '16 at 16:00
  • 1
    @what Extracting a column from a `data.frame` is a relatively cheap operation, since in most cases internally is enough to get the pointer of that column. Subsetting might be more expensive, and surely subsetting a `data.frame` is more expensive than just a vector.. By `joined[is.na(joined$age.x),]$age.x` R must subset, create a completely new `data.frame` and from it extract the given column. If you invert the order, you extract the column first (fast) and then subset just a column, eliminating the overhead due to the construction of a `data.frame`. – nicola Mar 18 '16 at 16:29
  • Thank you, @nicola. I understand. –  Mar 18 '16 at 16:45