R: replace NA with item from vector

Question

I am trying to replace some missing values in my data with the average values from a similar group.

My data looks like this:

   X   Y
1  x   y
2  x   y
3  NA  y
4  x   y

And I want it to look like this:

   X   Y
1  x   y
2  x   y
3  y   y
4  x   y

I wrote this, and it worked:

for(i in 1:nrow(data.frame)){
   if(is.na(data.frame$X[i])){
       data.frame$X[i] <- data.frame$Y[i]
   }
}

But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like

is.na(data.frame$X) <- data.frame$Y

But this gets a mismatched size error. It seems like there should be a command that does this, but I cannot find it here on SO or on the R help list. Any ideas?

As an aside - it's probably not good to use `data.frame` as your variable name, since in some contexts that masks the `data.frame()` function. — Ken Williams, Jul 14 '11 at 18:02
As @hadley said, this isn't really a problem. I assume the Y column does not contain all of the same value... Like he said, we need context. — OTStats, Dec 07 '18 at 16:48

score 12 · Accepted Answer · answered Jul 13 '11 at 21:26

ifelse is your friend.

Using Dirk's dataset

df <- within(df, X <- ifelse(is.na(X), Y, X))
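
For a self-contained check, here is a minimal sketch that rebuilds the four-row example from the question before applying the one-liner. The data frame construction is an assumption based on the question's table (plain character columns, not factors).

```r
# Example data mirroring the question (assumed: character columns, not factors)
df <- data.frame(X = c("x", "x", NA, "x"), Y = rep("y", 4),
                 stringsAsFactors = FALSE)

# Where X is missing take Y, otherwise keep X
df <- within(df, X <- ifelse(is.na(X), Y, X))
df
```

`ifelse` evaluates the test over the whole column at once, so no explicit loop over rows is needed.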

answered Jul 13 '11 at 21:26

Richie Cotton


Care to compare speeds of your and Dirk's answer? – Roman Luštrik Jul 14 '11 at 10:16
I didn't clock either method, but they both execute immediately (unlike the several minutes it took with my original code). I think I prefer this method simply because it uses one line of code instead of two. – gregmacfarlane Jul 14 '11 at 14:33
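
For anyone who does want to clock the two approaches, a rough sketch with `system.time` could look like the following. The synthetic data (500,000 rows, roughly 10% missing) and the names `big`, `res1` and `res2` are assumptions for illustration; timings will vary by machine, so none are quoted here.

```r
# Synthetic data: 500,000 rows, roughly 10% missing in X (sizes are assumptions)
set.seed(1)
n <- 5e5
big <- data.frame(X = sample(c("x", NA), n, replace = TRUE, prob = c(0.9, 0.1)),
                  Y = rep("y", n), stringsAsFactors = FALSE)

# Approach 1: ifelse inside within
t_ifelse <- system.time(
  res1 <- within(big, X <- ifelse(is.na(X), Y, X))
)

# Approach 2: index-based replacement
t_index <- system.time({
  res2 <- big
  ind <- which(is.na(res2$X))
  res2[ind, "X"] <- res2[ind, "Y"]
})

# Both approaches should produce the same column
stopifnot(identical(res1$X, res2$X))
print(t_ifelse)
print(t_index)
```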

score 9 · Answer 2 · answered Jul 13 '11 at 19:49

Just vectorise it -- the boolean index test is one expression, and you can use that in the assignment too.

Setting up the data:

R> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
R> df
     X Y
1    x y
2    x y
3 <NA> y
4    x y

And then proceed by computing an index of where to replace, and replace:

R> ind <- which( is.na( df$X ) )
R> df[ind, "X"] <- df[ind, "Y"]

which yields the desired outcome:

R> df
  X Y
1 x y
2 x y
3 y y
4 x y
R>

answered Jul 13 '11 at 19:49

Dirk Eddelbuettel


What's the purpose of `which`? Is numerical indexing faster / less error prone than logical? – Joshua Ulrich Jul 13 '11 at 19:55
I prefer numeric indices (here a single '3') rather than a boolean of length N. – Dirk Eddelbuettel Jul 13 '11 at 19:56
@Joshua: I've found that numerical indexing can indeed be much faster than logical, if the number of TRUE cases is small relative to the total number of elements. – Hong Ooi Jul 14 '11 at 01:02
Which is redundant here. I guess it all depends on whether you prefer Boolean algebra or set theory. – hadley Jul 15 '11 at 01:10
Shorter is better, and makes it easier to check interim results. Awaiting `whch2` with my breath held ;-) – Dirk Eddelbuettel Jul 15 '11 at 01:19
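
To illustrate the point raised in the comments, here is a minimal sketch of the same replacement using the logical vector from `is.na` directly, without `which`. The variable name `miss` is an assumption for illustration.

```r
df <- data.frame(X = c("x", "x", NA, "x"), Y = rep("y", 4),
                 stringsAsFactors = FALSE)

# Logical index: TRUE where X is missing (is.na itself never returns NA)
miss <- is.na(df$X)

# Replace only the missing entries of X with the matching entries of Y
df$X[miss] <- df$Y[miss]
df
```

Both forms give the same result here; the choice between a logical index and `which` is mostly a matter of preference, as the comments suggest.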

score 1 · Answer 3 · answered Aug 25 '22 at 07:14

If you are already using dplyr or tidyverse, you can use the coalesce function to do exactly this.

> library(dplyr)
> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
> df %>% mutate(X = coalesce(X, Y))
  X Y
1 x y
2 x y
3 y y
4 x y

answered Aug 25 '22 at 07:14

Olsgaard


score 0 · Answer 4 · edited May 23 '17 at 12:32

Unfortunately I cannot comment yet, but when I vectorized some code involving strings (character vectors), the approaches above did not seem to work. The reason is explained in this answer. When characters are involved, stringsAsFactors=FALSE is not enough, because R might already have created factors out of the characters. One needs to ensure that the data becomes a character vector again, e.g., data.frame(X=as.character(c("x", "x", NA, "x")), Y=as.character(rep("y",4)), stringsAsFactors=FALSE)
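
As a sketch of that situation, assuming the columns have already been stored as factors: `ifelse` generally drops the factor attributes and returns the underlying integer codes, so converting back to character first avoids the surprise. The helper name `is_fac` is an assumption for illustration.

```r
# Hypothetical data whose columns are already factors
df <- data.frame(X = factor(c("x", "x", NA, "x")), Y = factor(rep("y", 4)))

# Convert any factor columns back to character before replacing
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# The usual replacement now works on plain character vectors
df$X <- ifelse(is.na(df$X), df$Y, df$X)
df
```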
