Replace values in a data.frame column based on values in a different column

Question

I have this data.frame:

df <- data.frame(id = rep(c("one", "two", "three"), each = 10), week.born = NA)
df$week.born[c(5,15,28)] <- c(23,19,24)

df 

  id week.born
1    one        NA
2    one        NA
3    one        NA
4    one        NA
5    one        23
6    one        NA
7    one        NA
8    one        NA
9    one        NA
10   one        NA
11   two        NA
12   two        NA
13   two        NA
14   two        NA
15   two        19
16   two        NA
17   two        NA
18   two        NA
19   two        NA
20   two        NA
21 three        NA
22 three        NA
23 three        NA
24 three        NA
25 three        NA
26 three        NA
27 three        NA
28 three        24
29 three        NA
30 three        NA

For one all week.born values should be 23. For two all week.born values should be 19. For one all week.born values should be 24.

Whats the best way to do this?

what if there are two or more non NA values in week.born column for a single id? — Colonel Beauvel, Apr 22 '15 at 19:33
possible duplicate of [Replace values in data frame with other values according to a rule](http://stackoverflow.com/questions/20565949/replace-values-in-data-frame-with-other-values-according-to-a-rule) — Sam Firke, Apr 22 '15 at 19:39

score 6 · Accepted Answer · edited Apr 22 '15 at 21:49

6

I would create another data.frame containing the mapping and then do a simple join:

require(dplyr)
map <- data.frame(id=c("one","two","three"), new.week.born=c(23,19,24))
left_join(df, map, by="id")

# id week.born new.week.born
# 1    one        NA            23
# 2    one        NA            23
# ...
# 16   two        NA            19
# 17   two        NA            19
# 18   two        NA            19
# 19   two        NA            19
# 20   two        NA            19
# 21 three        NA            24
# 22 three        NA            24
# 23 three        NA            24
# ...

See benchmark below.

library(microbenchmark)
library(dplyr) # v 0.4.1
library(data.table) # v 1.9.5

df <- data.frame(id = rep(c("one", "two", "three"), each = 1e6))
df2 <- copy(df)
map <- data.frame(id=c("one","two","three"), new.week.born=c(23,19,24))

dplyr_join <- function() { 
  left_join(df, map, by="id")
}

r_merge <- function() {
  merge(df, map, by="id")
}

data.table_join <- function() {
  setkey(setDT(df2))[map]
}

Unit: milliseconds
              expr         min         lq       mean     median         uq       max neval
      dplyr_join()   409.10635   476.6690   910.6446   489.4573   705.4021  2866.151    10
         r_merge() 41589.32357 47376.0741 55719.1752 50133.0918 54636.3356 83562.931    10
 data.table_join()    94.14621   132.3788   483.4220   225.3309  1051.7916  1416.946    10

edited Apr 22 '15 at 21:49

David Arenburg

91,361
17
137
196

answered Apr 22 '15 at 19:33

faidherbard

983
8
19

You did a bit strange benchmark I might say. Why did you recreate each data set in each function? Also, Jan solution can be shorten to just `setkey(setDT(df))[lkp]`. Finally, what `data.table` version do you have? – David Arenburg Apr 22 '15 at 20:48
Made the comparison with dplyr_0.4.1 and data.table_1.9.4. I've re-created the dataset in each function to avoid having it modified by the data.table package. – faidherbard Apr 22 '15 at 20:53
My benchmark on `data.table` v 1.9.5 came much faster than dplyrs. But you need shorten the code. – David Arenburg Apr 22 '15 at 20:55
Ok for setkey(setDT(df))[lkp] however it doesn't improve the performances and it makes is even more tricky to understand the code. That was my point with dplyr which keeps it quite simple with the left_join syntax and which doesn't put any prequisite on the data.frame (no key required) – faidherbard Apr 22 '15 at 20:57
Ok I'll update my data.table package and check it out – faidherbard Apr 22 '15 at 20:57
Either way, there is no need to create the data set in each function. And when using `setDT` don't forget to do `df2 <- copy(df)` before. – David Arenburg Apr 22 '15 at 20:58
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/75980/discussion-between-faidherbard-and-david-arenburg). – faidherbard Apr 22 '15 at 21:00
1

faidherbard, your point about joins has been extensively covered in [this data.table vs dplyr post](http://stackoverflow.com/a/27718317/559784). – Arun Apr 23 '15 at 10:20

score 3 · Answer 2 · answered Apr 22 '15 at 19:35

3

You can do:

library(data.table)
setDT(df)[,week.born:=week.born[!is.na(week.born)][1], by=id]

Or base R using ave:

df$week.born = with(df, ave(week.born, id, FUN=function(u) u[!is.na(u)][1]))

answered Apr 22 '15 at 19:35

Colonel Beauvel

30,423
11
47
87

score 2 · Answer 3 · answered Apr 22 '15 at 19:30

2

One solution is:

df$week.born[df$id == "one"] <- 23
df$week.born[df$id == "two"] <- 19
df$week.born[df$id == "three"] <- 24

Regards

answered Apr 22 '15 at 19:30

cho7tom

1,030
2
13
30

score 2 · Answer 4 · edited Apr 22 '15 at 20:51

@cho7tom is OK if you have just few groups, otherwise you may prefer to have a lookup table and make a join to that table to lookup week.born value based on id.

base R

df <- data.frame(id = rep(c("one", "two", "three"), each = 10))
lkp <- data.frame(id=c("one","two","three"), week.born=c(23,19,24))
merge(df, lkp, by="id")

Or using binary join from data.table

library(data.table)
setkey(setDT(df))[lkp]

score 0 · Answer 5 · answered Apr 23 '15 at 12:34

0

When mapping just a few combinations like this, the mapvalues function from the plyr package is simple:

library(plyr)
df$week.born <- mapvalues(df$id, c("one", "two", "three"), c(23, 19, 24))

answered Apr 23 '15 at 12:34

Sam Firke

21,571
9
87
105

Replace values in a data.frame column based on values in a different column

5 Answers5