15

I am having trouble mutating a subset of rows in dplyr. I am using the chaining command: %>% to say:

data <- data %>%
  filter(ColA == "ABC") %>%
  mutate(ColB = "XXXX")

This works fine but the problems is that I want to be able to select the entire original table and see the mutate applied to only the subset of data I had specified. My problem is that when I view data after this I only see the subset of data and its updated ColB information.

I would also like to know how to do this using data.table.

Thanks.

eddi
  • 49,088
  • 6
  • 104
  • 155
mo_maat
  • 2,110
  • 12
  • 44
  • 72

4 Answers4

15

When you use filter() you are actually removing the rows that do not match the condition you specified, so they will not show up in the final data set.

Does ColB already exist in your data frame? If so,

data %>%
  mutate(ColB = ifelse(ColA == "ABC", "XXXX", ColB))

will change ColB to "XXXX" when ColA == "ABC" and leave it as is otherwise. If ColB does not already exist, then you will have to specify what to do for rows where ColA != "ABC", for example:

data %>%
  mutate(ColB = ifelse(ColA == "ABC", "XXXX", NA))
Kara Woo
  • 3,595
  • 19
  • 31
14

Using data.table, we'd do:

setDT(data)[colA == "ABC", ColB := "XXXX"]

and the values are modified in-place, unlike if-else, which'd copy the entire column to replace just those rows where the condition satisfies.

We call this sub-assign by reference. You can read more about it in the new HTML vignettes.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thank you. This worked. And how do I do a mulitple criteria filter? I tried the following but it does not seem to be the correct syntax.`setDT(data)[(colA == "ABC") & (colC =="DEF"), ColB := "XXXX"]` – mo_maat Apr 24 '15 at 02:45
  • @user1991118 that *is* the correct syntax - perhaps you can show a reproducible example. The only thing I can think of is maybe you meant to use `|` instead of `&`. – eddi Apr 24 '15 at 04:22
  • That was indeed the correct syntax. I had used the wrong variable. It is working. – mo_maat Apr 24 '15 at 16:09
2

Another option is to perform a subsequent combination of union and anti-join with the same data. This requires a primary key:

data <- data %>%
  filter(ColA == "ABC") %>%
  mutate(ColB = "XXXX") %>%
  rbind_list(., anti_join(data, ., by = ...))

Example:

mtcars_n <- mtcars %>% add_rownames
mtcars_n %>%
  filter(cyl > 6) %>%
  mutate(mpg = 1) %>%
  rbind_list(., anti_join(mtcars_n, ., by = "rowname"))

This is much slower than probably any other approach, but useful to get quick results by extending your existing pipe.

krlmlr
  • 25,056
  • 14
  • 120
  • 217
1

Just updating (by June 02nd 2022) @krlmlr great answer:

add_rownames() is deprecated, use tibble::rownames_to_column() instead. rbind_list is also deprecated, use bind_rows instead

You might also find a different sequence of rows in your resulting joined dataset, which depending on your aim is quite difficult to correct with dplyr::arrange() afterwards.

An alternative, although slower, is:

mtcars_n <- mtcars %>% 
  add_rownames() %>% 
  filter(cyl > 6) %>%
  mutate(new_col = 1)
mtcars_m <- left_join(x=mtcars, y=mtcars_n)