20

Suppose that I gave a treatment to some column values of a data frame like this:

  id animal weight   height ...
  1    dog     23.0
  2    cat     NA
  3   duck     1.2
  4  fairy     0.2
  5  snake     BAD


df <- data.frame(id = seq(1:5),
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2",  "BAD"))

Suppose that the treatment require to work in a separately table, and gave as the result, the following data frame that is a subset of the original:

  id animal weight
  2    cat    2.2
  5  snake    1.3

sub_df <- data.frame(id = c(2, 5),
             animal = c("cat", "snake"),
             weight = c("2.2", "1.3"))

Now I want to put all together again, so I use an operation like this:

> df %>%
   anti_join(sub_df, by = c("id", "animal")) %>%
   bind_rows(sub_df)

 id animal weight
 4  fairy    0.2
 1    dog   23.0
 3   duck    1.2
 2    cat    2.2
 5  snake    1.3

Exist some way to do this directly with join operations?

In the case that the subset is just the key column and the variable subject to give a treatment (id, animal weigth) and not the total variables of the original data frame (id, animal, weight, height), how could assemble the subset with the original set?

Cristóbal Alcázar
  • 1,153
  • 14
  • 26

4 Answers4

19

What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).

Here's an example for your toy data:

library(data.table)
setDT(df)             # convert to data.table without copy
setDT(sub_df)         # convert to data.table without copy

# join and update "df" by reference, i.e. without copy 
df[sub_df, on = c("id", "animal"), weight := i.weight]

The data is now updated:

#   id animal weight
#1:  1    dog   23.0
#2:  2    cat    2.2
#3:  3   duck    1.2
#4:  4  fairy    0.2
#5:  5  snake    1.3

You can use setDF to switch back to ordinary data.frame.

talat
  • 68,970
  • 21
  • 126
  • 157
  • [docendo discimus](https://stackoverflow.com/users/3521006/docendo-discimus) works perfect. Now if you know about a dplyr approach, my specific point is that I used two intermediate steps to join the two tables again. First, add the rest of the variables not involved in the treatment table (homogeneus number of columns) and second step to remove the treated observations from the main data frame (with anti_join) and finally add by row the table of the first step. – Cristóbal Alcázar Jul 05 '17 at 17:53
  • lovely answer, but doesn't answer the posted question – Nettle Jan 13 '18 at 17:27
  • 3
    @Nettle, well, since it's accepted by OP, it seems that it did answer the question to a certain degree. Keep in mind that just because OP asked for a dplyr answer doesn't mean a non-dplyr answer is incorrect or not helpful. In this specific case, data.table offers a much better solution than dplyr, in my opinion. – talat Jan 16 '18 at 13:45
  • 2
    @docendo, totally agree with you. Dplyr doesn't seem to have a direct solution for this common task, and your answer is very concise. Variation on original tact, but equally unsatisfying, would be: `df %>% filter(!id %in% sub_df$id) %>% bind_rows(sub_df)` – Nettle Jan 21 '18 at 02:16
5

Remove the na's first, then simply stack the tibbles:

 bind_rows(filter(df,!is.na(weight)),sub_df)
r.user.05apr
  • 5,356
  • 3
  • 22
  • 39
5

For anyone looking for a solution to use in a tidyverse pipeline:

I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.

For example, if the OP's df had an additional 'height' column:

library(dplyr)

df <- tibble(id = seq(1:5),
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c("23", NA, "1.2", "0.2",  "BAD"),
                 height = c("54", "45", "21", "50", "42"))

And the subset of data we wanted to join in was the same:

sub_df <- tibble(id = c(2, 5),
                     animal = c("cat", "snake"),
                     weight = c("2.2", "1.3"))

If we used the OP's method alone (anti_join %>% bind_rows), this won't work because of the additional 'height' column in df. An extra step or two is needed.

In this case we could use the following function:

replace_subset <- function(df, df_subset, id_col_names = c()) {

  # work out which of the columns contain "new" data
  new_data_col_names <- colnames(df_subset)[which(!colnames(df_subset) %in% id_col_names)]

  # complete the df_subset with the extra columns from df
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -new_data_col_names), by = c(id_col_names))

  # join and bind rows
  df_out <- df %>%
    anti_join(df_sub_to_join, by = c(id_col_names)) %>%
    bind_rows(df_sub_to_join)

  return(df_out)

}

Now for the results:

replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))

## A tibble: 5 x 4
#     id animal weight height
#  <dbl> <chr>  <chr>  <chr> 
#1     1 dog    23     54    
#2     3 duck   1.2    21    
#3     4 fairy  0.2    50    
#4     2 cat    2.2    45    
#5     5 snake  1.3    42  

And here's an example using the function in a pipeline:

df %>%
  replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
  mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
  mutate(bmi = weight / (height^2))

## A tibble: 5 x 5
#     id animal weight height      bmi
#  <dbl> <chr>   <dbl>  <dbl>    <dbl>
#1     1 dog      23       54 0.00789 
#2     3 duck      1.2     21 0.00272 
#3     4 fairy     0.2     50 0.00008 
#4     2 cat       2.2     45 0.00109 
#5     5 snake     1.3     42 0.000737

hope this is helpful :)

MaxL
  • 101
  • 1
  • 4
5

Isn't dplyr::rows_update exactly what we need here? The following code should work:

df %>% dplyr::rows_update(sub_df, by = "id")

This should work as long as there is a unique identifier (one or multiple variables) for your datasets.

dufei
  • 2,166
  • 1
  • 7
  • 18