Partially replace grouped data in one dataframe with data from other dataframe

Question

I have two dataframes, one large, one small, with the columns only partially shared:

df1 <- data.frame(
  Utt = c("xyzxyz", "hi their", "how ae you?", "xxxxx", "yyzyzyz", "hybsfc"),
  File = c("F01", "F02", "F02", "F03", "F03", "F12"),
  x = 1:6,
  y = LETTERS[1:6],
  z = rnorm(6)
)

df2 <- data.frame(
  Utt = c("hi there", "how are you?"),
  File = c("F02", "F02")
)

Column Utt in df1 contains corrupted data for File == "F02"(the rest of the data in that column is okay). I want to replace the corrupted data with cleaned-up data from column Utt in df2. How can that be done efficiently in dplyr?

A less-than-efficient method is by filtering df1 for File == "F02" and mutateing Uttwith input from the respective column in df2$Utt:

library(dplyr)
df1 %>%
  filter(File == "F02") %>%
  mutate(Utt = df2$Utt)

It's not efficient because the mutated df1 needs to be joined appropriately with the old df1to obtain the desired result:

           Utt File x y          z
1       xyzxyz  F01 1 A -2.5514777
2     hi there  F02 2 B -2.7582295
3 how are you?  F02 3 C  2.1081157
4        xxxxx  F03 4 D  0.1628507
5      yyzyzyz  F03 5 E -1.1904290
6       hybsfc  F12 6 F -1.1244349

Related: [Replace a subset of a data frame with dplyr join operations](https://stackoverflow.com/questions/44930149/replace-a-subset-of-a-data-frame-with-dplyr-join-operations). I note that you are looking for a `dplyr` solution, but for such "update join" is really where `data.table` excels. — Henrik, Jul 29 '21 at 10:55
Also relevant: [Canonical tidyverse method to update some values of a vector from a look-up table](https://stackoverflow.com/questions/67081496/canonical-tidyverse-method-to-update-some-values-of-a-vector-from-a-look-up-tabl) — Henrik, Jul 29 '21 at 11:21

iago · Accepted Answer · 2021-07-29T10:22:08.097

2

What do you think of this method?:

df1 %>% 
  group_by(File) %>% 
  mutate(rn = row_number()) %>% 
  rows_update(df2 %>% mutate(rn = row_number()), by = c("File", "rn")) %>% 
  select(-rn)
# A tibble: 6 × 5
# Groups:   File [4]
  Utt          File      x y         z
  <chr>        <chr> <int> <chr> <dbl>
1 xyzxyz       F01       1 A     0.451
2 hi there     F02       2 B     1.66 
3 how are you? F02       3 C     0.505
4 xxxxx        F03       4 D     0.757
5 yyzyzyz      F03       5 E     1.28 
6 hybsfc       F12       6 F     0.226

edited Jul 29 '21 at 10:22

answered Jul 29 '21 at 09:35

iago

2,990
4
21
27

In the atcual data sets, which are very large, I have no idea how many rows there are prior to the corrupted data. – Chris Ruehlemann Jul 29 '21 at 09:44

score 1 · Answer 2 · answered Jul 29 '21 at 10:50

Maybe like this?

library(dplyr)

tmp <- df1 %>% filter(File %in% df2$File) %>% arrange(File, Utt)
df2 <- df2 %>% arrange(File, Utt)
tmp$Utt <- df2$Utt
bind_rows(tmp, df1 %>% filter(!File %in% df2$File)) %>% arrange(File)

#           Utt File x y           z
#1       xyzxyz  F01 1 A  1.63559945
#2     hi there  F02 2 B  0.72609110
#3 how are you?  F02 3 C -1.27383493
#4        xxxxx  F03 4 D  0.02314656
#5      yyzyzyz  F03 5 E  0.48712123
#6       hybsfc  F12 6 F -0.13702750

Partially replace grouped data in one dataframe with data from other dataframe

2 Answers2