update valules of duplicated rows, R

Question

I have two dataframes which share many columns, but not all. Generally, I wouold like to merge these two dataframes. This has a slight twist though... Both dataframes have an ID-column. Some IDs are in both dataframes. In dataframe A the status of a certain ID might be 'pending', whereas in dataframe B it might be 'closed'. Generally, B is the more up-to-date dataframe, so I would like merge both dataframes, but keep the status only of the dataframe B.

The dataframes might look like this.

Dataframe A
ID   status     date         someColumnA
1    'open'     01.01.2020   A
2    'closed'   01.01.2020   B
3    'pending'  01.01.2020   C

Dataframe B
ID   status     date         someColumnB
1    'closed'   01.01.2020   rr
2    'closed'   01.01.2020   tt
4    'pending'  01.01.2020   zz

In the end I would like to receive a dataframe like such

ID   status     date         someColumnA someColumnB
1    'closed'   01.01.2020   A           rr
2    'closed'   01.01.2020   B           tt
3    'pending'  01.01.2020   C           -
4    'pending'  01.01.2020   -           zz

Thanks for the help!

This should be a `merge`/join operation, here are two good references for the concept: https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/a/6188334/3358272. It may take some experimentation to fully understand the prospects of joins, but it can be a huge benefit with more-complex data transformations/joins. — r2evans, Jun 09 '21 at 15:02

athiebaut · Answer 1 · 2021-07-05T16:06:43.017

0

Using base R functions :

Merge the tables : df <- merge(x = dfB, y = dfA, by.x = "ID", by.y = "ID", all.x = T, , all.y = T)
Remove the unwanted columns : df[c("status.x","data.x")] <- NULL (adapt the column names)
Reorder columns (if needed) : df <- df[,c(5, 4, 1, 2, 3)] (or whatever the order is, just adapt the indices in the vector)

edited Jul 05 '21 at 16:06

answered Jun 09 '21 at 14:33

athiebaut

153
9

score 0 · Accepted Answer · answered Jun 09 '21 at 14:49

Does this work:

library(dplyr)
df1$status[match(df2$ID,df1$ID, nomatch = 0)] <- df2$status[match(df1$ID,df2$ID,nomatch = 0)]
df1
# A tibble: 3 x 4
     ID status    date       someColumnA
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 A          
2     2 'closed'  01.01.2020 B          
3     3 'pending' 01.01.2020 C          
df1 %>% full_join(df2)
Joining, by = c("ID", "status", "date")
# A tibble: 4 x 5
     ID status    date       someColumnA someColumnB
  <dbl> <chr>     <chr>      <chr>       <chr>      
1     1 'closed'  01.01.2020 A           rr         
2     2 'closed'  01.01.2020 B           tt         
3     3 'pending' 01.01.2020 C           NA         
4     4 'pending' 01.01.2020 NA          zz

Data used:

df1
# A tibble: 3 x 4
     ID status    date       someColumnA
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 A          
2     2 'closed'  01.01.2020 B          
3     3 'pending' 01.01.2020 C          
df2
# A tibble: 3 x 4
     ID status    date       someColumnB
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 rr         
2     2 'closed'  01.01.2020 tt         
3     4 'pending' 01.01.2020 zz

This worked perfectly for me. Thanks a lot! I think the match function is quite elegant in this context. — Cruden10, Jun 16 '21 at 07:29

update valules of duplicated rows, R

2 Answers2