0

I have two dataframes which share many columns, but not all. Generally, I wouold like to merge these two dataframes. This has a slight twist though... Both dataframes have an ID-column. Some IDs are in both dataframes. In dataframe A the status of a certain ID might be 'pending', whereas in dataframe B it might be 'closed'. Generally, B is the more up-to-date dataframe, so I would like merge both dataframes, but keep the status only of the dataframe B.

The dataframes might look like this.

Dataframe A
ID   status     date         someColumnA
1    'open'     01.01.2020   A
2    'closed'   01.01.2020   B
3    'pending'  01.01.2020   C

Dataframe B
ID   status     date         someColumnB
1    'closed'   01.01.2020   rr
2    'closed'   01.01.2020   tt
4    'pending'  01.01.2020   zz

In the end I would like to receive a dataframe like such

ID   status     date         someColumnA someColumnB
1    'closed'   01.01.2020   A           rr
2    'closed'   01.01.2020   B           tt
3    'pending'  01.01.2020   C           -
4    'pending'  01.01.2020   -           zz

Thanks for the help!
Cruden10
  • 13
  • 5
  • 1
    This should be a `merge`/join operation, here are two good references for the concept: https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/a/6188334/3358272. It may take some experimentation to fully understand the prospects of joins, but it can be a huge benefit with more-complex data transformations/joins. – r2evans Jun 09 '21 at 15:02

2 Answers2

0

Using base R functions :

  1. Merge the tables : df <- merge(x = dfB, y = dfA, by.x = "ID", by.y = "ID", all.x = T, , all.y = T)

  2. Remove the unwanted columns : df[c("status.x","data.x")] <- NULL (adapt the column names)

  3. Reorder columns (if needed) : df <- df[,c(5, 4, 1, 2, 3)] (or whatever the order is, just adapt the indices in the vector)

athiebaut
  • 153
  • 9
0

Does this work:

library(dplyr)
df1$status[match(df2$ID,df1$ID, nomatch = 0)] <- df2$status[match(df1$ID,df2$ID,nomatch = 0)]
df1
# A tibble: 3 x 4
     ID status    date       someColumnA
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 A          
2     2 'closed'  01.01.2020 B          
3     3 'pending' 01.01.2020 C          
df1 %>% full_join(df2)
Joining, by = c("ID", "status", "date")
# A tibble: 4 x 5
     ID status    date       someColumnA someColumnB
  <dbl> <chr>     <chr>      <chr>       <chr>      
1     1 'closed'  01.01.2020 A           rr         
2     2 'closed'  01.01.2020 B           tt         
3     3 'pending' 01.01.2020 C           NA         
4     4 'pending' 01.01.2020 NA          zz         

Data used:

df1
# A tibble: 3 x 4
     ID status    date       someColumnA
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 A          
2     2 'closed'  01.01.2020 B          
3     3 'pending' 01.01.2020 C          
df2
# A tibble: 3 x 4
     ID status    date       someColumnB
  <dbl> <chr>     <chr>      <chr>      
1     1 'closed'  01.01.2020 rr         
2     2 'closed'  01.01.2020 tt         
3     4 'pending' 01.01.2020 zz      
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • This worked perfectly for me. Thanks a lot! I think the match function is quite elegant in this context. – Cruden10 Jun 16 '21 at 07:29