0

I have a data frame such as :

DF1

COL1 COL2 COL3
sp1  1    A
sp1  2    B
sp1  3    C
sp1  4    D
sp2  2    E
sp2  4    F
sp3  1    G
sp3  2    H
sp3  4    I
sp3  5    J

and another data frame such as ;

DF2

COL2
1
2
3
4
5
6
7
8

and I would like to add missing COL2 values for each COL1 value within the DF1.

I should then get :

COL1 COL2 COL3
sp1  1    A
sp1  2    B
sp1  3    C
sp1  4    D
sp1  5    NA
sp1  6    NA
sp1  7    NA
sp1  8    NA
sp2  1    NA
sp2  2    E
sp2  3    NA
sp2  4    F
sp2  5    NA
sp2  6    NA
sp2  7    NA
sp2  8    NA
sp3  1    G
sp3  2    H
sp3  3    NA
sp3  4    I
sp3  5    J
sp3  6    NA
sp3  7    NA
sp3  8    NA

Here are the df in dput format :

df1

structure(list(COL1 = c("sp1", "sp1", "sp1", "sp1", "sp2", "sp2", 
"sp3", "sp3", "sp3", "sp3"), COL2 = c(1L, 2L, 3L, 4L, 2L, 4L, 
1L, 2L, 4L, 5L), COL3 = c("A", "B", "C", "D", "E", "F", "G", 
"H", "I", "J")), class = "data.frame", row.names = c(NA, -10L
))

df2

structure(list(COL2 = 1:8), class = "data.frame", row.names = c(NA, 
-8L))
Peter
  • 11,500
  • 5
  • 21
  • 31
chippycentra
  • 3,396
  • 1
  • 6
  • 24

3 Answers3

2

You can use complete().

library(dplyr)
library(tidyr)

df1 %>% group_by(COL1) %>% complete(df2)

   COL1 COL2 COL3
1   sp1    1    A
2   sp1    2    B
3   sp1    3    C
4   sp1    4    D
5   sp1    5 <NA>
6   sp1    6 <NA>
7   sp1    7 <NA>
8   sp1    8 <NA>
9   sp2    1 <NA>
10  sp2    2    E
11  sp2    3 <NA>
12  sp2    4    F
13  sp2    5 <NA>
14  sp2    6 <NA>
15  sp2    7 <NA>
16  sp2    8 <NA>
17  sp3    1    G
18  sp3    2    H
19  sp3    3 <NA>
20  sp3    4    I
21  sp3    5    J
22  sp3    6 <NA>
23  sp3    7 <NA>
24  sp3    8 <NA>
benson23
  • 16,369
  • 9
  • 19
  • 38
  • Very neat solution. Can you explain why it works with `complete(df2)`? Does it automatically mathc by column names, i.e. figures out that df2 only has COL2 and that there is a COL2 in df1 as well, so that's the only possible match? – deschen Feb 23 '22 at 14:06
  • Just theoretically, what if df2 would contain two columns: COL_2 (see the different name!) and COL_whatever? – deschen Feb 23 '22 at 14:07
  • 1
    The documentation of `tidyr::complete` says: "This is a wrapper around expand(), dplyr::full_join() and replace_na()", therefore yes, it automatically "complete" on intersecting column names. If `df2` does not contain a matching name with `df` or has multiple columns, you'll need to do `df1 %>% group_by(COL1) %>% complete(COL2 = df2$COLLLLLLLLLLLLL2)`, where obvisously, `df2$COLLLLLLLLLLLLL2` contains the values `1:8` – benson23 Feb 23 '22 at 14:25
1

dplyr

library(dplyr)
library(tidyr) # crossing
df2 %>%
  crossing(COL1 = df1$COL1) %>%
  left_join(df1, by = c("COL1", "COL2"))
# # A tibble: 24 x 3
#     COL2 COL1  COL3 
#    <int> <chr> <chr>
#  1     1 sp1   A    
#  2     1 sp2   NA   
#  3     1 sp3   G    
#  4     2 sp1   B    
#  5     2 sp2   E    
#  6     2 sp3   H    
#  7     3 sp1   C    
#  8     3 sp2   NA   
#  9     3 sp3   NA   
# 10     4 sp1   D    
# # ... with 14 more rows

base R

eg <- expand.grid(COL1 = unique(df1$COL1), COL2 = unique(df2$COL2), stringsAsFactors = FALSE)
merge(df1, eg, by = c("COL1", "COL2"), all = TRUE)
#    COL1 COL2 COL3
# 1   sp1    1    A
# 2   sp1    2    B
# 3   sp1    3    C
# 4   sp1    4    D
# 5   sp1    5 <NA>
# 6   sp1    6 <NA>
# 7   sp1    7 <NA>
# 8   sp1    8 <NA>
# 9   sp2    1 <NA>
# 10  sp2    2    E
# 11  sp2    3 <NA>
# 12  sp2    4    F
# 13  sp2    5 <NA>
# 14  sp2    6 <NA>
# 15  sp2    7 <NA>
# 16  sp2    8 <NA>
# 17  sp3    1    G
# 18  sp3    2    H
# 19  sp3    3 <NA>
# 20  sp3    4    I
# 21  sp3    5    J
# 22  sp3    6 <NA>
# 23  sp3    7 <NA>
# 24  sp3    8 <NA>
r2evans
  • 141,215
  • 6
  • 77
  • 149
1

Yet another option:

library(tidyverse)
df1 %>%
  full_join(df1 %>% expand(COL1, COL2 = df2$COL2), by = c("COL1", "COL2"))

   COL1 COL2 COL3
1   sp1    1    A
2   sp1    2    B
3   sp1    3    C
4   sp1    4    D
5   sp2    2    E
6   sp2    4    F
7   sp3    1    G
8   sp3    2    H
9   sp3    4    I
10  sp3    5    J
11  sp1    5 <NA>
12  sp1    6 <NA>
13  sp1    7 <NA>
14  sp1    8 <NA>
15  sp2    1 <NA>
16  sp2    3 <NA>
17  sp2    5 <NA>
18  sp2    6 <NA>
19  sp2    7 <NA>
20  sp2    8 <NA>
21  sp3    3 <NA>
22  sp3    6 <NA>
23  sp3    7 <NA>
24  sp3    8 <NA>
deschen
  • 10,012
  • 3
  • 27
  • 50
  • 1
    Curious, is there a reason you prefer (here) nested `%>%`-pipes instead of `df1 %>% expand(...) %>% full_join(df1, ...)`? I've always found nested pipes to have a higher chance of being visually confusing; this one is straight-forward, but the Inception of nested-`%>%` can get dizzying. – r2evans Feb 23 '22 at 14:38
  • 1
    Completely agree that it‘s not an elegant solution. I basically trial & errored a bot and this one worked. But I‘d definitely prefer an unnested solution over this one. – deschen Feb 23 '22 at 14:44