0

I want to merge several data.frames with some common columns and append a suffix to the column names to keep track from where does the data for each column come from.

I can do it easily with the suffix term in the first full_join, but when I do the second join, no suffixes are added. I can rename the third data.frame so it has suffixes, but I wanted to know if there is another way of doing it using the suffix term.

Here is an example code:

x = data.frame(col1 = c("a","b","c"), col2 = 1:3, col3 = 1:3)
y = data.frame(col1 = c("b","c","d"), col2 = 4:6, col3 = 1:3)
z = data.frame(col1 = c("c","d","a"), col2 = 7:9, col3 = 1:3)

> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>% 
  full_join(z, by = "col1", suffix = c("","_z")) 

> df
  col1 col2_x col3_x col2_y col3_y col2 col3
1    a      1      1     NA     NA    9    3
2    b      2      2      4      1   NA   NA
3    c      3      3      5      2    7    1
4    d     NA     NA      6      3    8    2

I was expecting that col2 and col3 from data.frame z would have a "_z" suffix. I have tried using empty suffixes while merging two data.frames and it works.

I can work around by renaming the columns in z before doing the second full_join, but in my real data I have several common columns, and if I wanted to merge more data.frames it would complicate the code. This is my expected output.

> colnames(z) = paste0(colnames(z),"_z")

> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>% 
  full_join(z, by = c("col1"="col1_z"))

> df
  col1 col2_x col3_x col2_y col3_y col2_z col3_z
1    a      1      1     NA     NA      9      3
2    b      2      2      4      1     NA     NA
3    c      3      3      5      2      7      1
4    d     NA     NA      6      3      8      2

I have seen other similar problems in which adding an extra column to keep track of the source data.frame is used, but I was wondering why does not the suffix term work with multiple joins.

PS: If I keep the first suffix empty, I can add suffixes in the second join, but that will leave the col2 and col3 form x without suffix.

> df = full_join(x, y, by = "col1", suffix = c("","_y")) %>% 
  full_join(z, by = "col1", suffix = c("","_z"))

> df
  col1 col2 col3 col2_y col3_y col2_z col3_z
1    a    1    1     NA     NA      9      3
2    b    2    2      4      1     NA     NA
3    c    3    3      5      2      7      1
4    d   NA   NA      6      3      8      2
Cettt
  • 11,460
  • 7
  • 35
  • 58
eggrandio
  • 135
  • 11
  • Thanks for your reply, but in my real dataset I have many columns, calling them by name is tedious, and calling them by index can be messed up when I start merging other datasets and indices change. I was surprised that the second suffix term does not work. I just want to add a suffix to the new added columns, that should be simple. – eggrandio Jan 20 '21 at 20:18

1 Answers1

2

You can do it like this:

full_join(x, y, by = "col1", suffix = c("","_y")) %>% 
  full_join(z, by = "col1", suffix = c("_x","_z"))

  col1 col2_x col3_x col2_y col3_y col2_z col3_z
1    a      1      1     NA     NA      9      3
2    b      2      2      4      1     NA     NA
3    c      3      3      5      2      7      1
4    d     NA     NA      6      3      8      2

Adding the suffix for xat the last join should do the trick.

Cettt
  • 11,460
  • 7
  • 35
  • 58