Merging two dataframes on one column without duplicating rows while preserving more data

Question

My goal is to merge two large data frames on the column genus, under two conditions: rows must not be duplicated (my first try fails at this), and information from both data frames must be preserved (my second try fails at this). Please see the desired output below:

chromdata <- read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2
5       other              1   # EDIT: new rows    
6       other              2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

treedata <- read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4
5       other              3",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

# First try: duplicates rows (every chromdata row pairs with every treedata row of the same genus)
merge(chromdata, treedata, by = "genus", all = FALSE)

# Second try: match() returns only the first match per genus, so Acosta_4 and Aguilera_4 are lost
chromdata$sp2 <- treedata$sp[match(chromdata$genus, treedata$genus)]
chromdata
     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_3 # Acosta_4 missing
4 Aguilera Aguilera_2 Aguilera_3 # Aguilera_4 missing
5    other          1          3
6    other          2          3

Desired Output:

     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_4
4 Aguilera Aguilera_2 Aguilera_4
5    other          1          3 # EDIT: new rows
6    other          2          3

score 1 · Answer 1 · answered Oct 05 '18 at 19:08

You can add another column to merge on, namely a running row id within each genus:

library(data.table)
merge(
  transform(chromdata, r = rowid(genus)), 
  transform(treedata, r = rowid(genus)), 
  by=c("r", "genus")
)

  r    genus       sp.x       sp.y
1 1   Acosta   Acosta_1   Acosta_3
2 1 Aguilera Aguilera_1 Aguilera_3
3 2   Acosta   Acosta_2   Acosta_4
4 2 Aguilera Aguilera_2 Aguilera_4

If you don't want to load data.table, you can also compute the row id with something like ave(genus, genus, FUN = seq_along), among many other ways.
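As a sketch of the base-R route mentioned above, the per-genus counter can be built with ave() and then merged on, just as in the data.table version. The data frames below re-create the first four rows of the question's example. The column name r follows the answer; passing seq_along(genus) as the first argument to ave() is a choice made here so that the counter stays an integer instead of inheriting the character type of genus.

```r
# First four rows of each example table from the question
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4"),
  stringsAsFactors = FALSE
)

# Occurrence counter within each genus: 1 for the first Acosta row,
# 2 for the second Acosta row, and likewise for every other genus
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# Merging on both columns pairs the k-th row of a genus in one table
# with the k-th row of that genus in the other table
res <- merge(chromdata, treedata, by = c("r", "genus"))
print(res)
```

Because the key is now unique within each table, each row finds at most one partner, which is what prevents the duplication seen in the first try.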

Frank

I found a case in which the answer does not work, see edit. – Ferroao Nov 09 '18 at 11:28

Solved by adding all.x = TRUE to your merge() call and then filling the gaps: library(tidyverse); df %>% group_by(genus) %>% fill(sp.y) – Ferroao Nov 09 '18 at 12:23
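The fix described in the comment above can be sketched without extra packages. This is a minimal version under a few assumptions: the row counter r is built with base ave() instead of data.table's rowid(), and fill_down() is a hypothetical helper written here to stand in for tidyr's fill(). It carries the last non-missing value forward within each genus.

```r
# Example tables from the question, including the edited "other" rows
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3"),
  stringsAsFactors = FALSE
)

# Per-genus occurrence counter used as the second merge key
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# all.x = TRUE keeps chromdata rows that have no partner in treedata
# (here the second "other" row), with NA in sp.y
res <- merge(chromdata, treedata, by = c("r", "genus"), all.x = TRUE)
res <- res[order(res$genus, res$r), ]

# Hypothetical helper: replace each NA with the last non-NA value before it
fill_down <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the last non-NA so far
  idx[idx == 0] <- NA                        # leading NAs have nothing to copy
  x[idx]
}

# Fill the gaps separately within each genus
res$sp.y <- ave(res$sp.y, res$genus, FUN = fill_down)
print(res)
```

With the question's edited data this gives six rows, and the second other row receives the value 3 carried down from the first other row, matching the desired output.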

score 0 · Answer 2 · answered Oct 05 '18 at 20:12

I want to elaborate more on the data.table approach.

First, you can read your data and convert it directly to a data.table object:

library(data.table)

chromdata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

treedata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

After that, you need an extra column to merge on in order to achieve your desired output:

# N numbers the rows within each genus: 1, 2, ...
chromdata[, N := seq_len(.N), by = genus]
treedata[, N := seq_len(.N), by = genus]

These lines give you the row ids within each genus group.

Lastly, with the help of the data.table package, we can join these two tables on their common columns:

chromdata[treedata, on = c("genus", "N")]

The final output:

      genus         sp N       i.sp
1:   Acosta   Acosta_1 1   Acosta_3
2: Aguilera Aguilera_1 1 Aguilera_3
3:   Acosta   Acosta_2 2   Acosta_4
4: Aguilera Aguilera_2 2 Aguilera_4
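One caveat is worth illustrating: in data.table, X[Y, on = ...] looks up the rows of Y in X, so the join above keeps every row of treedata, not every row of chromdata. The sketch below uses the question's edited data (with the extra other rows) to show the difference. The names by_tree and by_chrom are made up for this illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Example tables from the question, including the edited "other" rows
chromdata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2")
)
treedata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3")
)

# Row ids within each genus, as in the answer
chromdata[, N := seq_len(.N), by = genus]
treedata[, N := seq_len(.N), by = genus]

# Keeps all rows of treedata: the second "other" row of chromdata is dropped
by_tree <- chromdata[treedata, on = c("genus", "N")]

# Keeps all rows of chromdata: its second "other" row has no partner,
# so the sp column (taken from treedata) is NA there
by_chrom <- treedata[chromdata, on = c("genus", "N")]

print(by_tree)
print(by_chrom)
```

In by_chrom the column sp comes from treedata and i.sp from chromdata, the reverse of the output shown above, so renaming the columns after the join may make the result easier to read.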

@Ferroao I know; as I said, I wanted to elaborate more on the data.table perspective. – Cem Oct 10 '18 at 07:30
