Regarding the merge of two dataframes

Question

I have a lot of data structured as shown below. In total there are 13 dataframes like the one shown below. All have the same columns.

Example of data

Each dataframe has about 500,000 rows and 106 columns. I want to combine them in the following way:

If the first AND second columns of a row in df1 are equal to the first and second columns of a row in df2, I want to add the two rows together; otherwise I want to add the row to the dataframe.

I have created the following code for a minimal example (it gives me the desired result, but it will not work at the scale I am working with):

dput(df1[,1:5 ]) 
structure(list(C5id = c("100110", "100110", "100110", "100110", 
"100100", "100100", "100100", "100100", "100100", "100100"), 
    Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon", 
    "Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst", 
    "Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli", 
    "Hakkebøf, bearnaisesauce, kartofler, ærter", "Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål", 
    "Fiskefrikadeller med persillesovs, kartofler og juliennegrønt", 
    "Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt", 
    "Forloren hare med vildtsovs, kartofler og tyttebærsylt", 
    "Frikadeller med skysovs, kartofler og sellerichutney"), 
    a2018uge2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), a2018uge3 = c("2", 
    "2", "2", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("2", 
    "2", "2", "2", "2", "2", "2", "2", "2", "2")), class = "data.frame", row.names = 4:13)
dput(df2[,1:5 ])
structure(list(C5id = c("100110", "100110", "100100", "100100", 
"100100", "100100", "100100", "100100", "100100", "100100", "100110", 
"100110", "100100", "100100", "100100", "100100", "100100"), 
    Retnavn = c("Braiserede kæber af gris, tomat-skysovs, kartofler, ovnbagte bønner med bacon", 
    "Braiseret okseinderlår, skysovs, kartofler, marinerede rødløg med hyldeblomst", 
    "Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli", 
    "Hakkebøf, bearnaisesauce, kartofler, ærter", "Hamburgerryg, flødekartofler, blomkål, broccoli og romanesco", 
    "Kylling i karrysovs med æbler og ingefær, kartofler, cherrytomater med løg", 
    "Kylling i sur-sød sovs med peberfugt, kartofler og broccoli", 
    "Kyllingefrikassé med kartofler", "Lammesteg, flødekartofler, ovnbagte grønne bønner med bacon", 
    "Cremet champignonsuppe", "Forårsfrikassé med kalv, asparges og forårsløg, kartofler, broccoli", 
    "Farsbrød med gulerødder og ærter, legeret sovs, kartofler og romanescokål", 
    "Fiskefrikadeller med persillesovs, kartofler og juliennegrønt", 
    "Fiskefrikadeller med remouladesovs, kartofler og juliennegrønt", 
    "Forloren hare med vildtsovs, kartofler og tyttebærsylt", 
    "Frikadeller med skysovs, kartofler og sellerichutney"), 
    a2018uge2 = c(3, 3, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 
    2, 2), a2018uge3 = c("3", "3", "1", "1", "3", "1", "1", "1", 
    "1", "1", "2", "2", "2", "2", "2", "2", "2"), a2018uge4 = c("3", 
    "3", "1", "1", "3", "1", "1", "1", "1", "1", "2", "2", "2", 
    "2", "2", "2", "2")), class = "data.frame", row.names = c("5", 
"6", "7", "8", "9", "10", "11", "12", "13", "14", "61", "71", 
"91", "101", "111", "121", "131"))


df2_before = df2
hej=c()
for (i in 1:length(df2$C5id)) {
  for (j in 1:length(df1$C5id)) {
    if (df2$C5id[i] == df1$C5id[j]  &&  df2$Retnavn[i] == df1$Retnavn[j]) {
      df2[j, 3:8 ] <- as.numeric(df2[i,3:8 ]) + as.numeric(df1[j,3:8 ])
      hej=c(hej,j)
      #df1 = df1[-i, ]
    }
  }
  cat("vi er kommet til:",i,",",j,"\n")
}
df2=rbind(df2,df1[-hej,])

where df1 and df2 are the two dataframes. My problem is that this has to loop through 500,000 × 500,000 different combinations. I have 13 dataframes of this size in total that have to be combined, so it would take an absolute eternity.

I was hoping there would be some vectorized way to do this that might finish before the fall of 2030.

Best regards

P.S. I understand that the way I inserted the data in this post might not be the best, but it is the best I could think of.

P.P.S. I have edited the question in response to MKR's comment.

Please read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 on how to provide example data on Stack Overflow. — jay.sf, Mar 24 '20 at 08:33
Try to post a minimal example and also the expected output. I don't quite understand what you would like to do. "If the first AND second collumn are equal i want to add the two rows together, otherwise i want to add the row to the dataframe." . To which dataframe? That's confusing to me. — MKR, Mar 24 '20 at 08:39
I'd first put all dataframes with ids into one dataframe `allData <- rbindlist(listOfMyDataframes, idcol = TRUE)`, then do all the math by aggregating. — zx8754, Mar 24 '20 at 08:39

Raphaele Adjerad · Accepted Answer · 2020-03-30T07:21:27.070

I suggest the following:

library(data.table)
df1 <- data.table::setDT(df1)
df2 <- data.table::setDT(df2)
data.table::setkeyv(df1, c("C5id","Retnavn"))
data.table::setkeyv(df2, c("C5id","Retnavn"))

new_df2 <- merge(df1,df2, all.y = TRUE)
cols <- names(new_df2[,3:ncol(new_df2)])
new_df2[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
new_df2[, (cols) := lapply(.SD, function(i)
                           tidyr::replace_na(i,0)), .SDcols = cols]

sapply(new_df2, class)

This converts the value columns to numeric:

      C5id     Retnavn a2018uge2.x a2018uge3.x a2018uge4.x a2018uge2.y a2018uge3.y a2018uge4.y 
"character" "character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"

Then, building on the question "R: merging columns and the values if they have the same column name" and @bgoldst's solution there:

# First, strip the ".x"/".y" merge suffixes so that matching columns share a name.
# The dot is escaped and the pattern anchored so only a trailing suffix is removed.
names(new_df2) <- stringr::str_replace(names(new_df2), "\\.[xy]$", "")

temp = do.call(cbind,lapply(split(as.list(new_df2[,3:ncol(new_df2)]),
                                  names(new_df2[,3:ncol(new_df2)])),
                            function(x) Reduce(`+`,x)));

new_df2 <- cbind(new_df2[,1:2],temp)
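
For all 13 dataframes, an alternative is to follow the idea in zx8754's comment on the question: stack everything with `rbindlist` and aggregate by the two key columns. What follows is only a minimal sketch under some assumptions: the toy `df1` and `df2` and their values are made up for illustration, the names `key_cols`, `dfs`, `all_data`, `value_cols` and `combined` are hypothetical, and every non-key column is assumed to hold values that convert cleanly to numeric.

```r
library(data.table)

# Toy stand-ins for the real data (hypothetical values). One week column is
# stored as character on purpose, as in the dput output in the question.
df1 <- data.frame(C5id = c("100110", "100100"),
                  Retnavn = c("Cremet champignonsuppe", "Frikadeller"),
                  a2018uge2 = c(2, 2),
                  a2018uge3 = c("2", "2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C5id = c("100110", "100100"),
                  Retnavn = c("Cremet champignonsuppe", "Lammesteg"),
                  a2018uge2 = c(3, 1),
                  a2018uge3 = c("3", "1"),
                  stringsAsFactors = FALSE)

key_cols <- c("C5id", "Retnavn")
dfs <- list(df1, df2)  # in practice: a list holding all 13 dataframes

# Stack all tables into one long table, matching columns by name.
all_data <- rbindlist(dfs, use.names = TRUE)

# Convert every non-key column to numeric so the values can be summed.
value_cols <- setdiff(names(all_data), key_cols)
all_data[, (value_cols) := lapply(.SD, as.numeric), .SDcols = value_cols]

# One row per (C5id, Retnavn): rows sharing a key are summed,
# rows whose key occurs only once are carried over unchanged.
combined <- all_data[, lapply(.SD, sum, na.rm = TRUE),
                     by = key_cols, .SDcols = value_cols]
print(combined)
```

Note that this also sums rows that share a key within a single dataframe, which may or may not be what you want, and that `na.rm = TRUE` treats missing values as zero, much like the `replace_na` step above.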
