
My problem is very similar to the one posted here.

The difference is that they knew which columns would conflict, whereas I need a generic method that won't know in advance which columns conflict.

example:

TABLE1
Date             Time    ColumnA    ColumnB
01/01/2013      08:00      10         30
01/01/2013      08:30      15         25
01/01/2013      09:00      20         20
02/01/2013      08:00      25         15
02/01/2013      08:30      30         10
02/01/2013      09:00      35         5

TABLE2
Date           ColumnA    ColumnB    ColumnC
01/01/2013      100        300         1
02/01/2013      200        400         2

Table 2 only has dates, so it is applied to all rows in Table 1 that match the date, regardless of time.

I would like the merge to sum the conflicting columns into one. The result should look like this:

TABLE3
Date             Time    ColumnA    ColumnB    ColumnC
01/01/2013      08:00      110         330        1
01/01/2013      08:30      115         325        1
01/01/2013      09:00      120         320        1
02/01/2013      08:00      225         415        2
02/01/2013      08:30      230         410        2
02/01/2013      09:00      235         405        2

At the moment, my standard merge just creates duplicate columns: "ColumnA.x", "ColumnA.y", "ColumnB.x", "ColumnB.y".
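For reference, this is roughly what I'm running at the moment (a minimal sketch, assuming table1 and table2 are data frames read from the tables above):

# base-R merge keeps both versions of the overlapping columns,
# suffixing them .x and .y instead of summing them
table3 <- merge(table1, table2, by = "Date", all.x = TRUE)
names(table3)
# "Date" "Time" "ColumnA.x" "ColumnB.x" "ColumnA.y" "ColumnB.y" "ColumnC"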

Any help is much appreciated.

EvilWeebl
  • I would probably not merge; I would rbind.fill and then aggregate by the key columns with data.table or ddply. – frankc Feb 06 '13 at 14:57
  • Sounds good so far, could you elaborate? Merging is about the peak of my abilities so far and I haven't used any of those functions yet. – EvilWeebl Feb 06 '13 at 15:03

3 Answers


If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:

merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
    # All column names that appear in either table
    merged_Columns <- unique(c(names(.df1), names(.df2)))
    # Empty result with one row per row of the master table .df1
    merged_df1 <- data.frame(matrix(nrow = nrow(.df1), ncol = length(merged_Columns)))
    names(merged_df1) <- merged_Columns
    for (column in merged_Columns){
        if (column %in% .id_Columns | !column %in% names(.df2)){
            # Key columns and columns that exist only in .df1: copy straight over
            merged_df1[, column] <- .df1[, column]
        } else if (!column %in% names(.df1)){
            # Columns that exist only in .df2: pull them in via the matching rows
            merged_df1[, column] <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
        } else {
            # Conflicting columns: add the matched .df2 values (NAs from unmatched rows become 0)
            df1_Values <- .df1[, column]
            df2_Values <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
            df2_Values[is.na(df2_Values)] <- 0
            merged_df1[, column] <- df1_Values + df2_Values
        }
    }
    return(merged_df1)
}

This function assumes you have a table '.df1' that is a master of sorts, and that you want to merge in data from a second table '.df2' whose rows match one or more of the rows in '.df1'. The columns to preserve from the master table '.df1' are passed as the character vector '.id_Columns', and the columns that provide the match for merging the two tables are passed as the character vector '.match_Columns'.

For your example, it would work like this:

merge_Sum(table1, table2, c("Date","Time"), "Date")

#   Date       Time  ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00     110     330       1
# 2 01/01/2013 08:30     115     325       1
# 3 01/01/2013 09:00     120     320       1
# 4 02/01/2013 08:00     225     415       2
# 5 02/01/2013 08:30     230     410       2
# 6 02/01/2013 09:00     235     405       2

In plain language, this function first finds the full set of unique column names and makes an empty data frame in the shape of the master table '.df1' to later hold the merged data. Then, for the '.id_Columns', the data is copied from '.df1' into the new merged data frame. For the other columns, any data that exists in '.df1' is added to the matching data in '.df2', where the rows of '.df2' are matched based on the '.match_Columns'.
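To make the matching step concrete, here is a small illustration (assuming table1 and table2 are data frames read from the question's tables, with "Date" as the single match column):

# match() returns, for each row of table1, the index of the matching row in table2
match(table1$Date, table2$Date)
# [1] 1 1 1 2 2 2
# rows 1-3 of table1 pick up table2 row 1, and rows 4-6 pick up table2 row 2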

There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.

Update: The original function assumed a many-to-one relationship between table1 and table2; the OP also requested that a many-to-none relationship be allowed. The code has been updated with slightly less efficient but 100% more flexible logic.
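As a quick check of the many-to-none case, here is a small example (the 03/01/2013 row is hypothetical and not part of the original question):

table1x <- read.table(header = TRUE, text = "Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
03/01/2013 08:00  1  2")

merge_Sum(table1x, table2, c("Date", "Time"), "Date")
#         Date  Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00     110     330       1
# 2 03/01/2013 08:00       1       2      NA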

Dinre
  • This looks really excellent and I'm going to give it a try, but a quick question about '.id_Columns': I understand that it takes Date and Time as they are concrete and not subject to being overwritten, but they seem to be the only ones that initially get copied over. What if my table1 had a column called ColumnZ that was not matched in table2? Would I have to specify in '.id_Columns' all the columns that don't match? – EvilWeebl Feb 06 '13 at 16:29
  • No. Only the known columns that you want to preserve need to be in the '.id_Columns' argument. These are the columns you want the function to essentially ignore and just copy straight over. Other columns that only exist in one table will still copy over fine, but only after the function checks for their presence in both tables and attempts to add the values together. – Dinre Feb 06 '13 at 16:32
  • This is working brilliantly. I'm having a little hiccup where, if table 1 has rows with a date that table 2 does not have, the values for the common columns are set to NA instead of keeping the value from table 1, but other than that it's great. Thanks! – EvilWeebl Feb 06 '13 at 17:23
  • Hmm. It sounds like you potentially have a 'many:0,1' relationship that I wasn't considering, since I was assuming a 'many:1' relationship. I'll have to see about writing in that logic. – Dinre Feb 06 '13 at 17:27
  • @EvilWeebl, I have addressed the problem in the updated code. Enjoy! – Dinre Feb 06 '13 at 17:58

A data.table solution:

dt1 <- data.table(read.table(header=T, text="Date             Time    ColumnA    ColumnB
01/01/2013      08:00      10         30
01/01/2013      08:30      15         25
01/01/2013      09:00      20         20
02/01/2013      08:00      25         15
02/01/2013      08:30      30         10
02/01/2013      09:00      35         5"))

dt2 <- data.table(read.table(header=T, text="Date           ColumnA    ColumnB    ColumnC
01/01/2013      100        300         1
02/01/2013      200        400         2"))

setkey(dt1, "Date")
setkey(dt2, "Date")
# Note: The ColumnC assignment has to come before the summing operations,
# else it gives an error (see below)
dt1[dt2, `:=`(ColumnC = i.ColumnC, ColumnA = ColumnA + i.ColumnA, 
                        ColumnB = ColumnB + i.ColumnB)]

#          Date  Time ColumnA ColumnB ColumnC
# 1: 01/01/2013 08:00     110     330       1
# 2: 01/01/2013 08:30     115     325       1
# 3: 01/01/2013 09:00     120     320       1
# 4: 02/01/2013 08:00     225     415       2
# 5: 02/01/2013 08:30     230     410       2
# 6: 02/01/2013 09:00     235     405       2

I'm not sure why placing the ColumnC assignment at the end throws this error. Perhaps MatthewDowle could explain the cause of this error.

dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB, 
                        ColumnC = i.ColumnC)]

Error in `[.data.table`(dt1, dt2, `:=`(ColumnA = ColumnA + i.ColumnA,  : 
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'NULL'

Update from data.table v1.8.9:

o   Mixing adding new with updating existing columns into one `:=`() by group, i.e.,
    DT[, `:=`(existingCol=..., newCol=...), by=...],
    now works without error or segfault, #2778 and #2528. Many thanks to Arun for
    reporting both with reproducible examples. Tests added.
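Given that fix, the ordering that errored above should now work; a minimal sketch, assuming data.table 1.8.9 or later and dt1/dt2 rebuilt as at the top of this answer (the earlier `:=` call already modified dt1 by reference):

dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB,
              ColumnC = i.ColumnC)]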

Matt Dowle
Arun
  • This does look really good, but you're forgetting that Table2 is going to be a table that I really know nothing about: it could contain columns that match or it may not, so I cannot explicitly choose the columns to combine. Maybe something like a for loop over matching column names? – EvilWeebl Feb 06 '13 at 15:39
  • `ColumnC` is being added to `dt1` but ColumnA and ColumnB are being updated. Seems like there's a bug here where this mixed add/update doesn't like the adds at the end for some reason. Thanks! Have filed [bug#2528](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2528&group_id=240&atid=975). – Matt Dowle Feb 06 '13 at 15:41
  • Any ideas as to how to apply this without knowing conflicting columns outside of runtime? – EvilWeebl Feb 06 '13 at 16:04
  • @EvilWeebl, seems that I read in between the lines. I'll give it a thought and write back if I am able to come up with a solution. – Arun Feb 06 '13 at 16:19
  • What is the syntax when the list of columns to operate on (ColumnA, ColumnB, etc.) is too long to specify in text? `cols <- colnames(dt2[,2:4])` and ``dt1[dt2, (cols) := lapply(.SD, function(x){`:=`(x = x + paste0("i.",x))}), .SDcols = cols]`` does not work. – Kayle Sawyer Dec 11 '18 at 23:06

I wrote the package safejoin, which solves this very succinctly:

#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)

safe_full_join(df1, df2, by = "Date", conflict = `+`)
#         Date  Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00     110     330       1
# 2 01/01/2013 08:30     115     325       1
# 3 01/01/2013 09:00     120     320       1
# 4 02/01/2013 08:00     225     415       2
# 5 02/01/2013 08:30     230     410       2
# 6 02/01/2013 09:00     235     405       2

In case of conflict, the function `+` is applied to each pair of conflicting columns.

data

df1 <- read.table(header=T, text="Date             Time    ColumnA    ColumnB
01/01/2013      08:00      10         30
01/01/2013      08:30      15         25
01/01/2013      09:00      20         20
02/01/2013      08:00      25         15
02/01/2013      08:30      30         10
02/01/2013      09:00      35         5")

df2 <- read.table(header=T, text="Date           ColumnA    ColumnB    ColumnC
01/01/2013      100        300         1
02/01/2013      200        400         2")
moodymudskipper
  • I used it and it actually works nicely (careful: the 'by' column should have only comparable values when using the conflict = `+` option). Thanks!! – Gildas Jan 24 '20 at 14:39