Merge two dataframes by different sum combinations and key columns

Question

Sample data:

Bilagstoptekst <- c("A", "A", "A", "A", "A","B","B","C","C","C","C","C","C","C")
AKT <- c("80","80","80","25","25","25","25","80","80","80","80","80","80","80")
IA <- c("HUVE", "HUVE", "HUBO", "BILÅ", "BILÅ", "BILÅ","BILÅ", "HUBO","HUBO","HUBO","HUBO","HUBO","HUBO","HUBO")
Belob <- c(100,100,50,75,40,60,400,100,100,100,100,100,333,333)
FPT8 <- data.frame(Bilagstoptekst, AKT, IA, Belob)

> FPT8
 Bilagstoptekst AKT   IA   Belob
           A    80   HUVE   100
           A    80   HUVE   100
           A    80   HUBO    50
           A    25   BILÅ    75
           A    25   BILÅ    40
           B    25   BILÅ    60
           B    25   BILÅ   400
           C    80   HUBO   100
           C    80   HUBO   100
           C    80   HUBO   100
           C    80   HUBO   100
           C    80   HUBO   100
           C    80   HUBO   333
           C    80   HUBO   333

Bilagstoptekst <- c("A", "A", "A", "A", "B", "C", "C")
AKT <- c("80", "80", "25", "25", "25", "80", "80")
IA <- c("HUVE", "HUBO", "BILÅ", "BILÅ", "BILÅ", "HUBO", "HUBO")
RegKonto <- c(4,5,7,1,6,3,9)
Psteksnr <- c(1,6,8,2,5,7,9)
Belob_sum <- c(200,50,75,40,460,500,666)
G69 <- data.frame(Bilagstoptekst, AKT, IA, RegKonto, Psteksnr, Belob_sum)

> G69
Bilagstoptekst AKT   IA     RegKonto Psteksnr Belob_sum
          A    80   HUVE        4        1       200
          A    80   HUBO        5        6        50
          A    25   BILÅ        7        8        75
          A    25   BILÅ        1        2        40
          B    25   BILÅ        6        5       460
          C    80   HUBO        3        7       500
          C    80   HUBO        9        9       666

Now, my real dataset is very large.

What I want to do is to merge the RegKonto and Psteksnr from G69 to FPT8.

I have three key columns, that should match each other in the two dataframes:

Bilagstoptekst, AKT, IA.

But I can't just to a left_join using those since there's another rule. The FPT8$Belob should match the G69$Belob_sum. And sometimes it does match (fx in my example data row 3). Sometimes I can find the match by adding all the FPT8$Belob together and match that number (combined with my 3 key columns) with the G69$Belob_sum (fx in row 1 and 2).

But sometimes it's random which rows to add together to find the right match (well in reality it's not random, but it sure feels like it!). Like the last rows with bilagstoptekst == C.

What I'm asking is, if there's a way to add up different combinations and use those for merging.

Expected output:

> FPT8
 Bilagstoptekst AKT   IA   Belob  RegKonto Psteksnr 
           A    80   HUVE   100       4        1 
           A    80   HUVE   100       4        1 
           A    80   HUBO    50       5        6  
           A    25   BILÅ    75       7        8
           A    25   BILÅ    40       1        2
           B    25   BILÅ    60       6        5
           B    25   BILÅ   400       6        5
           C    80   HUBO   100       3        7 
           C    80   HUBO   100       3        7 
           C    80   HUBO   100       3        7 
           C    80   HUBO   100       3        7 
           C    80   HUBO   100       3        7 
           C    80   HUBO   333       9        9
           C    80   HUBO   333       9        9

What I've already tried:

I've spreaded out - for each row of keys - what different values of FPT8$Belob there is.

dt <- as.data.table(FPT8)

dt[, idx := rowid(Bilagstoptekst, AKT, IA)] # creates the timevar

out <- dcast(dt, 
         Bilagstoptekst + AKT + IA~ paste0("Belob", idx),
         value.var = "Belob")

And then I made different combinations of the sums of the FPT8$Belob which I spreaded out:

# Adding together two different FPT8$Belob - all combinations
output <- as.data.frame(combn(ncol(out[,-c(1:3)]), m=2, FUN =function(x) rowSums(out[,-c(1:3)][x])))
names(output) <- paste0("sum_", combn(names(out[,-c(1:3)]), 2, FUN = paste, collapse="_"))

After this I merged forth and back, and I really don't wanna go in to this part, because it was a mess when I had more than 4 different FPT8$Belob per key (the 3 columns). So I definitely need a more smooth way to do this.

Hope someboy can help me.

EDIT: How the rows pair up and a bit more explanation

So my FPT8 data is a bunch of payments (Belob means amount of money). G69 data is the bills. I need to find the right match, but my problem is that sometimes people choose to split their bill up in smaller payments. Therefore the FPT8 data is larger than the G69 data.

Let me explain..

I have 4 key columns to match upon: Bilagstoptekst, AKT, IA and Belob. The 3 first should always find an exact match in the FPT8-data. Sometimes Belob do match up with the Belob_sum in G69 (row by row), sometimes we need a sum combination of the FPT8 Belob rows within the same in Bilagstoptekst, AKT and IA to match with the Belob_sum in G69. Let me try to show it with my sample data below.

FPT8:

Based on my 3 key columns **Bilagstoptekst*, AKT and IA, the first two rows are "the same" (i.e. the same bill payed over two times). I've added and ID column as the first column, which I don't have in my real data. This is just for explanation. So these two rows I call ID=1.

Row number 3 (ID = 2) doesn't pair up with other rows in my sample FPT8 data, since there isn't any other with the combination of key columns (i.e. the person payed the whole bill in once - this one will be easy to match with the G69 bill information).

In the bottum all the Bilagstoptekst==C have the same combination of the three key columns (C, 80 and HUBO). That's the same bill. But these are not the same bill. In this case I can find two matches in the G69 data. How do I know which one is the right one? I look at the FPT8$Belob and the G69$Belob_sum columns.

G69:

So if I'd do this manually I'd try to find different combinations of sums in the FPT8$Belob that matches the G69$Belob_sum together with the other 3 key columns. Fx I can see that the two last rows adds up to 666 in Belob which matches the last row in G69. The other Bilagstoptekst==C, AKT=80 and IA=HUBO matches the second last row in G69 since 100*5=500.

Desired output:

I've added some colors, so I hope it's easier to understand now.

Hi! Can you please add the expected output to your post. You also need to clarify the logic. For example, "_I can find the match by adding all the FPT8$Belob together_" and "_But sometimes it's random which rows to add together to find the right match (well in reality it's not random, but it sure feels like it!)_" is a bit too cryptic for anyone trying to help ;). Try to exemplify each 'rule' explicitly using relevant parts of your input and output. Cheers — Henrik, Jan 04 '19 at 15:45
Do you mean you want to merge based on 4 columns, of which three have the same name and the fourth have different names between datasets? Or am I missing something? — P1storius, Jan 04 '19 at 15:45
I've added the e3xpected output and some images of how they pair up (from Excel). Is it more clear now? :) — Louise Sørensen, Jan 04 '19 at 15:53
Thanks! Can you please also elaborate, in words, the rules which I asked about in the comment. If it "feels random" for you, you can bet it does for us ;) — Henrik, Jan 04 '19 at 15:57
Explain in words what the desired output is in terms of the input. "Merge", "match each other in the two dataframes", "how the rows pair up", etc are not clear. Use enough words, phrases, sentences & references to parts of examples to clearly say what you mean. PS Use text, not images/links, for what can be given in text. — philipxy, Jan 04 '19 at 20:30
Sorry, English is not my first language.. I've tried to explain it in the end (under Edit), I hope this helps. — Louise Sørensen, Jan 07 '19 at 07:59
Hi again. Thanks for the additional explanation. If I understood you correctly: why don't you just add a column of sum 'Belop' to FPT8 and join on that column as well? See [Calculate group mean (or other summary stats) and assign to original data](https://stackoverflow.com/questions/6053620/calculate-group-mean-or-other-summary-stats-and-assign-to-original-data) and [How to join (merge) data frames (inner, outer, left, right)?](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) — Henrik, Jan 07 '19 at 09:52
I did try that, but sometimes it's not just the sum of Belob. Sometimes it's only fx 3 out for 5 rows with the same key columns that match one row in G69, while the sum of the 2 other match another row in G69 — Louise Sørensen, Jan 07 '19 at 10:00
OK. But why then do you only show cases where the sum _do_ match? (5*100 = 500 and 2*333 = 666). Please provide an example of _sufficient_ complexity. You can easily reduce the number of cases which _do_ match (one is enough, trivial). Rather add all the cases necessary to illustrate your _actual problem_. Providing a _relevant minimal_ example is the key to get relevant help. I give up here. — Henrik, Jan 07 '19 at 10:06

MrGumble · Answer 1 · 2019-01-08T08:54:25.223

Hej!

I cannot see any 1 step solution, but with some simple rules we can match them stepwise.

Also, the final output cannot match your nice images, as you haven't included that data (the column reg.nr.).

First, the simplest: Where Belob matches Belob_sum directly because there is only one line:

library(dplyr)

# Rule 1: Easy matching -----

s1 <- inner_join(FPT8, G69, by=c('Bilagstoptekst','AKT','IA','Belob'='Belob_sum'))
not_matched1 <- anti_join(FPT8, s1,by=c('Bilagstoptekst','AKT','IA'))

The last line checks what wasn't matched. So we apply rule 2, use a grouped sum:

# Rule 2: Calculate Belob_sum to match by ---------------
s2 <- not_matched1 %>% group_by(Bilagstoptekst, AKT, IA) %>% 
  mutate(Belob_sum=sum(Belob)) %>%
  inner_join(G69, by=c('Bilagstoptekst','AKT','IA','Belob_sum'))

matched <- bind_rows(s1, s2)

not_matched2 <- anti_join(FPT8, matched, by=c('Bilagstoptekst','AKT','IA'))

Again, we check what wasn't matched and combine the two. Then, rule 3. Now this is tricky and it only works based on the assumption that the payments are divided equally.

# Rule 3: More gætværk ---------------
# We assume the payed amounts are divided *equally*
s3 <- not_matched2 %>% group_by(Bilagstoptekst, AKT, IA, Belob) %>% 
  mutate(Belob_sum=sum(Belob)) %>%
  inner_join(G69, by=c('Bilagstoptekst','AKT','IA','Belob_sum'))

matched <- bind_rows(matched, s3)

not_matched3 <- anti_join(FPT8, matched, by=c('Bilagstoptekst','AKT','IA')) 
# not_matched3 is now empty!

> matched
   Bilagstoptekst AKT   IA Belob RegKonto Psteksnr Belob_sum
1               A  80 HUBO    50        5        6        NA
2               A  25 BILÅ    75        7        8        NA
3               A  25 BILÅ    40        1        2        NA
4               A  80 HUVE   100        4        1       200
5               A  80 HUVE   100        4        1       200
6               B  25 BILÅ    60        6        5       460
7               B  25 BILÅ   400        6        5       460
8               C  80 HUBO   100        3        7       500
9               C  80 HUBO   100        3        7       500
10              C  80 HUBO   100        3        7       500
11              C  80 HUBO   100        3        7       500
12              C  80 HUBO   100        3        7       500
13              C  80 HUBO   333        9        9       666
14              C  80 HUBO   333        9        9       666

Now, if your group 'C' wasn't divided into large groups, you're in a bit of a pickle, and would have to resort to either manually curating the data to identify same groups, or applying some other algorithm to try to match the best groups under the given restrictions.

**Final tip: **

R and dplyr can work with the danish letters, but it is a pisse træls. When referring to a column as a string, it's straightforward:

FPT8[,'Beløb']

However, if you use them in dplyr as variable names, use back ticks:

FPT8 %>% summarise(`Beløb_sum`=sum(`Beløb`))

Update:

I've found a solution that can sum various, unequally divided groups based on an iterative approach. This is an example that you will have to refactor into a 4th step for your solution. But if you can restrict the search to those with matching "Bilagstoptekst", "AKT" and "IA", I guess you should be good to go.

groups <- data.frame(name=letters[1:4], sumsize=c(100,130, 80,99), stringsAsFactors = FALSE)
subpayments <- data.frame(paid=c(50,40,10,50,43,37,20,25,20,15,42,57))
stopifnot(sum(groups$sumsize) == sum(subpayments$paid))
subpayments$id <- 1:nrow(subpayments)

groups <- groups[order(groups$sumsize, decreasing=TRUE),]
subpayments <- subpayments[order(subpayments$paid, decreasing=TRUE),]
subpayments$group <- NA

for (g in seq_along(groups$name)) {
  sumsize <- 0
  #subpayments$tried <- FALSE
  maxsize <- groups$sumsize[g]
  path <- c()
  attemptspath <- list()
  attempts <- vector('logical', nrow(subpayments))
  #attempts[1] <- TRUE
  #attemptspath <- list(1)
  i <- 0

  while (sumsize < maxsize) {
    #browser()
    last_i <- i
    i <- min(which(subpayments$paid <= (maxsize - sumsize) & !attempts & is.na(subpayments$group)))
    if (is.infinite(i)) {
      # current path did not succed, backpeddle and try another route
      #cat('is infinite.', i, 'path', path, '\n')
      #cat('attempts:', attempts, '\n')
      if (length(path) == 0) {
        # at the beginning again and exhausted our attempts
        break
      }
      if (is.infinite(last_i)) {
        attempts[attemptspath[[length(path)+1]]]  <- FALSE
        attemptspath[[length(path)+1]] <- logical(0)
        #last <- path[length(path)]
        #path <- path[-length(path)]
        #sumsize <- sumsize - subpayments$paid[last]

      }
      # backpeddle; remove last attempt and retry
      last <- path[length(path)]
      path <- path[-length(path)]
      sumsize <- sumsize - subpayments$paid[last]
      print(cbind(subpayments, attempts))
      next
    }
    #cat('i:', i, 'path before:', path, ' -- ')
    path <- c(path, i)
    sumsize <- sumsize + subpayments$paid[i]
    #cat('path after:', path, 'sumsize:', sumsize, '\n')
    attemptspath[[length(path)]] <- c(unlist(attemptspath[length(path)]) %||% integer(0), i)
    attempts[i] <- TRUE
    #print(attemptspath)
    #print(cbind(subpayments, attempts))
  }

  if (length(path) > 0)
    subpayments$group[path] <- groups$name[g]
}

print(subpayments)

Thank you, this worked on the sample data. Unfortunately, a lot of my Belob-values in FPT8 are not divided equally. So I still have 11.000 missing values. That's a bit much to do manually. Men tusind tak fordi du gad hjælpe! Og tak for æøå-tippet! — Louise Sørensen, Jan 07 '19 at 09:59
For the unmatched cases, you could use an iterative solution similar to what I suggested here (https://stackoverflow.com/a/54038003/882102). But it will require some best guess. Luckily, you probably won't be guessing among 11,000 values at once, but repeatedly among some 1,000 small subgroups, found in `not_matched3`. — MrGumble, Jan 07 '19 at 10:17
I'm not sure I understand the code you provided on that question. Or I don't know how to translate it to my data — Louise Sørensen, Jan 08 '19 at 07:42
I've noticed that a lot of the matches is the sum of all positive FPT8$Belob (combined witht the 3 key columns) and the sum of all negative FPT8$Belob, So I tried to merge with these two sum colummns, but they don't really find the matches that I can spot manually. I can't see where go wrong here. So I aggregated FPT8$Belob > 0 with sum function and then merged that over on my FPT8 data. Then I merged G69 with FPT8 by all 4 columns in a left_join. But it won't match the belob column. (Did the same with belob < 0) — Louise Sørensen, Jan 08 '19 at 07:57
Your best approach is to make a new question with these new observations. My last update might be used to replace step 3 and generically find matching sets, even if you have negative amounts. But you will have to check. — MrGumble, Jan 08 '19 at 08:56

Merge two dataframes by different sum combinations and key columns

1 Answers1