
I'm fairly new to posting topics on Stack Overflow. Usually I can find whatever I need, but in this case I can't seem to find the proper solution. I'm pretty sure it shouldn't be too hard to deal with.

I have large sets of data on which I'd like to perform some functions. I can make it work with loops, but since I'm trying to improve the efficiency and speed of the script, I'm trying a different approach.

Therefore I'd like to use something like dplyr to get me going.

Let's dive into it.

In the table below I have rearranged a small part of the data. Column key_1 is my key combining invoice, date and amount. Rows 1 and 2 are identical, which is great, but sadly row 3 is not recognized as identical, although it should be. I used to loop through the invoice numbers and match them with partial string matching using grepl. Sadly this is quite an intensive job when you have over a million rows. The idea is to mutate the column saldi so that it sums up the column amount properly, meaning the new value would be 1000.

ID  key_1                       original_invoice_number  invoice_number  amount  saldi
1   R176000002022-06-021000     RLZ17600000              RLZ17600000      -1000  -1000
2   R176000002022-06-021000     RLZ17600000              RLZ17600000       1000   2000
3   R17600000a12022-06-021000   RLZ17600000a1            RLZ17600000a1     1000   2000
4   TEST10012022-06-022000      TEST1001                 TEST1001          5000   5000
5   TEST10012022-06-022000      TEST1001a                TEST1001          5000   5000
6   TEST10012022-06-022000      TEST1001b                TEST1001         -5000   5000
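For context, the loop-and-grepl approach I'm trying to move away from looks roughly like this (an illustrative sketch with toy data, not my exact code):

```r
# Sketch of the slow loop approach: for every row, find all rows whose
# invoice number shares the same stem (partial match) and sum their amounts.
dat <- data.frame(
  invoice_number = c("RLZ17600000", "RLZ17600000", "RLZ17600000a1"),
  amount = c(-1000, 1000, 1000)
)

dat$saldi <- NA_real_
for (i in seq_len(nrow(dat))) {
  # partial string match, as with my original grepl approach
  hits <- grepl(paste0("\\b", dat$invoice_number[i]), dat$invoice_number)
  dat$saldi[i] <- sum(dat$amount[hits])
}
dat$saldi
# all three rows end up with saldi 1000
```

This works, but doing a regex scan of the whole column once per row is what makes it so slow on a million rows.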

I made a function that does exactly what I want when I fill in the parameters by hand. But when I try to call this function with dplyr, taking the parameters from the columns after group_by, I can't seem to pass them through properly.

my_function<- function(data, group_invoicenumber, group_amount){
  my_regex <- paste0("\\b",{{group_invoicenumber}})
  temp_df <- data %>%
    filter(str_detect(invoice_number, regex(my_regex, ignore_case = TRUE)) & amount == {{group_amount}}) %>%
    mutate(saldi = sum(amount))
  
  checked_df <- rbind(checked_df, temp_df)
}

checked_df <- data[0,]

data %>%
  group_by(key_1)%>%
  my_function(group_invoicenumber= invoice_number, 
              group_amount= amount)

Every time I run this, the objects invoice_number and amount can't be found. Does anyone know how to pass the parameters through properly?
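For comparison, here is a bare-bones version of the pattern I'm trying to achieve, with placeholder data and simplified logic (no partial matching), where passing column names into a function with `{{ }}` does work because they are used inside data-masking verbs:

```r
library(dplyr)

# Sketch: a function that accepts column names and forwards them into
# group_by()/mutate() via {{ }} (tidy evaluation).
sum_by <- function(data, invoice_col, amount_col) {
  data %>%
    group_by({{ invoice_col }}) %>%
    mutate(saldi = sum({{ amount_col }})) %>%
    ungroup()
}

toy <- data.frame(invoice_number = c("A", "A", "B"),
                  amount = c(-1, 2, 5))
res <- sum_by(toy, invoice_number, amount)
res$saldi
# 1 1 5 — each row gets its group total
```

This toy version only groups on exact invoice numbers, so it doesn't solve the partial-match part of my problem.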

Besides that, I'm quite new to coding, so there might be a much easier solution to fix this.

Thanks in advance anyway!

Edit: In addition to my first post, here is my desired output.

ID  key_1                       original_invoice_number  invoice_number  amount  saldi  relation
1   R176000002022-06-021000     RLZ17600000              RLZ17600000      -1000   1000  RLZ17600000
2   R176000002022-06-021000     RLZ17600000              RLZ17600000       1000   1000  RLZ17600000
3   R17600000a12022-06-021000   RLZ17600000a1            RLZ17600000a1     1000   1000  RLZ17600000
4   TEST10012022-06-022000      TEST1001                 TEST1001          5000   5000  TEST1001
5   TEST10012022-06-022000      TEST1001a                TEST1001          5000   5000  TEST1001
6   TEST10012022-06-022000      TEST1001b                TEST1001         -5000   5000  TEST1001
  • Welcome to SO! What is your expected result? Could you post a table? I wouldn't expect row 3 to be equal as it has the additional "a" in the invoice number. In general, I think you need to use `mutate` where you can apply a function. Also, if you have over a million rows, maybe `data.table` is better suited than `dplyr`. – starja Jun 02 '22 at 11:33
  • Typo? `muate`. Seeing that makes me wonder if you are actually using this code. While I understand, appreciate, and encourage the practice of truncating the code shown in questions to reduce the problem, that doesn't remove the responsibility of actually testing your own code given to us. – r2evans Jun 02 '22 at 13:03
  • FYI, in response to *"I have large sets of data where I'd like to perform some functions"*, I suggest working with lists of frames, using `lapply`. See https://stackoverflow.com/a/24376207/3358227. – r2evans Jun 02 '22 at 13:04
  • Could you use regex to manipulate the values in `key_1`, assign the corrected invoice values in new field e.g. `key_2`, and then use `group_by(key_2) %>% summarize(salidi_total = sum(salidi)) ...`? – Skaqqs Jun 02 '22 at 13:12
  • Thanks for your replies! I think @starja is close to a solution that works for me. I have updated the tables in my post: one with the data and one with the desired output. I have also added a column original_invoice_number, just to add a little more information. I believe rows 1-3 are a match and rows 4-6 are a match. But when I group_by invoice_number I can't get my desired output. Besides that, I think lapply with data.table is a better solution for me, but then I should still get the parameters through. – Steven Smit Jun 03 '22 at 08:49

1 Answer

library(dplyr)

dat %>%
  mutate(key_2 = gsub("[^R0-9-]", "", key_1)) %>%
  mutate(invoice_number_2 = gsub("[^RLZ0-9]", "", invoice_number)) %>%
  group_by(key_2, invoice_number_2) %>%
  summarize(saldi_total = sum(saldi))

result:

  key_2                    invoice_number_2 saldi_total
  <chr>                    <chr>                  <int>
1 R1760000012022-06-021000 RLZ176000001            2000
2 R176000002022-06-021000  RLZ17600000             1000

data:

dat <- data.frame(key_1 = c("R176000002022-06-021000", "R176000002022-06-021000", "R17600000a12022-06-021000"),
                  invoice_number = c("RLZ17600000","RLZ17600000", "RLZ17600000a1"),
                  amount = c(-1000, 1000, 1000),
                  saldi = c(-1000, 2000, 2000))
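If the desired output keeps every original row with its group total rather than collapsing to one row per group, the same idea can use `mutate()` in place of `summarize()` (a sketch, repeating the example data above so it runs on its own):

```r
library(dplyr)

# Same example data as above, redefined so this snippet is self-contained
dat <- data.frame(key_1 = c("R176000002022-06-021000", "R176000002022-06-021000", "R17600000a12022-06-021000"),
                  invoice_number = c("RLZ17600000", "RLZ17600000", "RLZ17600000a1"),
                  amount = c(-1000, 1000, 1000),
                  saldi = c(-1000, 2000, 2000))

# mutate() keeps all rows and attaches the per-group sum to each one
out <- dat %>%
  mutate(key_2 = gsub("[^R0-9-]", "", key_1),
         invoice_number_2 = gsub("[^RLZ0-9]", "", invoice_number)) %>%
  group_by(key_2, invoice_number_2) %>%
  mutate(saldi_total = sum(saldi)) %>%
  ungroup()

out$saldi_total
# 1000 1000 2000 — rows 1-2 form one group, row 3 another
```

Note this still leaves row 3 in its own group, since the cleaned invoice number differs; it only changes the shape of the result.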
Skaqqs
  • Unfortunately gsub like that is not my solution. Since I have highly varying invoice numbers in a large dataset, I just gave one example to make you understand the basic dilemma I'm facing. Besides that, rows 1-3 are supposed to become a group somehow. – Steven Smit Jun 03 '22 at 08:41
  • Is there a (semi) consistent pattern in `key_1` that you can use to match/look for errors? If you can write a series of logical statements based on how you'd group `key_1` by hand, perhaps you could translate that into a programmatic approach? Clearly, I don't know what all of your data look like, so I would only be able to guess at this pattern (like I did in my answer). – Skaqqs Jun 07 '22 at 16:01