
I'm fairly new to posting topics on Stack Overflow. Usually I can find whatever I need, but in this case I can't seem to find the proper solution. I'm pretty sure it shouldn't be too hard to deal with.

I have large sets of data on which I'd like to perform some functions. I can make it work with loops, but since I'm trying to improve the efficiency and speed of the script, I'm trying a different approach.

Therefore I'd like to use something like dplyr to get me going.

Let's dive into it.

In the table below I have rearranged a small part of the data. Column key_1 is my key combining invoice, date and amount. Rows 1 and 2 are identical, which is great, but sadly row 3 is not recognized as identical, although it should be. I used to loop through the invoice numbers and match them with partial string matching using grepl. Sadly this is quite an intensive job when you have over a million rows. The idea is to mutate the column saldi so that it sums up the column amount properly, meaning the new value would be 1000.

ID  key_1                       original_invoice_number  invoice_number  amount  saldi
1   R176000002022-06-021000     RLZ17600000              RLZ17600000      -1000  -1000
2   R176000002022-06-021000     RLZ17600000              RLZ17600000       1000   2000
3   R17600000a12022-06-021000   RLZ17600000a1            RLZ17600000a1     1000   2000
4   TEST10012022-06-022000      TEST1001                 TEST1001          5000   5000
5   TEST10012022-06-022000      TEST1001a                TEST1001          5000   5000
6   TEST10012022-06-022000      TEST1001b                TEST1001         -5000   5000
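For context, the loop-and-grepl approach I'm trying to move away from looks roughly like this (an illustrative sketch with toy data, not my exact code):

```r
# Sketch of the slow loop approach: for every row, find all rows whose
# invoice number shares the same stem (partial match) and sum their amounts.
dat <- data.frame(
  invoice_number = c("RLZ17600000", "RLZ17600000", "RLZ17600000a1"),
  amount = c(-1000, 1000, 1000)
)

dat$saldi <- NA_real_
for (i in seq_len(nrow(dat))) {
  # partial string match, as with my original grepl approach
  hits <- grepl(paste0("\\b", dat$invoice_number[i]), dat$invoice_number)
  dat$saldi[i] <- sum(dat$amount[hits])
}
dat$saldi
# all three rows end up with saldi 1000
```

This works, but doing a regex scan of the whole column once per row is what makes it so slow on a million rows.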

I made a function that does exactly what I want when I fill in the parameters by hand. But when I try to call this function with dplyr, taking the parameters from the columns after group_by, I can't seem to pass them through properly.

my_function<- function(data, group_invoicenumber, group_amount){
  my_regex <- paste0("\\b",{{group_invoicenumber}})
  temp_df <- data %>%
    filter(str_detect(invoice_number, regex(my_regex, ignore_case = TRUE)) & amount == {{group_amount}}) %>%
    mutate(saldi = sum(amount))
  
  checked_df <- rbind(checked_df, temp_df)
}

checked_df <- data[0,]

data %>%
  group_by(key_1)%>%
  my_function(group_invoicenumber= invoice_number, 
              group_amount= amount)

Every time I run this, the objects invoice_number and amount can't be found. Does anyone know how to pass the parameters through properly?
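For comparison, here is a bare-bones version of the pattern I'm trying to achieve, with placeholder data and simplified logic (no partial matching), where passing column names into a function with `{{ }}` does work because they are used inside data-masking verbs:

```r
library(dplyr)

# Sketch: a function that accepts column names and forwards them into
# group_by()/mutate() via {{ }} (tidy evaluation).
sum_by <- function(data, invoice_col, amount_col) {
  data %>%
    group_by({{ invoice_col }}) %>%
    mutate(saldi = sum({{ amount_col }})) %>%
    ungroup()
}

toy <- data.frame(invoice_number = c("A", "A", "B"),
                  amount = c(-1, 2, 5))
res <- sum_by(toy, invoice_number, amount)
res$saldi
# 1 1 5 — each row gets its group total
```

This toy version only groups on exact invoice numbers, so it doesn't solve the partial-match part of my problem.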

Besides that, I'm quite new to coding, so there might be a much easier solution to fix this.

Thanks in advance anyway!

Edit: In addition to my first post, here is my desired output.

ID  key_1                       original_invoice_number  invoice_number  amount  saldi  relation
1   R176000002022-06-021000     RLZ17600000              RLZ17600000      -1000   1000  RLZ17600000
2   R176000002022-06-021000     RLZ17600000              RLZ17600000       1000   1000  RLZ17600000
3   R17600000a12022-06-021000   RLZ17600000a1            RLZ17600000a1     1000   1000  RLZ17600000
4   TEST10012022-06-022000      TEST1001                 TEST1001          5000   5000  TEST1001
5   TEST10012022-06-022000      TEST1001a                TEST1001          5000   5000  TEST1001
6   TEST10012022-06-022000      TEST1001b                TEST1001         -5000   5000  TEST1001
  • Welcome to SO! What is your expected result? Could you post a table? I wouldn't expect row 3 to be equal as it has the additional "a" in the invoice number. In general, I think you need to use `mutate` where you can apply a function. Also, if you have over a million rows, maybe `data.table` is better suited than `dplyr`. – starja Jun 02 '22 at 11:33
  • Typo? `muate`. Seeing that makes me wonder if you are actually using this code. While I understand, appreciate, and encourage the practice of truncating the code shown in questions to reduce the problem, that doesn't remove the responsibility of actually testing your own code given to us. – r2evans Jun 02 '22 at 13:03
  • FYI, in response to *"I have large sets of data where I'd like to perform some functions"*, I suggest working with lists of frames, using `lapply`. See https://stackoverflow.com/a/24376207/3358227. – r2evans Jun 02 '22 at 13:04
  • Could you use regex to manipulate the values in `key_1`, assign the corrected invoice values in new field e.g. `key_2`, and then use `group_by(key_2) %>% summarize(salidi_total = sum(salidi)) ...`? – Skaqqs Jun 02 '22 at 13:12
  • Thanks for your replies! I think @starja is close to a solution that works for me. I have updated the tables in my post: one with the data and one with the desired output. I have also added a column original_invoice_number, just to add a little more information. I believe rows 1-3 are a match and rows 4-6 are a match. But when I group_by invoice_number I can't get my desired output. Besides that, I think lapply with data.table is a better solution for me, but then I should still get the parameters through. – Steven Smit Jun 03 '22 at 08:49

1 Answer

library(dplyr)

dat %>%
  mutate(key_2 = gsub("[^R0-9-]", "", key_1)) %>%
  mutate(invoice_number_2 = gsub("[^RLZ0-9]", "", invoice_number)) %>%
  group_by(key_2, invoice_number_2) %>%
  summarize(saldi_total = sum(saldi))

result:

  key_2                    invoice_number_2 saldi_total
  <chr>                    <chr>                  <int>
1 R1760000012022-06-021000 RLZ176000001            2000
2 R176000002022-06-021000  RLZ17600000             1000

data:

dat <- data.frame(key_1 = c("R176000002022-06-021000", "R176000002022-06-021000", "R17600000a12022-06-021000"),
                  invoice_number = c("RLZ17600000","RLZ17600000", "RLZ17600000a1"),
                  amount = c(-1000, 1000, 1000),
                  saldi = c(-1000, 2000, 2000))
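If the desired output keeps every original row with its group total rather than collapsing to one row per group, the same idea can use `mutate()` in place of `summarize()` (a sketch, repeating the example data above so it runs on its own):

```r
library(dplyr)

# Same example data as above, redefined so this snippet is self-contained
dat <- data.frame(key_1 = c("R176000002022-06-021000", "R176000002022-06-021000", "R17600000a12022-06-021000"),
                  invoice_number = c("RLZ17600000", "RLZ17600000", "RLZ17600000a1"),
                  amount = c(-1000, 1000, 1000),
                  saldi = c(-1000, 2000, 2000))

# mutate() keeps all rows and attaches the per-group sum to each one
out <- dat %>%
  mutate(key_2 = gsub("[^R0-9-]", "", key_1),
         invoice_number_2 = gsub("[^RLZ0-9]", "", invoice_number)) %>%
  group_by(key_2, invoice_number_2) %>%
  mutate(saldi_total = sum(saldi)) %>%
  ungroup()

out$saldi_total
# 1000 1000 2000 — rows 1-2 form one group, row 3 another
```

Note this still leaves row 3 in its own group, since the cleaned invoice number differs; it only changes the shape of the result.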
Skaqqs
  • Unfortunately gsub like that is not my solution. Since I have highly varying invoice numbers in a large dataset, I just gave one example to make you understand the basic dilemma I'm facing. Besides that, rows 1-3 are supposed to become a group somehow. – Steven Smit Jun 03 '22 at 08:41
  • Is there a (semi) consistent pattern in `key_1` that you can use to match/look for errors? If you can write a series of logical statements based on how you'd group `key_1` by hand, perhaps you could translate that into a programmatic approach? Clearly, I don't know what all of your data look like, so I would only be able to guess at this pattern (like I did in my answer). – Skaqqs Jun 07 '22 at 16:01