1

I have a dataset and I am trying to find a solution for it using dplyr. My goal is to summarize the values in the columns value and percentage, but only for the value smaller than 10 and add this to a new item name called: "cheap_stuff", while removing the rows with the low values.

My data looks like this:

df <- data.frame(group=c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
                 value=c(1, 23, 15, 5,  3,  45, 7,  21, 4,  8,  26, 30, 3,  9,  37, 68),
                 percentage=c(2.27, 52.27,  34.09,  11.36   ,3.95   ,59.21  ,9.21   ,27.63  ,5.88   ,11.76  ,38.24  ,44.12  ,2.56   ,7.69, 31.62, 58.12),
                 item=c("cheap1","expensive1"   ,"expensive2",  "cheap2",
                 "cheap1",  "expensive1","cheap2","expensive2",
                 "cheap1","cheap2","expensive1","expensive2",
                 "cheap1","cheap2","expensive1","expensive2"))

view(df)
   group value percentage       item
1      A     1       2.27     cheap1
2      A    23      52.27 expensive1
3      A    15      34.09 expensive2
4      A     5      11.36     cheap2
5      B     3       3.95     cheap1
6      B    45      59.21 expensive1
7      B     7       9.21     cheap2
8      B    21      27.63 expensive2
9      C     4       5.88     cheap1
10     C     8      11.76     cheap2
11     C    26      38.24 expensive1
12     C    30      44.12 expensive2
13     D     3       2.56     cheap1
14     D     9       7.69     cheap2
15     D    37      31.62 expensive1
16     D    68      58.12 expensive2

My desired output looks like this:

   group value percentage        item
1      A     6      13.64 cheap_stuff
2      A    23      52.27  expensive1
3      A    15      34.09  expensive2
4      B    10      13.16 cheap_stuff
5      B    45      59.21  expensive1
6      B    21      27.63  expensive2
7      C    12      17.65 cheap_stuff
8      C    26      38.24  expensive1
9      C    30      44.12  expensive2
10     D    12      10.26 cheap_stuff
11     D    37      31.62  expensive1
12     D    68      58.12  expensive2

This post comes in the right direction, Summarize with mathematical conditions in dplyr But, there all values are summed, and a new column is created.

I have tried something like this:

library(dplyr)
df%>%
  group_by(group) %>%
  mutate(item= replace(item, which(value <10),"cheap_stuff")) %>%
  mutate(value = sum(value[value < 10]))

But that fails in the sense that I can not removed the rows that I want, and it write over the rows with expensive values.

# A tibble: 16 × 4
# Groups:   group [4]
   group value percentage item       
   <chr> <dbl>      <dbl> <chr>      
 1 A         6       2.27 cheap_stuff
 2 A         6      52.3  expensive1 
 3 A         6      34.1  expensive2 
 4 A         6      11.4  cheap_stuff
 5 B        10       3.95 cheap_stuff
 6 B        10      59.2  expensive1 
 7 B        10       9.21 cheap_stuff
 8 B        10      27.6  expensive2 
 9 C        12       5.88 cheap_stuff
10 C        12      11.8  cheap_stuff
11 C        12      38.2  expensive1 
12 C        12      44.1  expensive2 
13 D        12       2.56 cheap_stuff
14 D        12       7.69 cheap_stuff
15 D        12      31.6  expensive1 
16 D        12      58.1  expensive2 
Julien
  • 1,613
  • 1
  • 10
  • 26

1 Answers1

2

Using value<10 instead of grepl:

df %>%
  group_by(group,item=case_when(value < 10~"cheap_stuff",
                                T~item)) %>%
  summarise(value=sum(value),
            percentage=sum(percentage))%>%
  ungroup

   group item        value percentage
   <chr> <chr>       <dbl>      <dbl>
 1 A     cheap_stuff     6       13.6
 2 A     expensive1     23       52.3
 3 A     expensive2     15       34.1
 4 B     cheap_stuff    10       13.2
 5 B     expensive1     45       59.2
 6 B     expensive2     21       27.6
 7 C     cheap_stuff    12       17.6
 8 C     expensive1     26       38.2
 9 C     expensive2     30       44.1
10 D     cheap_stuff    12       10.2
11 D     expensive1     37       31.6
12 D     expensive2     68       58.1

Original answer:

df %>%
  group_by(group,item=case_when(grepl("cheap",item,fixed=T)~"cheap_stuff",
                                T~item)) %>%
  summarise(value=sum(value),
            percentage=sum(percentage))

   group item        value percentage
   <chr> <chr>       <dbl>      <dbl>
 1 A     cheap_stuff     6       13.6
 2 A     expensive1     23       52.3
 3 A     expensive2     15       34.1
 4 B     cheap_stuff    10       13.2
 5 B     expensive1     45       59.2
 6 B     expensive2     21       27.6
 7 C     cheap_stuff    12       17.6
 8 C     expensive1     26       38.2
 9 C     expensive2     30       44.1
10 D     cheap_stuff    12       10.2
11 D     expensive1     37       31.6
12 D     expensive2     68       58.1
one
  • 3,121
  • 1
  • 4
  • 24
  • 4
    `stringsAsFactors` has had a default `FALSE` value since `R 4.0.0` released almost 3 years ago. You might want to update. Additionally, OP defined "cheap_stuff" as `value < 10` so you don't need to `grepl()` the categories for a string match. – Ritchie Sacramento Feb 08 '23 at 15:54
  • 1
    Don't forget to `ungroup()`, or add `.groups = 'drop'` into `summarise`. – Darren Tsai Feb 08 '23 at 15:58
  • Yes, that works. is this also possible without using grepl, but by using the condition that the number in my value column is used to rename the item name? In my real dataset, the item can be anything, so grepping is not a real option. (I see that my example data was not correct for answering my question) – Thomas Haverkamp Feb 08 '23 at 16:02
  • can I update questions? – Thomas Haverkamp Feb 08 '23 at 16:03