Formula based on row value applied across columns with a group by

Question

I'm editing this question as I think this will substantially explain my issue and attempt at what I'm trying to achieve better. I wrote the following to a friend who recommended I go back to the experts here!

I've combined three data sets into one and then I've cleaned up the names through mutate and gotten myself a nice clean data to work with.

From that data, I've been able to select a subset of columns, group by two fields, and then run a mean / sd across all columns using:

GroupedMeans <- cleanuxq2 %>% 
  dplyr:: select(starts_with("X"),("list"),("urespid_0") ("segment")) %>%
  group_by (list, segment)%>%
  summarize(across(
    .cols = is.numeric,
    .fns = list(Mean = mean, SD = sd),
    #    .names = "{col}_{fn}"
  ))

I've been able to go one step further and even create topbox / middlebox / bottom box scores within a function and pass it through to a copy of the code above:

#this function calculates the top/middle/bottom boxes and is then applied across all columns -----
myfunc <- list(
  topbox = ~sum(. > 5)/n(),
  middlebox = ~sum((. ==3)+(. ==4)+(. ==5))/n(),
  bottombox = ~sum(. < 3)/n(),
  X_n = ~n())

#this creates the boxes using the function above --------  
UXQ_Boxes <- cleanuxq2 %>% 
  dplyr:: select(starts_with("X"),("list"),("urespid_0"),("segment")) %>%
  group_by (segment, list)%>%
  summarize(across(
    .cols = is.numeric,
    .fns = myfunc
#    .names = "{col}_{fn}"
  ))

The next step was to translate the text characters into doubles, so I've done this:

#this translates char to double and move into new frame------
cleanuxq3 <- cleanuxq2 %>%
  mutate_at(vars(ends_with("effectiveness")),
                 ~as.double(recode(.,
                                    "Success" = 0,
                                    "Timeout" = 1,
                                    "Abandon" = 2)))%>%
  mutate_at(vars(ends_with("pass_fail")),
            ~as.double(recode(.,
                               "V"=0,
                               "D"=1,
                               "P"=3,
                               "N"=4)))%>%
  mutate_at(vars(ends_with("exp_difficulty")),
            ~as.double(recode(.,
                               "Yes"=0,
                               "No"=1)))

This is where things go crappy for me Here's an example though:

UXQ_Tasks1 <- cleanuxq3 %>% 
  dplyr:: select(("list"),("urespid_0"),("segment"),starts_with("t")) %>%
  group_by (list,segment)%>%
  summarize(
    seconds = (mean(cleanuxq3$t1_time_task))/60,
    UniquePages = (mean(cleanuxq3$t1_unique_pageviews))
    #    .names = "{col}_{fn}"
  )

There are two issues with this, one is it provides inaccurate means when I have both segments in there, it appears to be doing the mean for all the data, not by 'list' and then 'segment'.

The second problem, which if you tell me to just manually do it, I will, is I have repeating columns that increase sequentially; e.g. t1_effectiveness t2_effectiveness t3_effectiveness

What I would ideally like to do is:

be able to group by 'list'
subgroup by 'segment' and then 3 if possible subgroup by the field 't1_pass_fail' (note the t1 ask below)
because I have 12 fields for each 't1' and upto 't10', I'd like to either to a loop so that I can apply the same formula to each variable looping through t1 - t10: if ends_with("effectiveness") sum(.)/n() if ends_with ("satisfied") mean(.)/7 etc I realise 4 is an ask, but I just assume you're far more brilliant than I :)

Here is a version of my datafile

You want to apply sum on *likability column based on value of effectiveness? — zx8754, May 20 '21 at 09:39
It is much easier if you convert your data to long format, see [wide-to-long](https://stackoverflow.com/q/2185252/680068) — zx8754, May 20 '21 at 09:40
Images are not the right way to share data/code. Add them in a reproducible format which is easier to copy. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, May 20 '21 at 14:04
@ronakshah sorry for the image, I tried to copy and paste my DF and it just kept going in as an image. After about 10 tries I just gave up. — mdutton27, May 20 '21 at 20:55

score 0 · Answer 1 · answered May 20 '21 at 10:37

This is a very tentative answer (as you have not provided the expected output and not (yet) answered specific queries in comments). As noted in a comment, this task is best solved by changing the format to long:

library(tidyr)
library(dplyr)
df %>%
  pivot_longer(-gender,
               names_to = "effectiveness", 
               names_pattern = "(t\\d+).*",
               values_to = "value") %>%
  group_by(gender, effectiveness) %>%
  summarise(score = ifelse(value == 0, sum(value)*5, sum(value)*10))
# A tibble: 15 x 3
# Groups:   gender, effectiveness [6]
   gender effectiveness score
   <chr>  <chr>         <dbl>
 1 f      t1               20
 2 f      t1               20
 3 f      t1               10
 4 f      t2               10
 5 f      t2               20
 6 f      t2               20
 7 f      t3               20
 8 f      t3               10
 9 f      t3               10
10 m      t1               10
11 m      t1               20
12 m      t2               40
13 m      t2               40
14 m      t3               30
15 m      t3               30

Data:

df <- data.frame(
  gender = c("f", "m", "m", "f", "f"),
  t1_effectiveness = c(1,0,2,1,0),
  t2_effectiveness = c(0,2,2,1,1),
  t3_effectiveness = c(2,2,1,0,0)
)

Thanks Chris I will try this as soon as I am at the computer. I haven’t answered anyone as I was sleeping, but I appreciate the possible answer. — mdutton27, May 20 '21 at 20:05

score 0 · Answer 2 · answered May 20 '21 at 10:40

Assuming you want to compute the sum of the likeability columns based on the value of the effectiveness column by gender, you could use the following combination of tidyr::pivot_longer() and dplyr::case_when().

First, reshape into long format, "stacking" each combination of effectiveness and likeability. My approach here uses regex groups and is based on this post from the RStudio Community.

After reshaping to long format, the df looks like this:

library(tibble)
library(dplyr)
library(tidyr)

df <- tibble(gender = rep(c("M", "F"), 3),
             t1_effectiveness = c(0, 1, 2, 0, 1, 2),
             t1_likeability = c(5, 6, 2, 4, 7, 2),
             t2_effectiveness = c(2, 1, 0, 0, 2, 1),
             t2_likeability = c(5, 6, 2, 4, 7, 2),
             t3_effectiveness = c(2, 1, 3, 1, 1, 3),
             t3_likeability = c(5, 6, 2, 4, 7, 2))


df_long <- df %>% 
  tidyr::pivot_longer(!gender,
                      names_to = c("set", ".value"),
                      names_pattern = "(.*)_(.*)")

df_long
#> # A tibble: 18 x 4
#>    gender set   effectiveness likeability
#>    <chr>  <chr>         <dbl>       <dbl>
#>  1 M      t1                0           5
#>  2 M      t2                2           5
#>  3 M      t3                2           5
#>  4 F      t1                1           6
#>  5 F      t2                1           6
#>  6 F      t3                1           6
#>  7 M      t1                2           2
#>  8 M      t2                0           2
#>  9 M      t3                3           2
#> 10 F      t1                0           4
#> 11 F      t2                0           4
#> 12 F      t3                1           4
#> 13 M      t1                1           7
#> 14 M      t2                2           7
#> 15 M      t3                1           7
#> 16 F      t1                2           2
#> 17 F      t2                1           2
#> 18 F      t3                3           2

Now we just need to group by gender and run computations conditional on the value of effectiveness, for which I used dplyr::case_when() here. You can modify or add further conditions as needed:

df_long %>% 
  dplyr::group_by(gender) %>% 
  dplyr::mutate(sum = dplyr::case_when(
    
    effectiveness == 0 ~ sum(likeability) * 5,
    effectiveness == 1 ~ sum(likeability) * 10
    # add further conditions as needed
    
  )) %>% 
  
  dplyr::ungroup()
#> # A tibble: 18 x 5
#>    gender set   effectiveness likeability   sum
#>    <chr>  <chr>         <dbl>       <dbl> <dbl>
#>  1 M      t1                0           5   210
#>  2 M      t2                2           5    NA
#>  3 M      t3                2           5    NA
#>  4 F      t1                1           6   360
#>  5 F      t2                1           6   360
#>  6 F      t3                1           6   360
#>  7 M      t1                2           2    NA
#>  8 M      t2                0           2   210
#>  9 M      t3                3           2    NA
#> 10 F      t1                0           4   180
#> 11 F      t2                0           4   180
#> 12 F      t3                1           4   360
#> 13 M      t1                1           7   420
#> 14 M      t2                2           7    NA
#> 15 M      t3                1           7   420
#> 16 F      t1                2           2    NA
#> 17 F      t2                1           2   360
#> 18 F      t3                3           2    NA

^{Created on 2021-05-20 by the reprex package (v2.0.0)}

Thank you I will try this and see if it works. – mdutton27 May 20 '21 at 20:06 — mdutton27, May 20 '21 at 20:06

Formula based on row value applied across columns with a group by

2 Answers2

Linked