Summarize multiple columns with strings of values in a table

Question

I have a dataframe such as this, where most columns contain strings of values; the values in columns A_aoi, B_aoi, and C_aoi denote gaze directions (A, B, and C to speakers, * nowhere/elsewhere); the values in columns A_aoi_dur, B_aoi_dur, and C_aoi_dur denote the durations of these gazes:

df
# A tibble: 5 x 7
  speaker A_aoi     A_aoi_dur                            B_aoi B_aoi_dur    C_aoi C_aoi_dur           
  <chr>   <chr>     <chr>                                <chr> <chr>        <chr> <chr>               
1 ID01.B  B*B*B     494,251,416,217,35                   A*    153,1260     A     1413                
2 ID01.A  *B*C*C    445,412,116,533,600,153              A     2259         A*A*A 379,123,1300,144,313
3 ID01.A  B*B*B     1098,249,168,184,526                 A*A   1090,313,822 A*A   817,626,782         
4 ID01.C  C*C*B*    1794,1561,158,208,125,63             C*    2735,1174    *A    152,3757            
5 ID01.B  B*B*C*C*B 1585,1068,249,51,998,352,1016,66,425 *     5810         *B*B* 835,173,3827,661,314

For each speaker (identifiable by the suffixes A, B, and C in column speaker) I want to compute the summed durations and proportions of their gazes directions. The table I want to obtain is this:

Expected result:

  A_aoi Total     Prop B_aoi Total     Prop C_aoi Total      Prop
1     *  5431 34.77843     *  8557 54.79636     *  6021 38.556609
2     B  5533 35.43161     A  4324 27.68955     A  8761 56.102715
3     C  4652 29.78996     C  2735 17.51409     B   834  5.340676

My feeling is that this is best done by converting the dataframe to a long format. So using separate_rows and constructing many intermediate dataframes for each speaker's gazes directions and gaze durations I'v ended up with this convoluted code -- it does what it's supposed to do. But I'm pretty sure there's a more economical and more elegant way!

What would that be? Help is much appreciated!

library(dplyr)
library(tidyr)

### A:
a_dur <- df %>% 
  separate_rows(A_aoi_dur, sep = ",") %>%
  select(A_aoi_dur)
a_aoi <- df %>% 
  separate_rows(A_aoi, sep = "") %>%
  select(A_aoi) %>%
  filter(!A_aoi == "")
A <- cbind(a_dur, a_aoi)

# get grouped total durations and proportions: 
A_stat <- A %>%
  group_by(A_aoi) %>%
  summarise(Total = sum(as.numeric(A_aoi_dur))) %>%
  mutate(Prop = Total/sum(Total)*100)

### B:
b_dur <- df %>% 
  separate_rows(B_aoi_dur, sep = ",") %>%
  select(B_aoi_dur)
b_aoi <- df %>% 
  separate_rows(B_aoi, sep = "") %>%
  select(B_aoi) %>%
  filter(!B_aoi == "")
B <- cbind(b_dur, b_aoi)

# get grouped total durations and proportions: 
B_stat <- B %>%
  group_by(B_aoi) %>%
  summarise(Total = sum(as.numeric(B_aoi_dur))) %>%
  mutate(Prop = Total/sum(Total)*100)


### C:
c_dur <- df %>% 
  separate_rows(C_aoi_dur, sep = ",") %>%
  select(C_aoi_dur)
c_aoi <- df %>% 
  separate_rows(C_aoi, sep = "") %>%
  select(C_aoi) %>%
  filter(!C_aoi == "")
C <- cbind(c_dur, c_aoi)

# get grouped total durations and proportions: 
C_stat <- C %>%
  group_by(C_aoi) %>%
  summarise(Total = sum(as.numeric(C_aoi_dur))) %>%
  mutate(Prop = Total/sum(Total)*100)

# get final table:
cbind(A_stat, B_stat, C_stat)

Reproducible data:

df <- structure(list(speaker = c("ID01.B", "ID01.A", "ID01.A", "ID01.C", 
                                     "ID01.B"), A_aoi = c("B*B*B", "*B*C*C", "B*B*B", "C*C*B*", "B*B*C*C*B"
                                     ), A_aoi_dur = c("494,251,416,217,35", "445,412,116,533,600,153", 
                                                      "1098,249,168,184,526", "1794,1561,158,208,125,63", "1585,1068,249,51,998,352,1016,66,425"
                                     ), B_aoi = c("A*", "A", "A*A", "C*", "*"), B_aoi_dur = c("153,1260", 
                                                                                              "2259", "1090,313,822", "2735,1174", "5810"), C_aoi = c("A", 
                                                                                                                                                      "A*A*A", "A*A", "*A", "*B*B*"), C_aoi_dur = c("1413", "379,123,1300,144,313", 
                                                                                                                                                                                                    "817,626,782", "152,3757", "835,173,3827,661,314")), row.names = c(NA, 
                                                                                                                                                                                                                                                                       -5L), class = c("tbl_df", "tbl", "data.frame"))

That's correct. What I mean is that the columns contain listed values — Chris Ruehlemann, Mar 17 '21 at 15:38
To satisfy your curiosity: the gazes and gaze durations in one row all occur during a single speech event such as an utterance in conversation. In the actual dataframe, the speech events are all tidyly ordered temporally, one occurring after the other. But the events are mostly longer than the associated gazes (because ppl alternate gazes within one and the same utterance). Any good? — Chris Ruehlemann, Mar 17 '21 at 15:59
Another possibility, in the right direction: `library(data.table)`; `setDT(df)`; `d = melt(df, measure = patterns("aoi$", "aoi_dur$"), value.name = c("aoi", "dur"))`; `d2 = splitstackshape::cSplit(d, c("aoi", "dur"), c("", ","), stripWhite = FALSE, direction = "long")`; `dcast(d2, aoi ~ variable, sum)`. I think the numeric representation of 'variable' following `melt` is related to [this issue](https://github.com/Rdatatable/data.table/issues/3396). — Henrik, Mar 17 '21 at 18:04
That's pretty cool (and very concise!) you should post this! — Chris Ruehlemann, Mar 17 '21 at 18:09
Feel free to write up an answer if you want! It's basically 3-4 FAQs in a row: [Reshaping multiple sets of measurement columns (wide format) into single columns (long format)](https://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns); [Split comma-separated strings in a column into separate rows](https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows); [How to sum a variable by group](https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group) Cheers — Henrik, Mar 17 '21 at 18:17

Baraliuh · Accepted Answer · 2021-03-17T16:38:24.137

One way of doing it (avoiding duplicated column names though):

library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
map_columns <- function(aoi, dur){
  tibble(
    speaker = aoi,
    duration = as.integer(dur)
  )
}

df %>% 
select(-1) %>% #This column seems irrelevant
mutate(
    A_aoi = str_split(A_aoi, ''),
    B_aoi = str_split(B_aoi, ''),
    C_aoi = str_split(C_aoi, ''),
    A_aoi_dur = str_split(A_aoi_dur, ','),
    B_aoi_dur = str_split(B_aoi_dur, ','),
    C_aoi_dur = str_split(C_aoi_dur, ','),
    A_aoi = map2(A_aoi, A_aoi_dur, map_columns),
    B_aoi = map2(B_aoi, B_aoi_dur, map_columns),
    C_aoi = map2(C_aoi, C_aoi_dur, map_columns),
) %>% 
select(1, 3, 5) %>% 
gather() %>% 
unnest(cols = value) %>% 
group_by(key, speaker) %>% 
summarise(
    total = sum(duration)
) %>% 
mutate(
    prop = total/sum(total)*100
) %>% 
ungroup() %>% 
nest(data = -key) %>% 
spread(key, data) %>% 
unnest(cols = c(A_aoi, B_aoi, C_aoi), names_repair = ~paste0(., '_', rep(LETTERS[1:3], each = 3)))

Output:

# A tibble: 3 x 9
  speaker_A total_A prop_A speaker_B total_B prop_B speaker_C total_C prop_C
  <chr>       <int>  <dbl> <chr>       <int>  <dbl> <chr>       <int>  <dbl>
1 *            5431   34.8 *            8557   54.8 *            6021  38.6 
2 B            5533   35.4 A            4324   27.7 A            8761  56.1 
3 C            4652   29.8 C            2735   17.5 B             834   5.34

I can not tell what is going on just from the error. I am guessing that you are missing some measurements. You can check if length(A_aoi) == length(A_aoi_dur) after the string split for example. — Baraliuh, Mar 17 '21 at 16:44
I've noticed that after ` mutate(prop = total/sum(total)*100)`there's already the data I want (in a different format but that is irrelevant). So thanks a lot! — Chris Ruehlemann, Mar 17 '21 at 17:10
I've managed to write the first six `mutate`commands somewhat more succinctly: `across(ends_with('aoi'), ~ str_split(., '')), across(ends_with('dur'), ~ str_split(., ','))` but I failed in doing the same for the last three `mutate` operations. Would you know how to do it? — Chris Ruehlemann, Mar 17 '21 at 17:26
Ah, I see, I was not aware of the `across` command. Glad it worked out! — Baraliuh, Mar 19 '21 at 14:27

score 1 · Answer 2 · answered Mar 17 '21 at 16:26

Here is a shot still need to sort the column a bit at the end but I think it is a tidy version compare with your code though the output is a bit different as it have all the aoi in one columns instead of have 3 columns differently as yours.

library(dplyr)
library(tidyr)
library(purrr)
# Using group_split to separate duration & attention group
split_df <- df %>%
  pivot_longer(cols = contains("aoi"), names_to = "aoi",
    values_to = "aoi_values") %>%
  mutate(aoi_names = if_else(grepl("dur", aoi), "duration", "aoi")) %>%
  group_split(aoi_names)
# For each group apply the same logics you do then combined them together
tidy_df <- bind_cols(split_df[[1]] %>%
    separate_rows(aoi_values, sep = "") %>%
    filter(aoi_values != "") %>%
    select(speaker, aoi, aoi_values),
  split_df[[2]] %>%
    separate_rows(aoi_values, sep = ",") %>%
    mutate(aoi = gsub("_dur", "", aoi)) %>%
    select(duration = aoi_values))
# Finally calculate and pivot wider to have your desire output
tidy_df %>%
  group_by(aoi, aoi_values) %>%
  summarize(total_duration = sum(as.numeric(duration)),
    .groups = "drop") %>%
  group_by(aoi) %>%
  mutate(prop = total_duration / sum(total_duration) * 100) %>%
  pivot_wider(id_cols = aoi_values, names_from = aoi,
    names_glue = "{aoi}_{.value}",
    values_fill = 0,
    values_from = c(total_duration, prop)) %>%
  select(aoi_values, sort(names(.)))

Output

# A tibble: 4 x 7
  aoi_values A_aoi_prop A_aoi_total_duration B_aoi_prop B_aoi_total_duration C_aoi_prop C_aoi_total_duration
  <chr>           <dbl>                <dbl>      <dbl>                <dbl>      <dbl>                <dbl>
1 *                34.8                 5431       54.8                 8557      38.6                  6021
2 B                35.4                 5533        0                      0       5.34                  834
3 C                29.8                 4652       17.5                 2735       0                       0
4 A                 0                      0       27.7                 4324      56.1                  8761

A very nice solution, thank you. Too bad one isn't allowed to accept more than one answer! — Chris Ruehlemann, Mar 17 '21 at 17:57

Summarize multiple columns with strings of values in a table

2 Answers2