Complex conditional data sorting in r

Question

I have a dataset that is 64rX153c of data. Here are the first 4x22:

structure(list(id = c(20230420, 20230420, 2023042110, 2023042110, 
2023042112, 2023042112, 2023042114, 2023042114, 2023042214, 2023042214
), condition = c("control", "control", "control", "control", 
"forced_break", "forced_break", "forced_break", "forced_break", 
"control", "control"), round1_win = c(1, 0, 1, 0, 0, 1, 1, 0, 
0, 1), round1_sound = c(1, 1, 1, 1, 3, 3, 4, 4, 2, 2), ttrs1 = c(34.8679761886597, 
34.8679761886597, 23.8744249343872, 23.8744249343872, 14.2690608501434, 
14.2690608501434, 17.2876904010773, 17.2876904010773, 17.2002062797546, 
17.2002062797546), ttbp1 = c(42.8691244125366, 42.8691244125366, 
27.8899409770966, 27.8899409770966, 17.2830331325531, 17.2830331325531, 
16.2952466011047, 16.2952466011047, 21.2042264938354, 21.2042264938354
), ttbi1 = c(50.7323212623596, 50.7323212623596, 34.1096398830414, 
34.1096398830414, 38.0370643138885, 38.0370643138885, 36.0967524051666, 
36.0967524051666, 29.5176334381103, 29.5176334381103), round2_win = c(0, 
1, 1, 0, 0, 1, 0, 1, 1, 0), round2_sound = c(1, 1, 1, 1, 1, 1, 
1, 1, 3, 3), ttrs2 = c(53.7323212623596, 53.7323212623596, 37.1107151508331, 
37.1107151508331, 38.0380613803863, 38.0380613803863, 36.0967524051666, 
36.0967524051666, 32.5185968875885, 32.5185968875885), ttbp2 = c(57.7340142726898, 
57.7340142726898, 43.1248035430908, 43.1248035430908, 40.04527759552, 
40.04527759552, 40.1092526912689, 40.1092526912689, 34.5237216949463, 
34.5237216949463), ttbi2 = c(69.872288942337, 69.872288942337, 
47.0448129177094, 47.0448129177094, 59.3737871646881, 59.3737871646881, 
61.5298793315888, 61.5298793315888, 45.1512970924377, 45.1512970924377
), round3_win = c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1), round3_sound = c(2, 
2, 1, 1, 8, 8, 8, 8, 2, 2), ttrs3 = c(72.872288942337, 72.872288942337, 
50.0448129177094, 50.0448129177094, 59.3737871646881, 59.3737871646881, 
61.5298793315888, 61.5298793315888, 48.1512970924377, 48.1512970924377
), ttbp3 = c(79.8788452148437, 79.8788452148437, 55.0583190917969, 
55.0583190917969, 65.3748495578766, 65.3748495578766, 65.538156747818, 
65.538156747818, 54.1600811481476, 54.1600811481476), ttbi3 = c(92.00337266922, 
92.00337266922, 59.4620923995972, 59.4620923995972, 85.2011280059815, 
85.2011280059815, 85.9579682350159, 85.9579682350159, 58.6821427345276, 
58.6821427345276), round4_win = c(1, 0, 0, 1, 1, 0, 1, 0, 1, 
0), round4_sound = c(1, 1, 2, 2, 1, 1, 1, 1, 4, 4), ttrs4 = c(95.00337266922, 
95.00337266922, 62.4620923995972, 62.4620923995972, 85.2011280059815, 
85.2011280059815, 85.9579682350159, 85.9579682350159, 61.6821427345276, 
61.6821427345276), ttbp4 = c(101.018859148026, 101.018859148026, 
66.4772782325744, 66.4772782325744, 89.2028863430023, 89.2028863430023, 
91.9701442718506, 91.9701442718506, 65.6861252784729, 65.6861252784729
), ttbi4 = c(105.467720985413, 105.467720985413, 70.9975862503052, 
70.9975862503052, 109.391041994095, 109.391041994095, 111.622535705566, 
111.622535705566, 71.7297728061676, 71.7297728061676)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

This represents data from a game with multiple rounds. I want to plot the mean round_sound of player1 by player2. Every row is one of two people that competed against each other, where the same id means the two were part of the same pair. Round_win is binary, 1 for the participant, 0 for the opponent (hence why the round_win is the reverse from p1 and p2 in a pair).

So, I need to be able to take the mean of all round_sound occurrences where round_win == 1 and add it as a new column in the df. I then need to take the mean of all round_sound occurrences where round_win == 0 and add that to another new column. I have a string of the column number for all the relevant round_sound columns, and another for all the round_win columns. I suspect that will help but I don't know how to put it to use.

Edit for Clarity: I think I should be able to do the above with something like rowMeans(dataset[c(3, 8, 13, 18)] **where** dataset[c(4, 9, 14, 19)] == 1). I expect this to mean "find the mean of [the first set of columns] when [the second set of columns] == 1). I do not know how to code in the "where" part. I'm not sure that helps but I hope it does.

tyia

I'd strongly recommend creating data by columns rather than by rows - as you've got it here all your data is quoted , so nothing is `numeric`, it's all `character` class. An early step should be to convert columns that are dates to `Date` class, columns that should be numeric or integer to those classes, etc. — Gregor Thomas, May 02 '23 at 17:40
After that (or even before, really) you should also consider converting your data [from wide format to long format](https://stackoverflow.com/q/2185252/903061), so you have, e.g. a single `round` column with values 1, 2, 3, 4 and single columns for `ttrs`, `ttbp`, and `ttbpi`. This will make the plotting and subsequent transformations much easier. — Gregor Thomas, May 02 '23 at 17:40
Thanks! I imported my data from a csv so the numbers are already numeric (the above is the only way I know how to make reproducible DFs for this site). Regarding long form, would that basically mean: 1.) remove anything that is not a round number and sound level, 2.) transpose the data such that each row is a round and each column is a participant? — grace.cutler, May 02 '23 at 18:37
Ah, there's much easier ways to make reproducible data. `dput()` is your friend here, `dput(your_data[1:10, ])` will make a copy/pasteable version of the first 10 rows of `your_data`, including all class information. Or if you do want to manually write out data, just write it in columns not rows and you won't need to use quotes, e.g. `data.frame(id = c(20230420, 20230420), condition = c("control", "control"), round1_win = c(1,0))`. — Gregor Thomas, May 02 '23 at 18:56
Huh. I have edited with your suggestion. Is the above more helpful? — grace.cutler, May 02 '23 at 18:59

score 1 · Accepted Answer · answered May 02 '23 at 19:34

I'd strongly recommend reading the tidyr vignette on Tidy Data. It's a great framework for thinking about how to format data for plotting and analysis.

Here's my suggestion for a long data format for this problem. In the new data you shared, it seems like the only difference between row pairs is the wins are reversed - all the tt* columns have identical values, which seems unnecessary to I dropped every other row before pivoting.

library(tidyr) 
library(dplyr)
library(stringr)
df_long = df |>
  group_by(id, condition) |>
  slice(1) |> # de-duplicate 
  ungroup() |>
  rename_with(.fn = \(x) { # rename columns to a consistent pattern
    digit = str_extract(x, "[0-9]+$")
    metric = str_replace(x, "[0-9]+$", "")
    paste0("round", digit, "_", metric)
  }, .cols = matches("[0-9]$")) |>
  pivot_longer(  # convert to long format
    starts_with("round"),
    names_prefix = "round",
    names_to = c("round", ".value"),
    names_sep = "_"
  )

df_long
#          id condition    round   win sound  ttrs  ttbp  ttbi
#         <dbl> <chr>        <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1   20230420 control      1         1     1  34.9  42.9  50.7
#  2   20230420 control      2         0     1  53.7  57.7  69.9
#  3   20230420 control      3         1     2  72.9  79.9  92.0
#  4   20230420 control      4         1     1  95.0 101.  105. 
#  5 2023042110 control      1         1     1  23.9  27.9  34.1
#  6 2023042110 control      2         1     1  37.1  43.1  47.0
#  7 2023042110 control      3         0     1  50.0  55.1  59.5
#  8 2023042110 control      4         0     2  62.5  66.5  71.0
#  9 2023042112 forced_break 1         0     3  14.3  17.3  38.0
# 10 2023042112 forced_break 2         0     1  38.0  40.0  59.4
# 11 2023042112 forced_break 3         1     8  59.4  65.4  85.2
# 12 2023042112 forced_break 4         1     1  85.2  89.2 109. 
# 13 2023042114 forced_break 1         1     4  17.3  16.3  36.1
# 14 2023042114 forced_break 2         0     1  36.1  40.1  61.5
# 15 2023042114 forced_break 3         1     8  61.5  65.5  86.0
# 16 2023042114 forced_break 4         1     1  86.0  92.0 112. 
# 17 2023042214 control      1         0     2  17.2  21.2  29.5
# 18 2023042214 control      2         1     3  32.5  34.5  45.2
# 19 2023042214 control      3         0     2  48.2  54.2  58.7
# 20 2023042214 control      4         1     4  61.7  65.7  71.7

You can then do nice summaries like this:

df_long |>
  group_by(condition, win) |>
  summarize(across(starts_with("tt"), mean))
# # A tibble: 4 × 5
# # Groups:   condition [2]
#   condition      win  ttrs  ttbp  ttbi
#   <chr>        <dbl> <dbl> <dbl> <dbl>
# 1 control          0  46.3  50.9  57.7
# 2 control          1  51.1  56.4  63.7
# 3 forced_break     0  29.5  32.5  53.0
# 4 forced_break     1  61.9  65.7  85.7

This looks fantastic!! I am at a bday party now so I really should not try it but I will first thing in the morning! — grace.cutler, May 02 '23 at 20:57

Complex conditional data sorting in r

1 Answers1