
I have a data frame containing only one row with named columns. The data frame looks somewhat like this:

  poms_tat1 poms_tat2 poms_tat3      tens1      tens2      tens3 ...
1 0.3708821 0.4915922 0.3958195 -0.1139606 -0.1462545 -0.4411494 ...

I need to calculate the mean of all the columns with similar names. The result should look somewhat like this:


  poms_tat    tens      ...
1 0.4194551  -0.2337881667 ...

My first approach was to use a for loop with a nested while loop to find the indices of the relevant columns and then take the mean of those, but unfortunately I couldn't make it work.

I also found this Stack Overflow post, which seemed promising, but the agrep function matches columns in my data frame that should not be matched, and I wasn't able to fix that with the max.distance parameter. For example, when searching for "reat1-3" it also matches "threat1-3". I know those variable names are terrible, but unfortunately that's what I have to work with. What makes this even more complicated is that the number of columns in each category isn't always 3.
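A minimal illustration of that over-matching, using a small assumed subset of my column names: because agrep looks for the pattern anywhere inside each string (and additionally allows approximate matches), "reat" is found inside "threat1" even with exact matching.

nms <- c("threat1", "threat2", "threat3", "reat1", "reat2", "reat3")
# "reat" is a literal substring of "threat", so every name matches,
# even with max.distance = 0:
agrep("reat", nms, value = TRUE, max.distance = 0)
# [1] "threat1" "threat2" "threat3" "reat1"   "reat2"   "reat3"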

I hope I was able to articulate my problem well enough. Thank you.

Edit: Here is a reproducible piece of data:

structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328, 
    poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638, 
    tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1, 
    isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949, 
    isi4 = 0.143940321201716, sleepqual = 0.378902118888194, 
    se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254, 
    challenge1 = 0.417855447018672, challenge2 = 0.393610946830482, 
    challenge3 = 0.417855447018672, threat1 = -0.13014390184863, 
    threat2 = -0.34027852368936, threat3 = -0.269679944985297, 
    reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091, 
    reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761, 
    selfman2 = 0.40609787800914, selfman3 = 0.418121005003545, 
    selfman4 = 0.467099366496914, selfman5 = 0.205356548067582, 
    selfman6 = 0.464385939554693, selfman7 = 0.379071252751718, 
    eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
Leo

4 Answers

We could use split.default to split the data frame into a list based on the substring of the column names (the prefix before the trailing digits), then loop over that list with sapply and take the rowMeans, all in base R:

sapply(split.default(df1, sub("\\d+$", "", names(df1))), rowMeans, na.rm = TRUE)
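This returns a named numeric vector. If the one-row data-frame shape from the question is preferred, the result can be wrapped as below (a sketch, assuming the question's data is stored in df1):

res <- sapply(split.default(df1, sub("\\d+$", "", names(df1))), rowMeans, na.rm = TRUE)
# Turn the named vector back into a single-row data frame, with the
# name prefixes as column names:
as.data.frame(t(res))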
akrun

Here's one with tidyr. I only just saw the acceptance of Baraliuh's answer, so my answer here is more for the sake of closure.

library(tidyr)

my_summary <- as.data.frame(sapply(
  X = pivot_longer(
    data = df,
    # Desired columns (all) to summarize.
    cols = everything(),
    # Take each group of columns, which share a common name before different numeric
    # suffixes, and pivot them into multiple rows under a common column by that name.
    names_to = c(
      ".value",
      # Discard anything after the prefix.
      NA
    ),
    # Identify the (optional) numeric suffix.
    names_sep = "\\d*$"
  ),
  # Take the mean of each column; ignore missing values.
  FUN = mean, na.rm = TRUE,
  # Keep as a list, to convert into a data.frame.
  simplify = FALSE
))

In contrast to some alternatives, I do believe my consolidated use of pivoting makes the process cleaner, and the output does achieve the precise format you desire.

   poms_tat       tens bat_ratio       isi sleepqual        se challenge     threat      reat   selfman       eli
1 0.4194313 -0.2337882         1 0.3781875 0.3789021 0.3315154 0.4097739 -0.2467008 0.5229015 0.3899116 0.1722602

With dplyr involved too, the summarization (into a tibble) is even cleaner:

library(tidyr)
library(dplyr)

my_summary <- df %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", NA),
    names_sep = "\\d*$"
  ) %>%
  summarize(across(everything(), mean, na.rm = TRUE))
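A small compatibility note: as of dplyr 1.1.0, passing extra arguments such as na.rm = TRUE through across() is deprecated, so on newer versions the same summary can be written with an anonymous function (a sketch, otherwise identical to the pipeline above):

library(tidyr)
library(dplyr)

my_summary <- df %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", NA),
    names_sep = "\\d*$"
  ) %>%
  # Anonymous-function form of across(), preferred on dplyr >= 1.1.0:
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))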
Greg

You could do this either with tidyr::pivot_longer, dplyr::mutate, stringr::str_remove, dplyr::group_by, and dplyr::summarise, like this:

ex_data <- structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328, 
               poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638, 
               tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1, 
               isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949, 
               isi4 = 0.143940321201716, sleepqual = 0.378902118888194, 
               se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254, 
               challenge1 = 0.417855447018672, challenge2 = 0.393610946830482, 
               challenge3 = 0.417855447018672, threat1 = -0.13014390184863, 
               threat2 = -0.34027852368936, threat3 = -0.269679944985297, 
               reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091, 
               reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761, 
               selfman2 = 0.40609787800914, selfman3 = 0.418121005003545, 
               selfman4 = 0.467099366496914, selfman5 = 0.205356548067582, 
               selfman6 = 0.464385939554693, selfman7 = 0.379071252751718, 
               eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
ex_data %>% 
    tidyr::pivot_longer(everything()) %>% 
    dplyr::mutate(
        name = stringr::str_remove(name, '[0-9]$')
    ) %>% 
    dplyr::group_by(name) %>% 
    dplyr::summarise(
        mean = mean(value)
    )
# A tibble: 11 x 2
   name        mean
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247

Alternatively, you could use split.default, together with stringr::str_remove, purrr::map, unlist, purrr::map_dbl, and tibble::enframe as follows:

ex_data %>% 
    split.default(stringr::str_remove(names(.), '[0-9]$')) %>% 
    purrr::map(unlist) %>% 
    purrr::map_dbl(mean) %>% 
    tibble::enframe()
# A tibble: 11 x 2
   name       value
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247
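If the one-row wide format from the question is wanted back, either pipeline can be finished with tidyr::pivot_wider (also mentioned in the comments below); a sketch based on the second pipeline:

ex_data %>% 
    split.default(stringr::str_remove(names(.), '[0-9]$')) %>% 
    purrr::map(unlist) %>% 
    purrr::map_dbl(mean) %>% 
    tibble::enframe() %>% 
    # Spread the name/value pairs back out into a single wide row:
    tidyr::pivot_wider(names_from = name, values_from = value)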
Baraliuh
  • Thank you, this is exactly what I was looking for. It even keeps the column names. – Leo Jun 23 '21 at 17:01
  • Yeah I noticed, but my solutions are slightly more verbose. The OP wrote: "I have a data frame containing only one row with named columns.", so I assumed there would only be one row. Else, the `unlist` approach would indeed generate one mean for all rows and columns per split. – Baraliuh Jun 23 '21 at 17:01
  • Ok, that's great! If you want the data wide again you could pipe these results to `tidyr::pivot_wider`. – Baraliuh Jun 23 '21 at 17:03
  • I'm terribly sorry, I didn't realize that my solution was very similar to yours, otherwise I would not have posted. – Anoushiravan R Jun 23 '21 at 17:15
  • I am actually quite happy you did, as I just learned about the use of string manipulation when defining groups in `group_by`. – Baraliuh Jun 23 '21 at 17:19
  • That's very kind of you. I normally take a glance at what has already been posted but don't pay close attention to it, as it will kill off any creativity and point the mind in a certain direction. Yes, you can also do more expensive manipulations in `group_by`. – Anoushiravan R Jun 23 '21 at 17:23

You can also use the following solution:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(everything()) %>% 
  group_by(grp = sub("\\d+$", "", name)) %>%
  summarise(Avg = mean(value, na.rm = TRUE))

   grp          Avg
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247
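As raised in the comments below, the anchored regex matters here: sub("\\d+$", "", name) strips only a trailing run of digits, while the unanchored sub("\\d+", "", name) removes the first run of digits wherever it appears. A small illustration with hypothetical names:

nm <- c("99Luftballons1", "99Luftballons2", "sleepqual")
sub("\\d+", "", nm)   # "Luftballons1"  "Luftballons2"  "sleepqual"  (leading digits stripped)
sub("\\d+$", "", nm)  # "99Luftballons" "99Luftballons" "sleepqual"  (only the suffix stripped)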
Anoushiravan R
  • Nice use of `sub` in `group_by`, this avoids the need for `mutate`. – Baraliuh Jun 23 '21 at 17:06
  • Thank you very much. Yes, I also recently realized we can do a `mutate` in `group_by`. – Anoushiravan R Jun 23 '21 at 17:08
  • I think this has a weakness, though: since you're doing `sub("\\d+", "", name)` to target any old `"\\d+"`, won't you end up removing _every_ number in `name`? So if you have names with numbers _before_ the suffix, the new column names will get butchered. For example: the columns `99Luftballons1` | `99Luftballons2` | ... | `99Luftballons20` would get unpivoted into the single column named `Luftballons`. Might this be addressed by using the regex `"\\d+$"`? – Greg Jun 25 '21 at 14:38
  • Generally speaking yes, in that case it may lead to erroneous results, so we can add `$` to specify the removal of only the digits at the end of the word. I didn't think that the OP's data set names might contain numbers other than the suffix. However, it would be a good idea to prepare for unexpected scenarios. Thank you @Greg for your constructive comment. – Anoushiravan R Jun 25 '21 at 14:46
  • Happy to help! Maybe I'm just an eternal pessimist, but whenever I've finished a solution I really like, a little nagging voice materializes to say: "it would be a _shame_ if there were an _edge case_ to destroy your hopes and dreams..." I can either ignore the voice at my own peril, or heed it at my own peril. :D – Greg Jun 25 '21 at 14:53
  • For a moment I thought I was reading an excerpt of Hemingway's, lol. You must be an avid reader. I can't agree more, better to be a pessimist than a naive optimist. But it's the responsibility of the OP to define every possible case that exists in the original data set. I also need to work on regex more, as there is so much room for improvement for me in that area. – Anoushiravan R Jun 25 '21 at 15:17
  • I am indeed an avid reader (though I'm not sure I merit any comparisons to Hemingway). And well said on the rest! I'm 99% sure regex is a form of sorcery, and I intend to prove it... – Greg Jul 01 '21 at 21:57
  • Yes, I know some people who do unusual things with regex. I just have to get down to it and learn from the very beginning. I'm not a fan of Hemingway but he was an icon, and I think you've got some sort of writing flair. – Anoushiravan R Jul 01 '21 at 22:42