rowSums excluding a particular value in a dplyr pipe without modifying the underlying data frame

Question

I have a data frame of questionnaire data in wide format, with each column representing a questionnaire item.

The data looks something like the following:

df <- data.frame(Q1 = c(1, 4, 2, 3, 1, 1, 4, 4, 1, 2), 
             Q2 = c(NA, 3, 1, 4, NA, NA, 3, 4, 1, 2),
             Q3 = c(3, 4, 1, 2, 4, NA, NA, 1, 1, 2),
             Q4 = c(NA, 4, 1, 1, 1, 3, NA, 2, 2, NA))

I want to use the rowSums function to sum up the values in each row that are not "4" and to exclude the NAs and divide the result by the number of non-4 and non-NA columns (using a dplyr pipe). I do not want to replace the 4s in the underlying data frame; I want to leave it as it is.

As I do not know how to divide the result by the number of non-4 and non-NA columns, I have only tried attempting the first part of my question. I have used the following codes to attempt the first part, but it did not work:

library(dplyr)

df <- df %>%
  as.data.frame() %>%
  mutate(sum = rowSums(.[. != 4, ], na.rm = TRUE))

The desired output would look something like the following:

In the screenshot above, the "mean" column is the sum of non-4 and non-NA values divided by the number of non-4 and non-NA columns.

Thanks!

the pipe is `magrittr` not `dplyr`. – s_baldur Jul 18 '18 at 08:56 — s_baldur, Jul 18 '18 at 08:56

score 2 · Answer 1 · answered Jul 18 '18 at 09:01

If we want to do this strictly in dplyr we can use rowwise with do and calculate for each row sum of values which are not 4 and divide them by the length of the values.

library(dplyr)
df %>%
  rowwise() %>%
  do( (.) %>% as.data.frame %>% 
  mutate(mean = sum(.[. != 4], na.rm = TRUE)/length(.[.!=4 & !is.na(.)])))


#    Q1    Q2    Q3    Q4   mean
# * <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  1.00 NA     3.00 NA     2.00
# 2  4.00  3.00  4.00  4.00  3.00
# 3  2.00  1.00  1.00  1.00  1.25
# 4  3.00  4.00  2.00  1.00  2.00
# 5  1.00 NA     4.00  1.00  1.00
# 6  1.00 NA    NA     3.00  2.00
# 7  4.00  3.00 NA    NA     3.00
# 8  4.00  4.00  1.00  2.00  1.50
# 9  1.00  1.00  1.00  2.00  1.25
#10  2.00  2.00  2.00 NA     2.00

EDIT - And after posting the answer, now I realised we can actually use mean

df %>%
  rowwise() %>%
  do( (.) %>% as.data.frame %>% 
  mutate(mean = mean(.[. != 4], na.rm = TRUE)))

What does `do` does? I read the documentation multiple times but still do not understand how it works in this particular context. Thanks! — DTYK, Jul 18 '18 at 09:13
@DTYK `do` can be used on grouped operation, group here being each row. You can read more [here](https://stackoverflow.com/questions/48182815/when-to-use-do-function-in-dplyr) — Ronak Shah, Jul 18 '18 at 09:29

s_baldur · Accepted Answer · 2018-07-18T09:32:59.980

2

sp_mean <- function(x) mean(x[!is.na(x) & x != 4])
df$mean <- 
  df %>%
  apply(1, sp_mean)

df
   Q1 Q2 Q3 Q4 mean
1   1 NA  3 NA 2.00
2   4  3  4  4 3.00
3   2  1  1  1 1.25
4   3  4  2  1 2.00
5   1 NA  4  1 1.00
6   1 NA NA  3 2.00
7   4  3 NA NA 3.00
8   4  4  1  2 1.50
9   1  1  1  2 1.25
10  2  2  2 NA 2.00

Edit 1 - Slightly more robust:

df$mean <- 
  df %>%
  select(matches("^Q\\d+")) %>%
  apply(1, sp_mean)

matches("^Q\\d+") matches column names starting with Q1, Q2,..., Q199, Q200, ...

Edit 2 - Combining my sp_mean() with the Ronak's solution (no do() required?):

df %>%
  rowwise() %>%
  mutate(mean = sp_mean(c(Q1, Q2, Q3, Q4)))

edited Jul 18 '18 at 09:32

answered Jul 18 '18 at 09:01

s_baldur

29,441
4
36
69

Just trying to understand how sp_mean work in `apply`. The function goes through every column in that particular row. For each value that is NA or 4, it is ignored. For all other values, it is taken into account and used for the mean computation. Is my understanding correct? Thanks! – DTYK Jul 18 '18 at 09:28
1

Your understanding is correct. The thing to be careful about is "every column in that particular row", so if you have other columns (or accidentally run the line again) you have to do some sort of select first (a la Edit 1). – s_baldur Jul 18 '18 at 09:30
I don't understand what you mean by "if you have other columns (or accidentally run the line again)." Could you elaborate on that? Thanks! – DTYK Jul 18 '18 at 09:49
1

Acutally in this special case running again is ok.. result is the same.. but try adding some random numeric column to your data.frame and see how the results change because apply() calculates over *every* column. – s_baldur Jul 18 '18 at 09:55

score 2 · Answer 3 · answered Jul 18 '18 at 09:39

2

Using base R you can do:

df$mean = rowMeans(`is.na<-`(df,df==4),T)#or rowMeans(replace(df,df==4,NA),T)
> df
   Q1 Q2 Q3 Q4 mean
1   1 NA  3 NA 2.00
2   4  3  4  4 3.00
3   2  1  1  1 1.25
4   3  4  2  1 2.00
5   1 NA  4  1 1.00
6   1 NA NA  3 2.00
7   4  3 NA NA 3.00
8   4  4  1  2 1.50
9   1  1  1  2 1.25
10  2  2  2 NA 2.00

answered Jul 18 '18 at 09:39

Onyambu

67,392
3
24
53

What does the backticks do? Creating a temporary data frame for use in the `rowMeans` function? Thanks! – DTYK Jul 18 '18 at 09:48
@DTYK the `\`is.na<-\`` is a function in r – Onyambu Jul 18 '18 at 09:49
Huh. It's not an assignment symbol? – DTYK Jul 18 '18 at 09:51
1

it is a replacement function ie `replace(df,df==4,NA)` is the same as `\`is.na<-\`(df,df==4)` – Onyambu Jul 18 '18 at 09:52
`T` instead of `TRUE` can be dangerous. Because `T <- 0`. Neat solution though. – s_baldur Jul 18 '18 at 11:19

rowSums excluding a particular value in a dplyr pipe without modifying the underlying data frame

3 Answers3