0

I have a data frame of questionnaire data in wide format, with each column representing a questionnaire item.

The data looks something like the following:

df <- data.frame(Q1 = c(1, 4, 2, 3, 1, 1, 4, 4, 1, 2), 
             Q2 = c(NA, 3, 1, 4, NA, NA, 3, 4, 1, 2),
             Q3 = c(3, 4, 1, 2, 4, NA, NA, 1, 1, 2),
             Q4 = c(NA, 4, 1, 1, 1, 3, NA, 2, 2, NA))

I want to use the rowSums function to sum up the values in each row that are not "4" and to exclude the NAs and divide the result by the number of non-4 and non-NA columns (using a dplyr pipe). I do not want to replace the 4s in the underlying data frame; I want to leave it as it is.

As I do not know how to divide the result by the number of non-4 and non-NA columns, I have only tried attempting the first part of my question. I have used the following codes to attempt the first part, but it did not work:

library(dplyr)

df <- df %>%
  as.data.frame() %>%
  mutate(sum = rowSums(.[. != 4, ], na.rm = TRUE))

The desired output would look something like the following: rowSums excluding particular value

In the screenshot above, the "mean" column is the sum of non-4 and non-NA values divided by the number of non-4 and non-NA columns.

Thanks!

DTYK
  • 1,098
  • 1
  • 8
  • 33

3 Answers3

2

If we want to do this strictly in dplyr we can use rowwise with do and calculate for each row sum of values which are not 4 and divide them by the length of the values.

library(dplyr)
df %>%
  rowwise() %>%
  do( (.) %>% as.data.frame %>% 
  mutate(mean = sum(.[. != 4], na.rm = TRUE)/length(.[.!=4 & !is.na(.)])))


#    Q1    Q2    Q3    Q4   mean
# * <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  1.00 NA     3.00 NA     2.00
# 2  4.00  3.00  4.00  4.00  3.00
# 3  2.00  1.00  1.00  1.00  1.25
# 4  3.00  4.00  2.00  1.00  2.00
# 5  1.00 NA     4.00  1.00  1.00
# 6  1.00 NA    NA     3.00  2.00
# 7  4.00  3.00 NA    NA     3.00
# 8  4.00  4.00  1.00  2.00  1.50
# 9  1.00  1.00  1.00  2.00  1.25
#10  2.00  2.00  2.00 NA     2.00

EDIT - And after posting the answer, now I realised we can actually use mean

df %>%
  rowwise() %>%
  do( (.) %>% as.data.frame %>% 
  mutate(mean = mean(.[. != 4], na.rm = TRUE)))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • What does `do` does? I read the documentation multiple times but still do not understand how it works in this particular context. Thanks! – DTYK Jul 18 '18 at 09:13
  • 1
    @DTYK `do` can be used on grouped operation, group here being each row. You can read more [here](https://stackoverflow.com/questions/48182815/when-to-use-do-function-in-dplyr) – Ronak Shah Jul 18 '18 at 09:29
2
sp_mean <- function(x) mean(x[!is.na(x) & x != 4])
df$mean <- 
  df %>%
  apply(1, sp_mean)

df
   Q1 Q2 Q3 Q4 mean
1   1 NA  3 NA 2.00
2   4  3  4  4 3.00
3   2  1  1  1 1.25
4   3  4  2  1 2.00
5   1 NA  4  1 1.00
6   1 NA NA  3 2.00
7   4  3 NA NA 3.00
8   4  4  1  2 1.50
9   1  1  1  2 1.25
10  2  2  2 NA 2.00

Edit 1 - Slightly more robust:

df$mean <- 
  df %>%
  select(matches("^Q\\d+")) %>%
  apply(1, sp_mean)

matches("^Q\\d+") matches column names starting with Q1, Q2,..., Q199, Q200, ...

Edit 2 - Combining my sp_mean() with the Ronak's solution (no do() required?):

df %>%
  rowwise() %>%
  mutate(mean = sp_mean(c(Q1, Q2, Q3, Q4)))
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • Just trying to understand how sp_mean work in `apply`. The function goes through every column in that particular row. For each value that is NA or 4, it is ignored. For all other values, it is taken into account and used for the mean computation. Is my understanding correct? Thanks! – DTYK Jul 18 '18 at 09:28
  • 1
    Your understanding is correct. The thing to be careful about is "every column in that particular row", so if you have other columns (or accidentally run the line again) you have to do some sort of select first (a la Edit 1). – s_baldur Jul 18 '18 at 09:30
  • I don't understand what you mean by "if you have other columns (or accidentally run the line again)." Could you elaborate on that? Thanks! – DTYK Jul 18 '18 at 09:49
  • 1
    Acutally in this special case running again is ok.. result is the same.. but try adding some random numeric column to your data.frame and see how the results change because apply() calculates over *every* column. – s_baldur Jul 18 '18 at 09:55
2

Using base R you can do:

df$mean = rowMeans(`is.na<-`(df,df==4),T)#or rowMeans(replace(df,df==4,NA),T)
> df
   Q1 Q2 Q3 Q4 mean
1   1 NA  3 NA 2.00
2   4  3  4  4 3.00
3   2  1  1  1 1.25
4   3  4  2  1 2.00
5   1 NA  4  1 1.00
6   1 NA NA  3 2.00
7   4  3 NA NA 3.00
8   4  4  1  2 1.50
9   1  1  1  2 1.25
10  2  2  2 NA 2.00
Onyambu
  • 67,392
  • 3
  • 24
  • 53