0

I am trying to clean a data set so I can analyse it with ease.

I have a data set that looks like this:

              z  a   b   c   d  a_1   b_2   c_3   d_4  ab_1  ab_2
Participant1  A  1   3   4   3   2     3     6     7     2    1
Participant2  B  3   2   4   5   6     4     2     1     2    1
Participant3  C  1   3   5   4   2     2     1     2     3    4
Participant4  D  1   3   8   1   3     4     5     4     5    2

I added a column of the average score for each tested variable. Accordingly I used the below codes:

CleanData <- CleanData%>%
  group_by(ResponseId) %>%
  mutate(MeanA = mean(as.numeric(c(a, b, c, d)))) %>%
  mutate(MeanB = mean(as.numeric(c(a_1, b_2, c_3, d_4)))) %>%
  mutate(MeanC = mean(as.numeric(c(ab_1, ab_2)))) 

Using the above codes, I was able to add three new columns (MeanA, MeanB & MeanC) to the existing data frame, i.e:

            Z a   b   c   d  a_1   b_2   c_3   d_4  ab_1  ab_2 meanA meanB meancC
Participant1 A 1   3   4   3   2     3     6     7     2    1   2.75   4,5    1,5
Participant2 B 3   2   4   5   6     4     2     1     2    1     x     x     x
Participant3 C 1   3   5   4   2     2     1     2     3    4     x     x     x
Participant4 D 1   3   8   1   3     4     5     4     5    2     x     x     x

However, the mean columns do not show the correct numbers. For instance, mean of the values ab_1: 3 and ab_2: 4 should return 3.5, but the column instead shows 5. Can anyone help me understand why this is happening?

Thank you

UPDATE: I figured out that the values: a_1, b_2, c_3, d_4, ab_1, ab_2 are in factors, so I have changed them into numerics by adding, for instance:

as.numeric(as.character(mean(c(ab_1, ab_2))))

But it still does not return the correct values!!

UPDATE2: dput returns below

structure(list(ResponseId = structure(c(32L, 40L, 113L, 43L), .Label = c("R_10UwSkTd0ZY0I6s", 
"R_12z514ODiHL80r6", "R_1Bx5NSYrdqhaIuJ", "R_1C8bt9iRS6L1817", 
"R_1C91Jv4yYTt8VsS", "R_1Cj5l50lu2VKcxg", "R_1DD2zbJPiueDijH", 
"R_1FA36tDxZEUqBJM", "R_1FLH4KQaKlxMn77", "R_1FeUYLmPoDpKnOb", 
"R_1GAXJ8YmAJflif4", "R_1GE4YeIfEuZuUYX", "R_1GyaiQcync3sHMJ", 
"R_1IRUh7uRVEI3Op1", "R_1KjzzPz48WkuCFk", "R_1LcddBAuXLiBDgo", 
"R_1M65lFIT9h0TrpQ", "R_1NCw01UV09WSGSX", "R_1QlN6pk1IH5xXqn", 
"R_1XMp4A95WZLJmWl", "R_1dBn0Bb9ZPprg3R", "R_1dcu8FxLC7E2tq8", 
"R_1dgBt0IGuRjDD4W", "R_1eRGDucOhbsOkgI", "R_1eXovzU5BJzWdt9", 
"R_1g5Unx7VkpjcHEe", "R_1gtnPQhBie1xBLq", "R_1guMjae0AF4k9yh", 
"R_1j6xFfGDt6QCgc3", "R_1jOf096UI4ManbG", "R_1l3smK23l8TPmso", 
"R_1ohW9lO1Qw7D2Nt", "R_1pDO3wCW2J7X9Mw", "R_1qU6JvhOWUPen4Y", 
"R_23e2rcSv4MzoilL", "R_27W3bSNMVuVHxLl", "R_2Bqy2lJu3vKgTkC", 
"R_2PiG0aapXGpeAhi", "R_2PjUnRQM4nSaYuO", "R_2QnhIx6nPDMKt4J", 
"R_2S1saV61VMsVhzA", "R_2SjpkptLr59Pi5W", "R_2VQn7jIGZe7rSiJ", 
"R_2VgwMZugDIzh2gc", "R_2Y3ylZL7Os57FQg", "R_2YtPLlRDm3W0IiF", 
"R_2cjhW6bLhpKoWeD", "R_2czLWk4DA7AtUWB", "R_2dhg48SvjsPF7xi", 
"R_2e3xcJj4B4sVxpu", "R_2fcI8PF5YakOibH", "R_2qqi75IiKtupT3k", 
"R_2s705LLk7dRpBtu", "R_2tKAG3kO0uTAp3W", "R_2tglcBV7ZIr5fRL", 
"R_2uWFRjzrewJ9lZp", "R_2w6pFIySy6z417N", "R_2wRKqbJfulfII8L", 
"R_2zTr8igVCFw8pBZ", "R_2zbYrfdxoSMQDZw", "R_2zqoRmevtHcELGs", 
"R_32XQhbZdKS1gZqP", "R_33C0jdizQOZOCam", "R_3CWxNxWuLPFvTzi", 
"R_3DbBdhc5Z5gYa8p", "R_3ESZKQIuMZpxjWb", "R_3HhHnbzqwkjBTyr", 
"R_3L76M1sxyZT8LZF", "R_3LecRRD5DH4kAHZ", "R_3NwVV3CjA62MD4l", 
"R_3Oj2rtphCHA2eid", "R_3OjKwwD8cFQtLqs", "R_3e3rPFQew2oBcZB", 
"R_3gT9FFFMBZ0iVhF", "R_3h6MPewodtUKKEi", "R_3lAoiHlF9KgfbFB", 
"R_3lFbGqMCvw5n4bo", "R_3nAFNfIRZmI8wzE", "R_3nTXeACPBHRGkZI", 
"R_3q7MZwtck0MNgI0", "R_3qwfz0yMLuWw81M", "R_3suKETbTurYGy0m", 
"R_3yleFr647Wb8EQV", "R_55a2UXtA2wqJeXD", "R_5w2RvbtXFaQpLMd", 
"R_6Gs5YBPSE2abloR", "R_6LFgmHhwol4jAEp", "R_9WRZaEXexVKfN4t", 
"R_AMNieClr3z90yuB", "R_BtXxoHRS0zenny1", "R_Cg1tqfa6iBeV6Ct", 
"R_DiBGGojPdv9dDjz", "R_DjKGN63D9nlkRoJ", "R_DzUYyku3vvPVHgd", 
"R_OlJDf3crqx5ixC9", "R_T6pA7HhF1QdSffz", "R_UlL0oXl78e1DAWJ", 
"R_Wk3USlMJv0smEXD", "R_XRO3jxTdfN1ATyV", "R_XyLTjQJ24ZC2WnT", 
"R_Z3o0VAV64k7sUmt", "R_ZltqETejsEhSOE9", "R_b4xwkSOfYOui3nP", 
"R_bPCma1oTtGdWKpr", "R_cuAayrJroDlIlep", "R_dclKNvKcXlAfHs5", 
"R_dgr9ooE75LVOpWh", "R_e2oMPa2WBgFmW9H", "R_eF0zW2RqPXuWtah", 
"R_eKfj3XyRPgRlEsh", "R_oXC0xDrC6uq8uEp", "R_qPowYCxUMjc2Bz3", 
"R_sMtw9IU4BTmSBMt", "R_tVxiAfQAU1TpQ8p", "R_vq3xb7ppDytKW4x", 
"R_x8TnBj3YBj2R345", "R_xitfzNOT9yXK1vr", "R_yJSFhJqUec5dHfH", 
"Response ID", "{\"ImportId\":\"_recordId\"}"), class = "factor"), 
    FL_8_DO = structure(c(6L, 6L, 1L, 1L), .Label = c("", "FL_27", 
    "FL_28", "FL_8 - Block Randomizer - Display Order", "F", 
    "H", "{\"ImportId\":\"FL_8_DO\"}"), class = "factor"), 
    Gender = structure(c(2L, 2L, 1L, 2L), .Label = c("", "1", 
    "2", "what is your gender?", "{\"ImportId\":\"QID14\"}"), class = "factor"), 
    Age = structure(c(2L, 10L, 1L, 7L), .Label = c("", "10", 
    "12", "14", "20", "3", "4", "5", "6", "7", "8", "9", "how old are you?", 
    "{\"ImportId\":\"QID27\"}"), class = "factor"), Per_1 = c(4, 
    2, NA, 1), Per_2 = c(3, 2, NA, 1), A_1 = c(NA_real_, 
    NA_real_, NA_real_, NA_real_), A_2 = c(NA_real_, NA_real_, 
    NA_real_, NA_real_), A_3 = c(NA_real_, NA_real_, NA_real_, 
    NA_real_), A_4 = c(NA_real_, NA_real_, NA_real_, NA_real_
    ), A_5 = c(NA_real_, NA_real_, NA_real_, NA_real_), A_6 = c(NA_real_, 
    NA_real_, NA_real_, NA_real_), A_7 = c(NA_real_, NA_real_, 
    NA_real_, NA_real_), A_8 = c(NA_real_, NA_real_, NA_real_, 
    NA_real_), A_9 = c(NA_real_, NA_real_, NA_real_, NA_real_
    ), Mo_1 = c(NA_real_, NA_real_, NA_real_, NA_real_), 
   Mo_2 = c(NA_real_, NA_real_, NA_real_, NA_real_), Mo_3 = c(NA_real_, 
    NA_real_, NA_real_, NA_real_), Mo_4 = c(NA_real_, NA_real_, 
    NA_real_, NA_real_), MeanA = c(NA_real_, NA_real_, NA_real_, 
    NA_real_)), .Names = c("ResponseId", "FL_8_DO", "Gender", 
"Age", "Per_1", "Per_2", "A_1", "A_2", "A_3", 
"A_4", "A_5", "A_6", "A_7", "A_8", "A_9", "Mo_1", 
"Mo_2", "Mo_3", "Mo_4", "MeanA"), row.names = c(NA, 
4L), class = "data.frame")
user240313
  • 17
  • 6
  • Can you show an example. Do you need `rowMeans` instead of `mean` i.e. `CleanData %>% mutate(MeanA = rowMeans(.[c('a', 'b', 'c', 'd')]), MeanB = rowMeans(.[c("a_1", "b_2", "c_3", "d_4")]))` Or is it the `mean` of the columns and the rows per group? – akrun Apr 22 '19 at 18:48
  • This would be much easier to answer if you can provide some sample data and show expected output. Otherwise, it's hard to know what's wrong, because this is a logic error not syntax error and we don't know the logic you want to implement. – Reeza Apr 22 '19 at 19:28
  • Thank you for your replies, I added more information above!! – user240313 Apr 23 '19 at 06:36
  • I tried rowMeans, but it returns an error: Error in mutate_impl(.data, dots) : Evaluation error: 'x' must be numeric. – user240313 Apr 23 '19 at 06:39
  • all of the relevant columns are `NA`. Difficult to calculate a meaningful mean in this case. – Cettt Apr 23 '19 at 13:43
  • Most of the values turned into NAs when I used the code: df %>% mutate_at(vars("a", "b", "c", "d"), ~as.numeric(levels(.x))[.x]) – user240313 Apr 23 '19 at 14:09
  • 1
    Update: Ah I see what went wrong, one of the columns already displayed numeric, hence it didn't work when I used the above code. Now I have solved all the problems, thanks a lot for the help @Cettt! – user240313 Apr 23 '19 at 14:13

1 Answers1

0

Hi the problem is that mutate operates on columns.

Consider this small example:

df <- tibble(a = 1:3, b = 4:6)
df %>% mutate(x = mean(c(a, b)))
    # A tibble: 3 x 3
      a     b     x
  <int> <int> <dbl>
1     1     4   3.5
2     2     5   3.5
3     3     6   3.5

What mutate does is it stacks a and b together such that it gets the vector 1,2,3,4,5,6 and then computes its mean which is 3.5.

In order for mutate to operate row-wise you have to use the rowwise function:

df %>% rowwise() %>%  mutate(x = mean(c(a, b)))
# A tibble: 3 x 3
      a     b     x
  <int> <int> <dbl>
1     1     4   2.5
2     2     5   3.5
3     3     6   4.5

UPDATE: If you have columns which are class factor you should first convert them to numeric. There are several ways to do so as presented in this question.

In your case you could use

df %>% mutate_at(vars("a", "b", "c", "d"), ~as.numeric(levels(.x))[.x])

to convert columns a, b, c and d from factor to numeric.

Cettt
  • 11,460
  • 7
  • 35
  • 58
  • Thank you for your comment - I figured adding the code "group_by(ResponseId)" would solve this problem, would this not? Additionally, I am afraid it still does not return the correct values – user240313 Apr 23 '19 at 09:40
  • Based on the data you provided I don't think you need a `group_by`. Maybe if you include your desired output (including numbers) we could better understand what you are looking for. – Cettt Apr 23 '19 at 10:01
  • Upadate: I see that when converting factors into numeric, using the above code, 1 is added to all values. So what is 4 in factor is converted into 5 in numeric - I think this is causing the problem! – user240313 Apr 23 '19 at 10:02
  • I have edited the original question, so it shows the desired outcomes for participant1 - hope this helps..? – user240313 Apr 23 '19 at 10:05
  • I updated my answer to show you how to deal with factors. – Cettt Apr 23 '19 at 10:18
  • Thank you for the comment & the code! I have tried with the code you indicated, and I get an error message: "NAs introduced by coercionNAs introduced by coercion", with the columns filled with NAs. Would you know why this is so? thank you – user240313 Apr 23 '19 at 12:53
  • Please post the output of `dput(head(<>))` inside your question – Cettt Apr 23 '19 at 12:55
  • Thank you, I have posted the output! – user240313 Apr 23 '19 at 13:35