
I'm trying to compare a control group with an experimental group on a range of variables to show that they are similar at baseline.

I therefore need to run multiple unpaired (Welch) t-tests. My data is in long format, with the first variable, "Group", coded as either 1 or 2. There are some missing values in some of the other variables, but they are fairly random.

So when I run t-test manually using this line of code:

t.test(variable_1 ~ Group, data = df)

it works.

I then tried to do it all at once using this line of code:

sapply(df[, 2:71], function(i) t.test(i ~ df$Group)$p.value)

But I get the following error:

grouping factor must have exactly 2 levels

Could anyone help?

Here is what the structure looks like:

structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 2, 2), EM_Accuracy_Time_Airport = c(3, 3, 0, 
1, 1, 2, 2, 1, 1, 3, 3, 2, 2, 2, 1, 3, 1, 3, 1, 1), EM_Accuracy_Place_Airport = c(2, 
2, 1, 2, 1, 2, 2, 1, 1, 2, 0, 2, 2, 0, 2, 2, 2, 1, 1, 1), EM_Accuracy_Expl_Airport = c(2, 
2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 2, 0, 0, 1, 0, 2, 2, 1), EM_Accuracy_Death_Airport = c(0, 
2, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0), EM_Accuracy_Time_Metro = c(3, 
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 3, 1, 1, 2, 1, 3, 3), EM_Accuracy_Death_Metro = c(3, 
0, 1, 0, 1, 1, 0, 0, 0, 3, 0, 0, 1, 0, 3, 1, 1, 1, 0, 0), EM_Accuracy_PC_Time_Airpot = c(100, 
100, 0, 33.3333333333333, 33.3333333333333, 66.6666666666667, 
66.6666666666667, 33.3333333333333, 33.3333333333333, 100, 100, 
66.6666666666667, 66.6666666666667, 66.6666666666667, 33.3333333333333, 
100, 33.3333333333333, 100, 33.3333333333333, 33.3333333333333
), EM_Accuracy_PC_Place_Airport = c(100, 100, 50, 100, 50, 100, 
100, 50, 50, 100, 0, 100, 100, 0, 100, 100, 100, 50, 50, 50), 
    EM_Accuracy_PC_Expl_Airport = c(100, 100, 100, 0, 100, 100, 
    100, 50, 100, 100, 100, 100, 100, 0, 0, 50, 0, 100, 100, 
    50), EM_Accuracy_PC_Death_Airport = c(0, 66.6666666666667, 
    0, 0, 33.3333333333333, 66.6666666666667, 0, 0, 0, 0, 0, 
    0, 66.6666666666667, 0, 0, 0, 100, 0, 0, 0), EM_Accuracy_PC_Time_Metro = c(100, 
    33.3333333333333, 0, 0, 33.3333333333333, 33.3333333333333, 
    0, 33.3333333333333, 33.3333333333333, 33.3333333333333, 
    33.3333333333333, 66.6666666666667, 33.3333333333333, 100, 
    33.3333333333333, 33.3333333333333, 66.6666666666667, 33.3333333333333, 
    100, 100), EM_Accuracy_PC_Death_Metro = c(100, 0, 33.3333333333333, 
    0, 33.3333333333333, 33.3333333333333, 0, 0, 0, 100, 0, 0, 
    33.3333333333333, 0, 100, 33.3333333333333, 33.3333333333333, 
    33.3333333333333, 0, 0), EM_ACCURACY_PC = c(83.3333333333333, 
    66.6666666666667, 30.5555555555556, 22.2222222222222, 47.2222222222222, 
    66.6666666666666, 44.4444444444444, 27.7777777777778, 36.1111111111111, 
    72.2222222222222, 38.8888888888889, 55.5555555555555, 66.6666666666666, 
    27.7777777777778, 44.4444444444444, 52.7777777777778, 55.5555555555556, 
    52.7777777777778, 47.2222222222222, 38.8888888888889), EM_Certainty_Time_Airport = c(3, 
    1, 1, 1, 2, 2, 1, 1, 2, 3, 3, 2, 2, 2, 4, 2, 3, 3, 2, 2), 
    EM_Certainty__Place_Airport = c(3, 4, 2, 2, 2, 2, 4, 1, 3, 
    4, 4, 4, 4, 3, 3, 4, 4, 3, 2, 3), EM_Certainty__Expl_Airport = c(4, 
    2, 3, 1, 2, 3, 2, 1, 2, 4, 1, 3, 2, 2, 1, 3, 1, 2, 2, 3), 
    EM_Certainty__Death_Airport = c(1, 1, NA, 1, 2, 1, 3, 1, 
    2, 3, NA, 3, 2, 1, 2, 1, 1, 1, 4, 4), EM_Certainty__Time_Metro = c(3, 
    3, 1, 1, 2, 2, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2, 3, 1, 2, 2), 
    EM_Certainty__Death_Metro = c(2, 1, 1, NA, 2, 1, 1, 1, 2, 
    1, NA, 3, 2, 1, 1, 1, 1, 1, 1, 4), EM_CERTAINTY = c(2.66666666666667, 
    2, 1.6, 1.2, 2, 1.83333333333333, 2.16666666666667, 1, 2.33333333333333, 
    2.83333333333333, 2.75, 2.83333333333333, 2.5, 1.83333333333333, 
    2.16666666666667, 2.16666666666667, 2.16666666666667, 1.83333333333333, 
    2.16666666666667, 3), EM_CONFIDENCE = c(5, 5, 1, 2, 2, 4, 
    5, 2, 3, 4, 5, 5, 3, 3, 4, 4, 3, 2, 3, 2), FBM_CONFIDENCE = c(4, 
    6, 7, 7, 5, 4, 2, 7, 5, 6, 6, 7, 6, 7, 3, 6, 6, 4, 5, 6), 
    FBM_Vividness_Time = c(3, 3, 1, 4, 3, 2, 4, 3, 4, 4, 1, 3, 
    4, 4, 3, 3, 3, 2, 4, 3), FBM_Vividness_How = c(4, 4, 2, 4, 
    4, 3, 4, 4, 4, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4), FBM_Vividness_Where = c(4, 
    4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), 
    FBM_Vividness_WithWhom = c(4, 4, 3, 4, 3, 4, 4, 4, 4, 4, 
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4), FBM_Vividness_WereDoing = c(4, 
    4, 1, 4, 3, 4, 4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4), 
    FBM_Vividness_Did_After = c(4, 4, 3, 4, 2, 3, 4, 4, 2, 4, 
    1, 4, 4, 4, 3, 4, 4, 3, 4, 4), FBM_VIVIDNESS = c(3.83333333333333, 
    3.83333333333333, 2, 4, 3.16666666666667, 3.33333333333333, 
    4, 3.83333333333333, 3.66666666666667, 4, 2.33333333333333, 
    3.83333333333333, 3.83333333333333, 4, 3.66666666666667, 
    3.83333333333333, 3.83333333333333, 3.5, 4, 3.83333333333333
    ), FBM_Details_NB_T2 = c(3, 5, 0, 5, 5, 5, 2, 5, 1, 5, 3, 
    5, 5, 5, 2, 4, 2, 3, 5, 5), P_Novelty_5 = c(5, 6.2, 6.5, 
    5.6, 4.8, 5.4, 4, 4.2, 4.4, 5.8, 3.4, 5.8, 6, 5.8, 3.8, 6.4, 
    6.8, 6.6, 7, 3), P_Suprise_emotion = c(6, 6, 6, 6, 4, 5, 
    1, 7, 1, 5, 4, 5, 7, 7, 6, 4, 7, 7, 2, 5), P_Surprise_Expected = c(1, 
    3, 5, 2, 4, 3, 6, 2, 2, 1, 6, 4, 3, 1, 5, 1, 1, 1, 5, 4), 
    P_Surprise_Unbelievable = c(5, 4, 1, 6, 4, 4, 2, 7, 1, 4, 
    1, 6, 7, 7, 6, 3, 7, 7, 5, 3), `P_Consequence-Importance_5` = c(5.6, 
    4.8, 3.4, 5, 4.8, 4, 5, 5.4, 3, 5.2, 6.8, 5.4, 4, 4.4, 6, 
    3.8, 4, 4.8, 5, 5.2), P_Emotional_Intensity_4 = c(5.25, 5.75, 
    3, 4.75, 4.75, 6, 4, 5.25, 2.5, 5.5, 7, 6.5, 5.75, 6.75, 
    6.75, 6, 6.25, 6, 5, 2.5), P_Social_Sharing_6 = c(3.66666666666667, 
    3.83333333333333, 3.4, 3.16666666666667, 3, 3.33333333333333, 
    3.8, 3.16666666666667, 2.16666666666667, 4.16666666666667, 
    4, 4.5, 4.5, 4.33333333333333, 4, 3.16666666666667, 3.66666666666667, 
    4, NA, NA), P_Media_3 = c(4.66666666666667, 4, 3, 2.66666666666667, 
    2.66666666666667, 2.33333333333333, 3, 2.33333333333333, 
    2.33333333333333, 3.33333333333333, 4.33333333333333, 5, 
    4.33333333333333, 5, 4, 2, 3, 3.33333333333333, 2, 1.66666666666667
    ), P_Ruminations = c(3, NA, 3, 2, 4, NA, 4, 2, 1, 4, 4, 4, 
    2, 4, 2, 3, 3, 3, 4, 3), P_Novelty_Common_rev = c(6, 7, 7, 
    7, 4, 6, 4, 7, 2, 6, 3, 7, 7, 7, 3, 6, 7, 7, 7, 3), P_Novelty_Unusual = c(2, 
    5, 7, 7, 3, 5, 3, 3, 5, 6, 1, 4, 7, 1, 4, 6, 6, 6, 7, 2), 
    P_Novelty_Special = c(6, 6, NA, 6, 5, 5, 4, 3, 5, 4, 1, 5, 
    6, 7, 4, 6, 7, 7, 7, 3), P_Novelty_Singular = c(4, 6, 5, 
    1, 5, 5, 4, 1, 3, 6, 5, 6, 4, 7, 3, 7, 7, 6, 7, 2), P_Novelty_Ordinary_rev = c(7, 
    7, 7, 7, 7, 6, 5, 7, 7, 7, 7, 7, 6, 7, 5, 7, 7, 7, 7, 5), 
    P_Consequence = c(6, 7, 5, 4, 5, 4, 5, 3, 5, 5, 7, 5, 5, 
    2, 6, 6, 1, 4, 6, 3), P_Importance_self = c(4, 3, 3, 4, 4, 
    3, 5, 6, 1, 5, 7, 5, 3, 3, 5, 2, 2, 4, 5, 3), `P_Importance_friends&family` = c(4, 
    4, 3, 4, 4, 4, 4, 6, 1, 5, 6, 5, 3, 3, 5, 2, 6, 4, 5, 10), 
    P_Importance_Belgium = c(7, 5, 3, 7, 6, 5, 6, 7, 3, 7, 7, 
    7, 5, 7, 7, 5, 6, 7, 6, 6), P_Importance_International = c(7, 
    5, 3, 6, 5, 4, 5, 5, 5, 4, 7, 5, 4, 7, 7, 4, 5, 5, 3, 4), 
    P_Emotional_Intensity_Upset = c(4, 5, NA, 3, 3, 5, 3, 5, 
    2, 5, 7, 5, 5, 6, 7, 6, 6, 5, 5, 3), P_Emotional_Intensity_Indiferent_rev = c(7, 
    7, 5, 7, 6, 7, 4, 6, 4, 7, 7, 7, 7, 7, 7, 7, 7, 7, NA, 4), 
    P_Emotional_Intensity_Affected = c(6, 6, 3, 5, 5, 6, 5, 6, 
    2, 5, 7, 7, 5, 7, 7, 6, 6, 6, NA, 2), P_Emotional_Intensity_Shaken = c(4, 
    5, 1, 4, 5, 6, 4, 4, 2, 5, 7, 7, 6, 7, 6, 5, 6, 6, 5, 1), 
    P_Rehearsal_Media_TV = c(5, 3, NA, 3, 2, 3, NA, 1, 1, 4, 
    3, 5, 5, 5, 2, 3, 2, 2, 2, 2), P_Rehearsal_Media_Internet = c(4, 
    4, 1, 3, 2, 2, 2, 4, 3, 2, 5, 5, 3, 5, 5, 1, 5, 4, 2, 1), 
    P_Rehearsal_Media_Social_Networks = c(5, 5, 5, 2, 4, 2, 4, 
    2, 3, 4, 5, 5, 5, 5, 5, 2, 2, 4, 2, 2), P_Social_Sharing_How_Often = c(4, 
    5, 4, 4, 4, 3, 3, 3, 3, 5, 4, 5, 5, 5, 5, 3, 4, 4, 5, NA), 
    P_Social_Sharing_With_How_Many_People = c(5, 4, NA, 3, 3, 
    3, 3, 3, 2, 5, 3, 5, 5, 3, 5, 3, 3, 4, 3, NA), PK_Shops_YN = c(0, 
    1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1), 
    PK_Comic = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 
    0, 0, 0, 1, 0), PK_Hotel = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 
    0, 0, 1, 1, 0, 0, 0, 0, 0, 0), PK_Decoration_Maelbeek = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1), 
    PK_Stations_before_after_Maelbeek = c(0, 0.5, 0, 0, 0, 0, 
    0, 0, 0.5, 1, 0, 0, 0.5, 0.5, 0, 0, 0.5, 0, 0.5, 0), PK_TOTAL_PC = c(0, 
    50, 0, 40, 40, 40, 20, 0, 10, 60, 20, 40, 90, 70, 20, 0, 
    30, 20, 70, 40), SI_Attachment_BXL = c(6, 4, 1, 4, 2, 5, 
    1, 6, 5, 4, 2, 6, 6, 7, 1, 3, 6, 4, 5, 4), SI_Pride_BXL = c(1, 
    2, 1, 2, 1, 2, 1, 5, 1, 6, 1, 1, 7, 7, 1, 2, 6, 1, 3, 3), 
    SI_Attachment_Belgium = c(7, 3, 5, 5, 4, 6, 7, 6, 5, 6, 7, 
    7, 7, 7, 5, 6, 7, 6, 4, 2), SI_Pride_Belgium = c(7, 2, 6, 
    4, 2, 6, 4, 5, 1, 5, 1, 6, 7, 7, 5, 7, 7, 6, 2, 2), SI_Attachment_EU = c(6, 
    4, 2, 5, 4, 4, 5, 4, 7, 4, 1, 6, 7, 7, 5, 4, 6, 6, 2, 6), 
    SI_Pride_EU = c(7, 1, 1, 4, 3, 4, 4, 4, 1, 4, 1, 6, 7, 7, 
    4, 3, 6, 6, 2, 4)), .Names = c("Group", "EM_Accuracy_Time_Airport", 
"EM_Accuracy_Place_Airport", "EM_Accuracy_Expl_Airport", "EM_Accuracy_Death_Airport", 
"EM_Accuracy_Time_Metro", "EM_Accuracy_Death_Metro", "EM_Accuracy_PC_Time_Airpot", 
"EM_Accuracy_PC_Place_Airport", "EM_Accuracy_PC_Expl_Airport", 
"EM_Accuracy_PC_Death_Airport", "EM_Accuracy_PC_Time_Metro", 
"EM_Accuracy_PC_Death_Metro", "EM_ACCURACY_PC", "EM_Certainty_Time_Airport", 
"EM_Certainty__Place_Airport", "EM_Certainty__Expl_Airport", 
"EM_Certainty__Death_Airport", "EM_Certainty__Time_Metro", "EM_Certainty__Death_Metro", 
"EM_CERTAINTY", "EM_CONFIDENCE", "FBM_CONFIDENCE", "FBM_Vividness_Time", 
"FBM_Vividness_How", "FBM_Vividness_Where", "FBM_Vividness_WithWhom", 
"FBM_Vividness_WereDoing", "FBM_Vividness_Did_After", "FBM_VIVIDNESS", 
"FBM_Details_NB_T2", "P_Novelty_5", "P_Suprise_emotion", "P_Surprise_Expected", 
"P_Surprise_Unbelievable", "P_Consequence-Importance_5", "P_Emotional_Intensity_4", 
"P_Social_Sharing_6", "P_Media_3", "P_Ruminations", "P_Novelty_Common_rev", 
"P_Novelty_Unusual", "P_Novelty_Special", "P_Novelty_Singular", 
"P_Novelty_Ordinary_rev", "P_Consequence", "P_Importance_self", 
"P_Importance_friends&family", "P_Importance_Belgium", "P_Importance_International", 
"P_Emotional_Intensity_Upset", "P_Emotional_Intensity_Indiferent_rev", 
"P_Emotional_Intensity_Affected", "P_Emotional_Intensity_Shaken", 
"P_Rehearsal_Media_TV", "P_Rehearsal_Media_Internet", "P_Rehearsal_Media_Social_Networks", 
"P_Social_Sharing_How_Often", "P_Social_Sharing_With_How_Many_People", 
"PK_Shops_YN", "PK_Comic", "PK_Hotel", "PK_Decoration_Maelbeek", 
"PK_Stations_before_after_Maelbeek", "PK_TOTAL_PC", "SI_Attachment_BXL", 
"SI_Pride_BXL", "SI_Attachment_Belgium", "SI_Pride_Belgium", 
"SI_Attachment_EU", "SI_Pride_EU"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))
  • Please add a sample of your data with `dput(head(df, n))`. Choose n so that it is sufficient for reproducibility. – NelsonGon Jul 16 '19 at 09:11
  • Googling your error message leads me to this post https://stackoverflow.com/questions/29421475/basic-t-test-grouping-factor-must-have-exactly-2-levels where it suggests doing `t.test(i, df$Group)$p.value`. Can you try that in your `sapply` call? – Ronak Shah Jul 16 '19 at 09:14
  • @NelsonGon: I have added the structure. – AlineC Jul 16 '19 at 09:24
  • @Ronak Shah: using a comma does a different type of test, which compares two variables (the one before the comma with the one after). That works for wide-format data, but I'm using long format with a grouping variable. – AlineC Jul 16 '19 at 09:26
  • What is `variable_1`? – NelsonGon Jul 16 '19 at 09:31
  • Any of the variables after "Group". So, for example, it works fine if I run `t.test(EM_Accuracy_Place_Airport ~ Group, df)`. – AlineC Jul 16 '19 at 09:33
  • Some of the columns have more than two levels (e.g., `EM_Accuracy_Time_Airport = c(3, 3, 0, 1, 1, 2, 2, 1, 1, 3, 3, 2, 2, 2, 1, 3, 1, 3, 1, 1)`). You could consider formulating this problem as a linear regression rather than a simple t-test. – alan ocallaghan Jul 16 '19 at 09:44
  • There could be many reasons why your code breaks. One of them is because in one of the variables you have only `NA`s associated with one level of your grouping variable. Check this: `t.test(df$P_Social_Sharing_6 ~ df$Group)` – AntoniosK Jul 16 '19 at 09:50
  • @Aocall: that shouldn't be an issue, as Group is my IV and all the others are DVs. I want to compare Group 1 with Group 2 on all the other variables. – AlineC Jul 16 '19 at 09:50
  • @AntoniosK! How did I miss that one! Yes, thanks, I've added the missing values and it works! Thank you! – AlineC Jul 16 '19 at 10:00

2 Answers


The error means that there is a problem with at least one of the variables in your dataset.

Here's a process to help you spot problematic variables:

library(tidyverse)

df %>%
  group_by(Group) %>%                   # for each group value
  summarise_all(~sum(!is.na(.))) %>%    # count non NA values for each variable
  gather(var,value,-Group) %>%          # reshape
  spread(Group, value, sep = "_") %>%   # reshape
  filter(Group_2 < 2)                   # get problematic variables

# # A tibble: 5 x 3
#   var                                   Group_1 Group_2
#   <chr>                                   <int>   <int>
# 1 P_Emotional_Intensity_Affected             18       1
# 2 P_Emotional_Intensity_Indiferent_rev       18       1
# 3 P_Social_Sharing_6                         18       0
# 4 P_Social_Sharing_How_Often                 18       1
# 5 P_Social_Sharing_With_How_Many_People      17       1

A count of 0 (no non-missing values in one group) will throw the error about the grouping factor needing exactly two levels.

A count of 1 will throw an error about needing more observations in one of your groups.

After spotting those variables, you have to treat them accordingly (for example, fill in or drop the problematic values), and then your original t.test code should work.
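
If you'd rather run everything in one pass and simply skip any column that can't be tested, here is a minimal sketch using base R's `tryCatch` (the column range `2:71` is taken from the question):

# Welch t-test for every outcome column; return NA instead of erroring
# when a column cannot be tested (e.g. only NAs in one group)
pvals <- sapply(df[, 2:71], function(x) {
  tryCatch(t.test(x ~ df$Group)$p.value, error = function(e) NA_real_)
})

# columns that could not be tested
names(pvals)[is.na(pvals)]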

AntoniosK

So my problem was just missing data in one variable.

However, if you are looking to run multiple t-tests on long-format data, this line of code works:

sapply(df[,2:71], function(i) t.test(i ~ df$Group)$p.value)
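
Since this runs 70 tests at once, you may also want to correct the resulting p-values for multiple comparisons; a minimal sketch with base R's `p.adjust` (the choice of method is up to you):

pvals <- sapply(df[, 2:71], function(i) t.test(i ~ df$Group)$p.value)
p.adjust(pvals, method = "holm")  # or "bonferroni", "BH", etc.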

AlineC