EDIT: now with reproducible code/data.

I am trying to run chi-squared tests on multiple variables in my dataframe.

Using the npk dataset:

A single variable, N, produces the proper result:

npk %>%
  group_by(yield, N) %>%
  select(yield, N) %>% 
  table() %>% 
  print() %>% 
  chisq.test()

As you can see, the output of table() is in a form that chisq.test() can use.

        N
  yield  0 1
    44.2 1 0
    45.5 1 0
    46.8 1 0
    48.8 1 1
    49.5 1 0
    49.8 0 1
    51.5 1 0
    52   0 1
    53.2 1 0
    55   1 0
    55.5 1 0
    55.8 0 1
    56   2 0
    57   0 1
    57.2 0 1
    58.5 0 1
    59   0 1
    59.8 0 1
    62   0 1
    62.8 1 1
    69.5 0 1

    Pearson's Chi-squared test

  data:  .
  X-squared = 20, df = 20, p-value = 0.4579

When I try to run multiple tests in a loop, something about how the variable is referenced changes the output of table(), and the chi-squared test cannot run.

Create the list that the loop runs through:

test_ordinal_variables <- noquote(names(npk[2:4]))
test_ordinal_variables

The loop that triggers the error (1:1 for clarity; the error repeats if you use 1:3):

for (i in 1:1){
  npk %>%
    group_by(yield, test_ordinal_variables[i]) %>%
    select(yield, test_ordinal_variables[i]) %>%
    table() %>% 
    print() %>% 
    chisq.test()
}

The output, showing a table that chisq.test() cannot interpret:

Adding missing grouping variables: `test_ordinal_variables[i]`
, , N = 0

                         yield
test_ordinal_variables[i] 44.2 45.5 46.8 48.8 49.5 49.8 51.5 52 53.2 55 55.5 55.8 56 57 57.2 58.5 59 59.8 62
                        N    1    1    1    1    1    0    1  0    1  1    1    0  2  0    0    0  0    0  0
                         yield
test_ordinal_variables[i] 62.8 69.5
                        N    1    0

, , N = 1

                         yield
test_ordinal_variables[i] 44.2 45.5 46.8 48.8 49.5 49.8 51.5 52 53.2 55 55.5 55.8 56 57 57.2 58.5 59 59.8 62
                        N    0    0    0    1    0    1    0  1    0  0    0    1  0  1    1    1  1    1  1
                         yield
test_ordinal_variables[i] 62.8 69.5
                        N    1    1

For some reason, test_ordinal_variables[i] does not evaluate the way I would expect inside the loop. The message says it is "Adding missing grouping variables"; if it simply evaluated the expression rather than adding a variable, I think it would work.

This evaluates on its own as I would expect.

> test_ordinal_variables[1]
[1] N  

So why won't it do the same when it is in the loop?

bar17
  • You are missing closing parentheses of `group_by` and `select`. – Parfait Sep 10 '17 at 03:41
  • Thank you, this helped me get to the current state of the above question, which is now totally different. – bar17 Sep 10 '17 at 22:59
  • Please provide data (several rows with columns) so we can reproduce your issue. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) suggesting `dput()` or random data. And make sure posted data is enough to reproduce issues. – Parfait Sep 11 '17 at 14:45
  • I was able to reproduce the problem with the 'npk' dataset. The question is now edited to reflect this. Thank you for the help with the question and on how to ask the question. – bar17 Sep 11 '17 at 19:09

1 Answer

Since you are passing a dynamic, quoted variable name into a dplyr chain, consider the underscore counterparts group_by_() and select_(), which accept column names as strings. And since yield is not being passed dynamically, wrap it in as.symbol() so that it is processed as a column name.
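To make the symptom concrete, here is a minimal sketch of what the plain `group_by()` appears to do with an indexed character vector. The names `vars`, `i`, and `grouped` are introduced here for illustration only and are not from the question; the sketch assumes dplyr is installed and uses a plain character vector instead of `noquote()`.

```r
library(dplyr)

vars <- names(npk)[2:4]   # the column names "N", "P", "K" as plain strings
i <- 1

# group_by() treats an unnamed expression the way mutate() would: it evaluates
# vars[i] to the string "N" and stores that value in a new column labelled
# `vars[i]`, rather than grouping by the existing column N.
grouped <- npk %>% group_by(yield, vars[i])

names(grouped)                 # the original columns plus the new `vars[i]` column
unique(grouped[["vars[i]"]])   # the constant string "N", not the 0/1 factor levels
```

Because that extra grouping column is carried along by `select()` (hence the "Adding missing grouping variables" message), `table()` receives three columns and builds a three-way table. The underscore variants in the loop below take the column name as a string instead.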

for (i in names(npk[2:4])){      
    npk %>%
      group_by_(as.symbol("yield"), i) %>%
      select_(as.symbol("yield"), i) %>%
      table() %>% 
      print() %>% 
      chisq.test() %>% 
      print()    
}

Output

      N
yield  0 1
  44.2 1 0
  45.5 1 0
  46.8 1 0
  48.8 1 1
  49.5 1 0
  49.8 0 1
  51.5 1 0
  52   0 1
  53.2 1 0
  55   1 0
  55.5 1 0
  55.8 0 1
  56   2 0
  57   0 1
  57.2 0 1
  58.5 0 1
  59   0 1
  59.8 0 1
  62   0 1
  62.8 1 1
  69.5 0 1

    Pearson's Chi-squared test

data:  .
X-squared = 20, df = 20, p-value = 0.4579

      P
yield  0 1
  44.2 0 1
  45.5 1 0
  46.8 1 0
  48.8 0 2
  49.5 0 1
  49.8 1 0
  51.5 1 0
  52   0 1
  53.2 0 1
  55   1 0
  55.5 1 0
  55.8 0 1
  56   1 1
  57   1 0
  57.2 1 0
  58.5 0 1
  59   0 1
  59.8 1 0
  62   1 0
  62.8 0 2
  69.5 1 0

    Pearson's Chi-squared test

data:  .
X-squared = 22, df = 20, p-value = 0.3405

      K
yield  0 1
  44.2 1 0
  45.5 0 1
  46.8 1 0
  48.8 0 2
  49.5 0 1
  49.8 0 1
  51.5 1 0
  52   1 0
  53.2 0 1
  55   0 1
  55.5 0 1
  55.8 0 1
  56   2 0
  57   0 1
  57.2 0 1
  58.5 0 1
  59   1 0
  59.8 1 0
  62   1 0
  62.8 2 0
  69.5 1 0

    Pearson's Chi-squared test

data:  .
X-squared = 24, df = 20, p-value = 0.2424

Warning messages:
1: In chisq.test(.) : Chi-squared approximation may be incorrect
2: In chisq.test(.) : Chi-squared approximation may be incorrect
3: In chisq.test(.) : Chi-squared approximation may be incorrect
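In current dplyr releases the underscore verbs are deprecated, so the same loop can be sketched without them. The version below is one possible alternative, not the approach used above: it assumes a dplyr version that exports `all_of()` (1.0.0 or later), drops `group_by()` since `table()` does not need it, and introduces the names `vars` and `results` for illustration.

```r
library(dplyr)

# Column names to cross-tabulate against yield, as a plain character vector.
vars <- names(npk)[2:4]

# One chisq.test() result per column, collected in a list named by column.
results <- lapply(setNames(vars, vars), function(v) {
  npk %>%
    select(yield, all_of(v)) %>%   # all_of() selects the column whose name is stored in v
    table() %>%
    chisq.test()
})

# Extract the p-value from each test result.
sapply(results, function(r) r$p.value)
```

Storing the results in a named list keeps each test object available for later inspection, for example `results$N$expected`. The "Chi-squared approximation may be incorrect" warnings still appear with this version, since the tables themselves are unchanged.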
Parfait
  • Thank you! If I am reading [this Advanced R website](http://adv-r.had.co.nz/Computing-on-the-language.html) correctly, I was not using [referentially transparent](http://adv-r.had.co.nz/Computing-on-the-language.html#nse-downsides) versions of 'group_by()' and 'select()', so they were evaluating non-transparently. The 'group_by_()' and 'select_()' versions fix this, but then I need to force the variable 'yield' to evaluate non-transparently with the 'as.symbol()' function. Also, moving the 'names(npk[2:4])' to the top makes it easier to read, thanks. :) – bar17 Sep 14 '17 at 00:56