Collecting p-values within pipe (dplyr)

Question

how are you?

So, I have a dataset that looks like this:

    dirtax_trev indtax_trev lag2_majority pub_exp 
    <dbl>       <dbl>       <dbl>         <dbl>    
    0.1542      0.5186      0             9754
    0.1603      0.4935      0             9260      
    0.1511      0.5222      1             8926     
    0.2016      0.5501      0             9682
    0.6555      0.2862      1             10447

I'm having the following problem. I want to run a series of t-tests across a dummy variable (lag2_majority), collect the p-values of these tests, and assign them to a vector, using a pipe.

All the variables that I want to run these t-tests on are selected below. I then omit NA values for my grouping variable (lag2_majority) and try to summarize with this code:

test <- g %>%
 select(dirtax_trev, indtax_trev, gdpc_ppp, pub_exp, 
 SOC_tot, balance, fdi, debt, polity2, chga_demo, b_gov, social_dem,
 iaep_ufs, gini, pov4, informal, lab, al_ethnic, al_language, al_religion,
 lag_left, lag2_left, majority, lag2_majority, left, system, b_system,
 execrlc, allhouse, numvote, legelec, exelec, pr) %>%
 na.omit(lag2_majority) %>%
 summarise_all(funs(t.test(.[lag2_majority], .[lag2_majority == 1])$p.value))

However, once I run this, the response I get is: Error in summarise_impl(.data, dots): Evaluation error: data are essentially constant., which is confusing since there is a clear difference in means between the two levels of the dummy variable. The same error appears when I replace the last line of the code above with: summarise_all(funs(t.test(.~lag2_majority)$p.value)).

Alternatively, since all I want to do is, for instance, t.test(dirtax_trev~lag2_majority, g)$p.value, I thought I could write a loop like this: for (i in vars){ t.test(i~lag2_majority, g)$p.value }, where vars is an object that contains all the variables selected in the code above.

But once again I get an error message, specifically this one: Error in model.frame.default(formula = i ~ lag2_majority, data = g): variable lengths differ (found for 'lag2_majority')

What am I doing wrong?

Best Regards!

`t.test(.[lag2_majority], .[lag2_majority == 1])$p.value)` is not a function — moodymudskipper, Nov 24 '17 at 00:28

Kevin Arseneau · Accepted Answer · 2017-11-25T02:00:29.537

Your question is not reproducible; please read this for how you could improve its quality.

My answer has been generalised to be reproducible because I don't have your data and cannot therefore adapt your code directly.

Using a tidy approach I'll produce a data frame of p-values for each variable.

library(tidyr)
library(dplyr)
library(purrr)

mtcars %>%
  select_if(is.numeric) %>%
  map(t.test) %>%
  lapply(`[[`, "p.value") %>%
  as_tibble %>%
  gather(key, p.value)

# # A tibble: 11 x 2
#     key      p.value
#   <chr>        <dbl>
# 1   mpg 1.526151e-18
# 2   cyl 5.048147e-19
# 3  disp 9.189065e-12
# 4    hp 2.794134e-13
# 5  drat 1.377586e-27
# 6    wt 2.257406e-18
# 7  qsec 7.790282e-33
# 8    vs 2.776961e-05
# 9    am 6.632258e-05
# 10 gear 1.066949e-23
# 11 carb 4.590930e-11
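
Note that `map(t.test)` above runs a one-sample t-test on each column (mean against zero), not a comparison between two groups. A minimal sketch of a grouped variant follows, assuming the 0/1 column `am` in `mtcars` stands in for the dummy variable; the names `dummy`, `group` and `p_values` are only illustrative.

```r
library(dplyr)   # provides %>%
library(purrr)
library(tibble)

# Assumption: `am` (0/1) plays the role of the dummy variable.
dummy <- "am"
group <- mtcars[[dummy]]

p_values <- mtcars[setdiff(names(mtcars), dummy)] %>%
  # Welch two-sample t-test of each column, group 0 against group 1
  map_dbl(function(x) t.test(x[group == 0], x[group == 1])$p.value) %>%
  # named numeric vector -> two-column tibble
  enframe(name = "key", value = "p.value")

p_values
```

Each row of `p_values` should then match what `t.test(<column> ~ am, data = mtcars)$p.value` returns for that column.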

Update

Thank you for updating your question. Note that the value you included in your earlier comment is likely from your original dataset and is still not reproducible here. When I run the code on the sample data, this is the output.

t.test(dirtax_trev ~ lag2_majority, g)$p.value
# [1] 0.5272474

Please frame your questions in a way that anyone can see the problem in the same way that you do.

To build up the formula you are running through the t.test, I have taken a slightly different approach.

library(magrittr)
library(dplyr)
library(purrr)

g <- tribble(
  ~dirtax_trev, ~indtax_trev, ~lag2_majority, ~pub_exp,
  0.1542, 0.5186, 0, 9754,
  0.1603, 0.4935, 0, 9260,
  0.1511, 0.5222, 1, 8926,
  0.2016, 0.5501, 0, 9682,
  0.6555, 0.2862, 1, 10447
)

dummy <- "lag2_majority"

colnames(g) %>%
  .[. != dummy] %>%          # vector of variables to send through t.test
  paste(., "~", dummy) %>%   # build formula as character
  map(as.formula) %>%        # convert to formula class
  map(t.test, data = g) %$%  # run t.test for each, note the special operator
  tibble(
    data.name = unlist(lapply(., `[[`, "data.name")),
    p.value = unlist(lapply(., `[[`, "p.value"))
  )

# # A tibble: 3 x 2
#                      data.name   p.value
#                          <chr>     <dbl>
# 1 dirtax_trev by lag2_majority 0.5272474
# 2 indtax_trev by lag2_majority 0.5021217
# 3     pub_exp by lag2_majority 0.8998690

If you prefer to drop the dummy variable name from data.name, you could modify its assignment in the tibble with:

data.name = unlist(strsplit(unlist(lapply(., `[[`, "data.name")), paste(" by", dummy)))

N.B. I used the special %$% from magrittr to expose the names from the list of tests to build a data frame. I'm sure there are other ways that may be more elegant; however, I find this form quite easy to reason about.
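
As for the `for` loop in the question: assuming `vars` is a character vector of column names, `i` is a single string, so `i ~ lag2_majority` pits a length-one character vector against the whole dummy column, which would be consistent with the `model.frame.default` error. Here is a base R sketch that builds each formula with `reformulate()` instead; `vars` and `p_values` are illustrative names, and the data frame is just the sample from the question.

```r
g <- data.frame(
  dirtax_trev   = c(0.1542, 0.1603, 0.1511, 0.2016, 0.6555),
  indtax_trev   = c(0.5186, 0.4935, 0.5222, 0.5501, 0.2862),
  lag2_majority = c(0, 0, 1, 0, 1),
  pub_exp       = c(9754, 9260, 8926, 9682, 10447)
)

dummy <- "lag2_majority"
vars  <- setdiff(names(g), dummy)   # every column except the dummy

# reformulate(dummy, response = v) builds the formula `v ~ lag2_majority`
p_values <- vapply(
  vars,
  function(v) t.test(reformulate(dummy, response = v), data = g)$p.value,
  numeric(1)
)

p_values   # named numeric vector, one p-value per variable
```

The result is a plain named vector, which is close to what the question asked for; wrap it in `tibble::enframe()` if you prefer a data frame.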

Kevin, thank you for your answer. I tried to re-frame my question in order to clarify it a bit; I hope that helps. Sorry for the misunderstanding. Actually, I ran your code and it did collect some p-values, but not from the "correct" t-tests, since I don't know which means it compared. The first test `t.test(dirtax_trev~lag2_majority, g)$p.value` gives 0.00016, whilst the code you've indicated yields a p-value of 3.199368e-36 — ELazzari, Nov 24 '17 at 16:08
Kevin, thank you so much! If I run your code it works perfectly. But for some reason, if I replace the dataset you've created with `tribble` for the one I already have the response I get is: `Error in if (stderr < 10*.Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant"):missing value where TRUE/FALSE needed In addition`. Anyway, there is something wrong with my dataset then. Thanks for the effort! — ELazzari, Nov 27 '17 at 21:06
@ELazzari, there are one or more issues in your data. A number of questions ([here](https://stackoverflow.com/questions/9480855/paired-t-test-crashes-apply-loop-edited), [here](https://stackoverflow.com/questions/45186571/how-to-get-na-values-instead-of-a-data-are-essentially-constant-error-in-t-tes), and [here](https://stats.stackexchange.com/questions/112972/t-test-returns-an-error-data-are-essentially-constant)) may guide you to resolving them. — Kevin Arseneau, Nov 27 '17 at 21:29
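
One way to keep a single problematic column from aborting the whole run is to wrap the test in `tryCatch()` and return `NA` on failure. This is only a sketch: `safe_p` is a hypothetical helper, the toy vectors are made up for illustration, and it hides the error rather than fixing the underlying data.

```r
# Hypothetical helper: p-value of a two-group t-test, or NA if t.test() errors
safe_p <- function(x, group) {
  tryCatch(
    t.test(x[group == 0], x[group == 1])$p.value,
    error = function(e) NA_real_
  )
}

# Toy data, invented for illustration only
group    <- c(0, 0, 0, 1, 1, 1)
varying  <- c(1.2, 0.8, 1.1, 2.3, 2.9, 2.4)
constant <- rep(5, 6)

safe_p(varying, group)    # a numeric p-value
safe_p(constant, group)   # NA: t.test() stops with "data are essentially constant"
```

Columns that come back as `NA` are the ones worth inspecting by hand.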
