I am a bit confused with conducting a test of proportions in R. Maybe this is super obvious, but prop.test behaves differently than I expected, and I would like to know why and what to use instead. The application is on a dataset of protest events.

I constructed the following dataset:

[Image: table of the dataset, not reproduced here; the dput output below contains the same data]

where names gives the type of event for which the shares are calculated. For example, the first row refers to events organized after elections (aft_elect_prt). Within each category I count the events that have (past_pm1) or have not (past_pm0) been linked to a group of a former prime minister. total is the number of events of that type in the dataset. share0 is past_pm0/total, and share1 is past_pm1/total.

For each row, I want to test the null hypothesis that the two shares are equal, i.e. that there is no statistically significant difference between them.

Reading the documentation of prop.test I set it up as:

prop.test(x = as.numeric(subseted$past_pm1),
          n = subseted$total,
          p = subseted$share0,
          alternative = "two.sided",
          conf.level = 0.95)

However, this obviously does not test what I want. It also returns only one p-value, whereas I would like to extract a p-value for each row. What function/test should I use instead?

This is the dput code for the dataset:

structure(list(names = c("aft_elect_prt", "ANSM", "bef_elect_prt", 
"big_event", "conf_viol", "coorg", "demo_petition", "economic", 
"NSM", "political"), past_pm0 = c(49.66101, 78.54659, 65.57226, 
49.67205, 39.641924, 69.52704, 286.8565, 68.53114, 100.00488, 
117.97347), past_pm1 = c(33.796, 14.30855, 34.40608, 31.14065, 
9.017051, 30.64896, 120.4515, 20.86095, 19.00836, 71.24065), 
    total = c(83.4570157825947, 92.8551414906979, 99.9783371835947, 
    80.8127028793097, 48.6589741557837, 100.176002234221, 407.307988807559, 
    89.3920872062445, 119.013234868646, 189.21411934495), share0 = c(0.595048954654295, 
    0.8459045857775, 0.655864678761227, 0.614656461548911, 0.814688856223823, 
    0.69404885850245, 0.704274180429913, 0.766635416419863, 0.84028368870382, 
    0.623491895892433), share1 = c(0.404950976057405, 0.154095398168484, 
    0.344135349408928, 0.385343502821669, 0.185311161125829, 
    0.305951119194593, 0.295725847049147, 0.233364614832964, 
    0.159716354412006, 0.376508107569518)), row.names = c(NA, 
-10L), class = "data.frame")
Nimantha
Erdne Htábrob
  • I see that the numbers in the image have thousands separators, and the data you have read in is off by a factor of one million. This might lead to problems. See here for ways to solve this: https://stackoverflow.com/questions/1523126/how-to-read-data-when-some-numbers-contain-commas-as-thousand-separator – AkselA May 23 '19 at 12:46
  • And with that correction all these proportions are trivially significant. I mean way, way, way significant. Consider, say, an 8000/9000 split: already p≈0. `prop.test(cbind(8000, 9000))` – AkselA May 23 '19 at 13:26
  • 1
    @AskelA raises an important point. Perhaps a more relevant question to ask would be, are there are significant deviations in the value `share0`? This requires assuming a distribution for those values. Going with a normal distribution, a simple-minded test to ask if any values differ from the mean could be `with(subseted, t.test(share0, mu = mean(share0)))`. (This answer is no, p = 0.78) – David O May 23 '19 at 14:07
  • @DavidO makes a very good point - I did not go into it in my answer because the functional programming piece of it is still useful to address. You can change the null hypothesis for each individual proportion test with the argument `p` in `prop.test()`. By default it is 0.5. – qdread May 23 '19 at 14:42
  • I also just noticed this question is probably a duplicate of https://stackoverflow.com/questions/49222353/how-to-use-purrrs-map-function-to-perform-row-wise-prop-tests-and-add-results-t – qdread May 23 '19 at 19:21

2 Answers

The prop.test function is not vectorized. It conducts a single test. You need to explicitly map the function to each row of your data frame. You can use base R functions for that, or tidyverse functions. Here is how you would do it in tidyverse, using purrr::pmap to apply a function to each row of a data frame.

library(purrr)
prop_test_list <- pmap(subseted, function(past_pm1, total, ...) prop.test(x = past_pm1, n = total, alternative = 'two.sided', conf.level = 0.95))

That will return a list of the test objects, with as many elements as you have rows in your data frame.

To extract output from the list in data frame form, you can use purrr::map_dfr. Here is an example with a few summary statistics:

map_dfr(prop_test_list, ~ data.frame(p = .x$p.value, estimate = .x$estimate, confint_min = .x$conf.int[1], confint_max = .x$conf.int[2]))

output:

   p            estimate   confint_min confint_max
1  1.037002e-01 0.4049510  0.30058839   0.5181435
2  5.288024e-11 0.1540954  0.09038891   0.2472255
3  2.553365e-03 0.3441353  0.25382739   0.4465844
4  5.115352e-02 0.3853435  0.28114139   0.5005436
5  2.167205e-05 0.1853112  0.09330970   0.3274424
6  1.540307e-04 0.3059511  0.21985913   0.4071514
7  2.490965e-16 0.2957258  0.25231710   0.3430569
8  7.967215e-07 0.2333646  0.15312169   0.3369412
9  2.252910e-13 0.1597164  0.10130585   0.2407265
10 8.851678e-04 0.3765081  0.30807997   0.4500369
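
For readers who would rather avoid the tidyverse, the same row-wise mapping can be sketched in base R with Map() and vapply(). This is only an illustration under stated assumptions: the three rows are copied from the question's dput output so the snippet runs on its own, and the names base_tests and base_results are made up for this example. Like the pmap call above, it leaves the null proportion at prop.test's default of 0.5.

```r
# First three rows of the question's data, repeated so the sketch is self-contained
subseted <- data.frame(
  names    = c("aft_elect_prt", "ANSM", "bef_elect_prt"),
  past_pm1 = c(33.796, 14.30855, 34.40608),
  total    = c(83.4570157825947, 92.8551414906979, 99.9783371835947)
)

# Map() walks over the two columns in parallel and returns a list of htest objects
base_tests <- Map(
  function(x, n) prop.test(x = x, n = n, alternative = "two.sided", conf.level = 0.95),
  subseted$past_pm1,
  subseted$total
)

# Collect the pieces of interest into one data frame, one row per test
base_results <- data.frame(
  names       = subseted$names,
  p           = vapply(base_tests, function(tst) tst$p.value, numeric(1)),
  estimate    = vapply(base_tests, function(tst) unname(tst$estimate), numeric(1)),
  confint_min = vapply(base_tests, function(tst) tst$conf.int[1], numeric(1)),
  confint_max = vapply(base_tests, function(tst) tst$conf.int[2], numeric(1))
)
base_results
```

vapply() is used instead of sapply() here so that each extracted component is checked to be a single number.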
qdread

The base function Vectorize can vectorize a function that doesn't accept vectors. Pay attention to the SIMPLIFY argument. With its default value of TRUE, the result is simplified to a vector, array or matrix if possible. Here, it makes more sense to keep it as a list.

vprop.test <- Vectorize(prop.test, SIMPLIFY = FALSE)
ans <- with(subseted, vprop.test(x = past_pm1, n = total))

To extract just the p-values (which would all be approximately 0 once the scaling issue noted in the comments is corrected) and attach them to the original data frame:

subseted$p.value <- sapply(ans, "[[", "p.value")
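
To make the SIMPLIFY remark more concrete, here is a small sketch comparing the two settings. It assumes only base R; the three rows are copied from the question's data, and the names vprop_list, vprop_simplified, as_list and as_matrix are hypothetical, chosen for this illustration. vectorize.args is restricted to x and n so that only those two arguments are iterated over.

```r
# First three rows of the question's data, repeated so the sketch is self-contained
subseted <- data.frame(
  past_pm1 = c(33.796, 14.30855, 34.40608),
  total    = c(83.4570157825947, 92.8551414906979, 99.9783371835947)
)

# SIMPLIFY = FALSE keeps one htest object per row in a plain list
vprop_list <- Vectorize(prop.test, vectorize.args = c("x", "n"), SIMPLIFY = FALSE)
as_list <- vprop_list(x = subseted$past_pm1, n = subseted$total)
class(as_list[[1]])   # "htest"

# The default SIMPLIFY = TRUE tries to pack the results into an array instead
vprop_simplified <- Vectorize(prop.test, vectorize.args = c("x", "n"))
as_matrix <- vprop_simplified(x = subseted$past_pm1, n = subseted$total)
dim(as_matrix)        # test components in rows, one column per test

# The p-values are the same either way; only the container differs
p_from_list   <- unname(sapply(as_list, "[[", "p.value"))
p_from_matrix <- unname(unlist(as_matrix["p.value", ]))
```

The list form keeps each result as a regular htest object, so printing and `$` access work as usual, which is presumably why it is the more convenient choice here.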
David O