
I can do it for the two-sample t test, but not for the median test, the Wilcoxon test, or the Hodges-Lehmann test.

data_2000 <- c(500,450,600,700,550,551,552)

data_2019 <- c(560,460,620,720,540,600,750)

mean(data_2000)

mean(data_2019)

mean(data_2019) - mean(data_2000)

combined_data <- c(data_2000, data_2019)

set.seed(123)

null_dist <- c()
for (i in 1:100000) {
  shuffled_data <- sample(combined_data)  
  shuffled_2000 <- shuffled_data[1:7] 
  shuffled_2019 <- shuffled_data[8:14]  
  null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}

(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))
Z B
  • Could you provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Provide a sample of data, the code you are trying, and the packages you are using so that we can reproduce your issue and provide a better answer. – Ben Norris Aug 08 '20 at 13:42
  • I can only provide the code for the two-sample test, since I could not code the others – Z B Aug 08 '20 at 13:46

2 Answers


I think this is what you're trying to do; I altered your code as little as possible. There are packages like infer that will do this for you (a sketch with infer follows the results below), and the for loop is not the most efficient approach, but it's plenty good enough and may help you learn. Since we're looping anyway, I did the mean and median at the same time, because all other parts of the code are identical. ifelse is a nice, easy way to make 1s and 0s to sum.


data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)

delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)

combined_data <- c(data_2000, data_2019)

trials <- 100000

set.seed(123)

mean_diff <- c()
median_diff <- c()

for (i in 1:trials) {
   shuffled_data <- sample(combined_data)  
   shuffled_2000 <- shuffled_data[1:7] 
   shuffled_2019 <- shuffled_data[8:14]  
   mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
   median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}

p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials

p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
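
For reference, here is roughly what the same two-sided permutation test could look like with the infer package mentioned above. This is only a sketch, not a drop-in replacement: it assumes a reasonably recent version of infer, R >= 4.1 for the native pipe, and the data frame df with columns year and value is just an illustrative layout, not anything from the original code.

library(infer)

# Put both samples into one long data frame (illustrative names)
df <- data.frame(
  year  = rep(c("2000", "2019"), each = 7),
  value = c(data_2000, data_2019)
)

# Build the permutation (null) distribution of the difference in means
perm_dist <- df |>
  specify(value ~ year) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "diff in means", order = c("2019", "2000"))

# Two-sided permutation p-value against the observed difference in means
get_p_value(perm_dist, obs_stat = delta_mean, direction = "two-sided")

Swapping in stat = "diff in medians" gives the median version the same way.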

Following up on your question about HL test. Quoting Wikipedia

The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.

You could run it on your data with the following code...

Do NOT run it 100,000 times; the answer is the same every time because you're already making all 49 possible pairings.

hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
#> [1] 49
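
The same shuffle-the-labels idea extends to the Wilcoxon case: keep the loop, but record the rank-sum statistic instead of a difference. Below is a rough sketch with illustrative names (obs_W, W_perm, p_wilcox); it reuses combined_data and trials from above and calls wilcox.test only to pull out the W statistic each time.

# Observed Wilcoxon rank-sum (Mann-Whitney) statistic for 2019 vs 2000
obs_W <- wilcox.test(data_2019, data_2000, exact = FALSE)$statistic

# Same label-shuffling loop as above, but keep W instead of a mean/median difference
W_perm <- numeric(trials)
for (i in 1:trials) {
   shuffled_data <- sample(combined_data)
   W_perm[i] <- wilcox.test(shuffled_data[8:14], shuffled_data[1:7],
                            exact = FALSE)$statistic
}

# W runs from 0 to 7 * 7 = 49 and is centered at 24.5 under the null,
# so measure "extreme" as distance from that center for a two-sided p-value
center <- 7 * 7 / 2
p_wilcox <- mean(abs(W_perm - center) >= abs(obs_W - center))
p_wilcox

Calling wilcox.test 100,000 times is slow, so you may want a smaller trials here, or compute the rank sum of one group directly with sum(rank(shuffled_data)[8:14]) inside the loop.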
Chuck P
  • I am very grateful to you, Chuck!! Thank you very much. I have already found code for the median on the internet, and it was exactly like what you have written and like my code, just replacing the mean with the median. Now the question is how to do it with Hodges-Lehmann for two samples and one sample, and especially Wilcoxon??? Does the infer package achieve that? Is it possible to use the same idea for the latter tests?? – Z B Aug 08 '20 at 16:12
  • I have checked out the infer package; it seems poor. I think the best way to construct randomized versions of tests is by using the for loop – Z B Aug 08 '20 at 16:24
  • Using the permutation method you don't need to run all the different named tests you mention. See for example http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html. I'll edit my answer to show how to run the H-L two sample one time. But **do not run** it 100,000 times because each answer will be identical since it already permutes the data. – Chuck P Aug 09 '20 at 14:33

You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.

wilcox.test(data_2019, data_2000, exact = FALSE)
    Wilcoxon rank sum test with continuity correction

data:  data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0

I'll update this when I figure out how to do the other tests.

Ben Norris
  • I don't think he was looking for the precise test but a permutation equivalent. – Chuck P Aug 08 '20 at 16:00
  • Thank you, Ben!!! But my goal is to get the p-values of the tests I mentioned in my question, not by applying them directly and taking their p-values, but rather by applying randomized versions of them, since their assumptions are violated (the samples are small) – Z B Aug 08 '20 at 16:22
  • the Wilcoxon is not a median test, though – rawr Aug 08 '20 at 19:16