It looks like you're comparing apples and oranges. The single t-test of the differences gives you a t-statistic which, if it exceeds the critical value in absolute terms, indicates that the mean difference between group1 and group2 is significantly different from zero. Your bootstrapping code does the same thing, but for 10,000 bootstrapped samples of the differences, giving you an estimate of how much the t-statistic varies over different random samples from the population of differences. If you take the mean of these bootstrapped t-statistics (mean(tstat.values)) you'll see it's about the same as the single t-statistic from the full sample of differences.
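To make that concrete, here is a rough sketch in R of what I understand your bootstrap to be doing. The names group1, group2, Repnumber and tstat.values are just placeholders to match your question, and the data are made up since I can't see yours:

```r
set.seed(1)
group1 <- rnorm(30, mean = 5)        # made-up paired data
group2 <- rnorm(30, mean = 5.5)
diffs  <- group1 - group2            # the sample of differences
Repnumber <- 10000

# t-statistic from the full sample of differences
t.full <- t.test(diffs)$statistic

# t-statistic from each bootstrap resample of the differences
tstat.values <- replicate(Repnumber, {
  resample <- sample(diffs, replace = TRUE)
  t.test(resample)$statistic
})

t.full              # the single t-statistic
mean(tstat.values)  # should be roughly the same value
```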
sum(tstat.values<=-1.96)/Repnumber gives you the proportion of bootstrapped t-statistics less than -1.96. This is an estimate of the proportion of the time that you would get a t-statistic less than -1.96 in repeated random samples from your population. I think this is essentially an estimate of the power of your test to detect a difference of a given size between group1 and group2 at a given sample size and significance level, though I'm not sure how robust such a power analysis is.
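Continuing the sketch above, and with the caveat that this is only an approximation, that proportion and a conventional power calculation for comparison would look something like this:

```r
# proportion of bootstrap t-statistics at or below the critical value
power.est <- sum(tstat.values <= -1.96) / Repnumber
power.est

# rough analytic comparison using base R's power.t.test; this won't match
# exactly, because the bootstrap proportion only counts rejections in the
# negative tail and uses the normal critical value rather than the t one
power.t.test(n = length(diffs),
             delta = mean(diffs),
             sd = sd(diffs),
             sig.level = 0.05,
             type = "one.sample")
```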
In terms of properly bootstrapping the t-test, I think what you actually need is some kind of permutation test that checks whether your actual data is extreme compared with what you get by repeatedly shuffling the labels on your data and running a t-test on each shuffled dataset. You might want to ask a question on CrossValidated to get advice on how to do this properly for your data. These CrossValidated answers might help: here, here, and here.
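For what it's worth, here is a sketch of one way to do the permutation version for paired data: swapping the group1/group2 labels within a pair is equivalent to flipping the sign of that pair's difference, so you compare the observed t-statistic with t-statistics from random sign flips. Treat this as an illustration rather than a recipe; the linked answers cover the details better.

```r
n.perm <- 10000
t.obs  <- t.test(diffs)$statistic   # observed t-statistic on the real differences

# t-statistics after randomly flipping the sign of each difference
t.perm <- replicate(n.perm, {
  signs <- sample(c(-1, 1), length(diffs), replace = TRUE)
  t.test(signs * diffs)$statistic
})

# two-sided permutation p-value (with the usual +1 correction)
(sum(abs(t.perm) >= abs(t.obs)) + 1) / (n.perm + 1)
```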