
I am studying Didzis' p-value corrgram with different input-data examples, where a significant p-value (p < 0.05) corresponds to an almost perfect curve fit, which I find strange; see Figs. 1-3.

Fig. 1: Output for the "extreme" input data #1. Fig. 2: Output for the minimum input data #2. Fig. 3: Output for Didzis' input data #3.


Statistical inspection

  • Fig. 1: p-values are very high when r is small,
  • Fig. 2: p-values are very high, but the confidence intervals must be wide; I am not sure whether drawing a graph there is appropriate,
  • Fig. 3: p-values are very low when the curve fit is almost perfect - this observation can be confusing; see the sanity check after this list.
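As a sanity check on the Fig. 3 observation, here is a minimal sketch of my own (made-up data, not Didzis' code): cor.test() tests the null hypothesis that the correlation is zero, so an almost perfect fit should give a very small p-value, while a near-zero correlation on few points gives a large one.

## Sanity-check sketch with made-up data
set.seed(1)
x <- rnorm(100)
cor.test(x, x + rnorm(100, sd = 0.05))$p.value  # |r| near 1 -> p near 0
cor.test(x, rnorm(100))$p.value                 # |r| near 0 -> p large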

Input data test cases

Real-life data example #1 as an "extreme" example; its application output is in Fig. 1

## 1 To make a list of lists
set.seed(24)
A <- 541650
m1 <- matrix(1:A, ncol = 4, nrow = A)  # 1:A is recycled across the 4 columns
str(m1)

a <- 360; b <- 1505; c <- 4
# Pad m1 to a*b*c elements (the extra cells become NA) and reshape to 3D
m2 <- array(`length<-`(m1, a*b*c), dim = c(a, b, c))

# One b-by-b correlation matrix per slice of the third dimension
res <- lapply(seq(dim(m2)[3]), function(i) cor(m2[,,i]))
str(res)

# First eigenvector of each correlation matrix, with NAs replaced by 0
res <- lapply(res, function(x) eigen(replace(x, is.na(x), 0))$vectors[, 1])
str(res)

Minimum example #2; its application output is in Fig. 2

A <- 1505
# Four independent noise vectors of length A. Note that rnorm(n) uses
# length(n) when n is a vector, so each nested call still returns A draws.
res <- list(rnorm(A),
            rnorm(rnorm(A)),
            rnorm(rnorm(rnorm(A))),
            rnorm(rnorm(rnorm(rnorm(A)))))
str(res)

Standard input example #3 is the US judge ratings data (USJudgeRatings) used by Didzis; its output is in Fig. 3

res <- USJudgeRatings[,c(2:3,6,1,7)] 

To make the p-value corrgram

## 2 Didzis https://stackoverflow.com/a/15271627/54964
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  test <- cor.test(x, y)  # Pearson test of H0: correlation = 0
  Signif <- ifelse(round(test$p.value, 3) < 0.001,
                   "p<0.001", paste("p=", round(test$p.value, 3)))
  text(0.5, 0.25, paste("r=", txt))
  text(0.5, 0.75, Signif)
}

panel.smooth <- function(x, y, col = "blue", bg = NA, pch = 18,
                         cex = 0.8, col.smooth = "red", span = 2/3, iter = 3, ...)
{
  points(x, y, pch = pch, col = col, bg = bg, cex = cex)
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))  # lowess smoother through the finite points
    lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
          col = col.smooth, ...)
}

panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y / max(y)  # scale bar heights to [0, 1]
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

data <- res
str(data)

pairs(data,
      lower.panel = panel.smooth, upper.panel = panel.cor, diag.panel = panel.hist)

About the significance upper bound

The source says that a study which is not statistically significant with 15K points may become significant with 2-3M points. My observation is that mine becomes significant at 6-7M points with my data sample and study, data `541650 541650 6925867`. So I think there is, in theory, no problem with plotting such big data sets in Didzis' p-value corrgram. His algorithms possibly make some simplifications, or cause clustering of the points, such that many panels look as if they had an increasing diagonal or a y = 0 line. A sketch of the sample-size effect follows.
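To illustrate the sample-size effect, here is a minimal sketch of my own (the sizes and the weak correlation rho = 0.01 are made up for illustration, not my 6-7M case): hold the population correlation fixed and weak, and the p-value collapses as n grows.

## Sketch: fixed weak correlation, growing n
set.seed(1)
p_at_n <- function(n, rho = 0.01) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # population correlation ~ rho
  cor.test(x, y)$p.value
}
sapply(c(1.5e4, 1.5e5, 1.5e6, 7e6), p_at_n)  # shrinks toward 0 as n grows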

OS: Debian 8.5
R: 3.3.1

Comments

  • As a side note, you mention ">=1500" points - just be careful depending on how much greater than this you're talking, because p-values go to zero as the data size rises, making them no longer a useful (such as they ever were) measure of significance. – Jeff Nov 05 '16 at 17:13
  • @JeffL. Yes, my example shows that. Can you find any source about the upper bound for the data-set size? I have had difficulty finding one. – Léo Léopold Hertz 준영 Nov 05 '16 at 17:15
  • You would probably get a more thorough explanation on Cross Validated, but one of my colleagues wrote up a bit of a proof that might be helpful, found here in bookdown (with proof in the appendix): https://bookdown.org/SarahArmstrong/spark-social-science-manual/econometrics-and-large-scale-data.html – Jeff Nov 05 '16 at 17:22
  • @JeffL. It actually does not say anything about absolute upper bounds; I think each case has its own. It only says that a 15K sample may become significant at 1-2M sample sizes. That is also my observation - I get statistically significant values with 6-7M points but cannot plot them yet, data `541650 541650 6925867`. – Léo Léopold Hertz 준영 Nov 05 '16 at 17:29
