I am studying Didzis' p-value corrgram with different input data examples. With his data, a very low p-value (p < 0.05) corresponds to an almost perfect curve fit, which seems strange to me; see Fig. 1-3.
Fig. 1: Output with the "extreme" input data #1. Fig. 2: Output with the minimum input data #2. Fig. 3: Output with Didzis' input data #3.
Statistical inspection
- Fig. 1: p-values are very high when r is small,
- Fig. 2: p-values are very high but the confidence intervals must be wide; I am not sure whether drawing a graph there is appropriate,
- Fig. 3: p-values are very low when the curve fit is almost perfect; this observation can be confusing.
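The Fig. 3 observation may be easier to read once r and the sample size are put side by side. The sketch below assumes the Pearson test that `cor.test` runs by default, and it picks the `INTG` and `DMNR` columns of `USJudgeRatings` purely as an illustrative pair; the names `t_manual` and `p_manual` are hypothetical. The test's null hypothesis is zero correlation, so a nearly perfect linear relation is the case where a very small p-value is expected.

```r
# Illustration only: INTG and DMNR are columns 2 and 3 of USJudgeRatings,
# chosen here as an example pair.
x <- USJudgeRatings$INTG
y <- USJudgeRatings$DMNR
n <- length(x)

test <- cor.test(x, y)            # Pearson correlation test (default method)
r <- unname(test$estimate)

# The Pearson test statistic grows with both |r| and n:
#   t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
t_manual <- r * sqrt((n - 2) / (1 - r^2))
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

print(c(r = r, t = t_manual, p = p_manual))
```

The same formula shows the other direction: when r is close to 0, t stays small and the p-value stays high unless n is very large.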
Input data test cases
Real-life data example #1 as the "extreme" example; its output is in Fig. 1
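## 1 To make a list of lists
set.seed(24)                       # kept from the original; nothing below is random
A <- 541650
m1 <- matrix(1:A, ncol=4, nrow=A)  # 1:A is recycled into four identical columns
str(m1)
a <- 360; b <- 1505; c <- 4
# pad m1 with NA up to a*b*c elements and reshape into a 360 x 1505 x 4 array
m2 <- array(`length<-`(m1, a*b*c), dim = c(a,b,c))
# one 1505 x 1505 correlation matrix per slice
res <- lapply(seq(dim(m2)[3]), function(i) cor(m2[,,i]))
str(res)
# first eigenvector of each correlation matrix, with NA correlations set to 0
res <- lapply(res, function(x) eigen(replace(x, is.na(x), 0))$vectors[, 1])
str(res)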
Minimum example #2 and its application output in Fig. 2
A <- 1505
# rnorm(n) with a vector n draws length(n) values, so each element has A values
res <- list(rnorm(A), rnorm(rnorm(A)), rnorm(rnorm(rnorm(A))), rnorm(rnorm(rnorm(rnorm(A)))))
str(res)
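For independent noise like example #2, high p-values are the typical outcome but not a guaranteed one. The sketch below is a small simulation under assumed settings (the seed and the number of simulated pairs `n_sim` are arbitrary choices, not part of the original setup); it estimates how often two independent normal samples of length 1505 fall below p = 0.05 by chance alone.

```r
set.seed(1)      # arbitrary seed, only for reproducibility of this sketch
A <- 1505        # same length as in example #2
n_sim <- 1000    # number of simulated pairs (an arbitrary choice)

# p-value of the Pearson test for one pair of independent normal samples
p_values <- replicate(n_sim, cor.test(rnorm(A), rnorm(A))$p.value)

# Share of pairs that look "significant" at the 5% level by chance alone
frac_below_005 <- mean(p_values < 0.05)
print(frac_below_005)
```

Under the null hypothesis the share should sit near the nominal 5%, so an occasional low p-value in a panel of pure noise is not by itself evidence of a relation.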
Standard input example #3 is the USJudgeRatings data that Didzis used; its output is in Fig. 3
res <- USJudgeRatings[,c(2:3,6,1,7)]
To make the p-value corrgram
## 2 Didzis https://stackoverflow.com/a/15271627/54964
panel.cor <- function(x, y, digits=2, cex.cor)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    test <- cor.test(x,y)
    Signif <- ifelse(round(test$p.value,3)<0.001,"p<0.001",paste("p=",round(test$p.value,3)))
    text(0.5, 0.25, paste("r=",txt))
    text(.5, .75, Signif)
}
panel.smooth <- function (x, y, col = "blue", bg = NA, pch = 18,
                          cex = 0.8, col.smooth = "red", span = 2/3, iter = 3, ...)
{
    points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    ok <- is.finite(x) & is.finite(y)
    if (any(ok))
        lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
              col = col.smooth, ...)
}
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
data <- res
str(data)
pairs(data,
lower.panel=panel.smooth, upper.panel=panel.cor,diag.panel=panel.hist)
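In examples #1 and #2, `res` is an unnamed list, and the diagonal labels that `pairs` draws come from the column names, so they may turn out unwieldy. A minimal sketch of one way to prepare the input, assuming equal-length numeric vectors; `res_demo`, `data_demo` and the `V1`...`V4` names are hypothetical and only stand in for `res` and `data`.

```r
# Stand-in for `res`: an unnamed list of equal-length numeric vectors
set.seed(1)
res_demo <- list(rnorm(10), rnorm(10), rnorm(10), rnorm(10))

# Give each element a short name, then convert to a data frame
names(res_demo) <- paste0("V", seq_along(res_demo))
data_demo <- as.data.frame(res_demo)

str(data_demo)
```

A data frame prepared this way can be passed to `pairs` in place of `data`; example #3 already has column names and does not need this step.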
About the upper bound for significance
The source says that a study which is not statistically significant with 15K points may become significant with 2-3M points.
My observation is that it becomes significant with 6-7M points with my data sample and study; data: 541650 541650 6925867.
So I think that, in theory, there is no problem in plotting such big data sets in Didzis' p-value corrgram.
His algorithm may make some simplifications, or cause the points to cluster, so that many panels look like an increasing diagonal or a y=0 line.
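The sample-size effect described above can be sketched without any data, assuming the Pearson test statistic t = r * sqrt((n - 2) / (1 - r^2)). The helper `p_from_r`, the value r = 0.01 and the sample sizes are all hypothetical choices for illustration, not values taken from my data.

```r
# Hypothetical helper: two-sided p-value of the Pearson correlation test
# for a given sample correlation r and sample size n
p_from_r <- function(r, n) {
  t <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t), df = n - 2)
}

r <- 0.01                        # an arbitrary, very weak correlation
n <- c(15e3, 5e5, 3e6, 7e6)      # sample sizes to compare
print(data.frame(n = n, p_value = p_from_r(r, n)))
```

With r held fixed, the p-value only shrinks as n grows, so a low p-value on millions of points says little about how strong the relation is; the size of r still has to be read separately.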
OS: Debian 8.5
R: 3.3.1