Splitting a large vector into intervals in R

Question

I'm not too good with R. I ran a loop and ended up with a huge vector of 11,303,044 rows. I have another vector, from another loop, with 1,681 rows.

I want to run a chisq.test to compare their distributions, but since they are of different lengths, it's not working.

I tried taking 1,681-sized samples from the 11,303,044-sized vector to match the length of the 2nd vector, but I get different chisq.test results every time I run it.

I'm thinking of splitting the 2 vectors into an equal number of intervals.

Let's say

vector1:

temp.mat<-matrix((rnorm(11303044))^2, ncol=1) 
head(temp.mat)
dim(temp.mat)

vector2:

temp.mat<-matrix((rnorm(1681))^2, ncol=1) 
head(temp.mat)
dim(temp.mat)

How do I split them into equal intervals so that the resulting vectors have the same length?

Any test is almost surely going to be highly significant due to the large numbers of cases. It might make more sense to compare with a qqplot (perhaps with a bit of sampling to reduce the plotting load.) — IRTFM, Oct 10 '13 at 22:59

IRTFM · Answer 1 · 2013-10-10T23:33:57.150

1

mat1<-matrix((rnorm(1130300))^2, ncol=1) # only one-tenth the size of your vector
smat=sample(mat1, 100000)                #and take only one-tenth of that
mat2<-matrix((rnorm(1681))^2, ncol=1)
qqplot(smat,mat2)                       #and repeat the sampling a few times

The result is interesting from a statistical point of view. At the higher levels of "departure from the mean", the large sample always departs from a "good fit", which is not surprising because it has a higher number of really extreme values.
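
A Q-Q plot compares the two samples quantile by quantile, so the same idea also gives two equal-length summaries regardless of the original sizes. Here is a minimal sketch, assuming simulated stand-ins for the real data; the names `big`, `small`, `probs`, `q_big`, and `q_small` are illustrative only and do not come from the question:

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
big   <- rnorm(1130300)^2            # stand-in for the large vector
small <- rnorm(1681)^2               # stand-in for the small vector

# Evaluate both samples at the same probabilities, so the two
# summaries have the same length whatever the input sizes are.
probs   <- seq(0.01, 0.99, by = 0.01)
q_big   <- quantile(big,   probs = probs, names = FALSE)
q_small <- quantile(small, probs = probs, names = FALSE)

# Points close to the dashed line y = x suggest similar distributions.
plot(q_big, q_small,
     xlab = "large sample quantiles", ylab = "small sample quantiles")
abline(0, 1, lty = 2)
```

The choice of 99 percentiles is arbitrary; a finer grid shows more of the tail behavior, where the small sample is least reliable.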

edited Oct 10 '13 at 23:33

answered Oct 10 '13 at 23:05

IRTFM


Thanks for your help, DWin. This generated an interesting plot. – Yeshyyy Oct 10 '13 at 23:32
Your effort to compare qualitatively with histograms seems reasonable, especially if you constrain `breaks` to be the same. The `qqplot` results in a finer-grained comparison, and would disclose any "high frequency" differences or weird "tail behavior". But your search for a "significance test" is not recommended on such large samples, since the result will almost always be "significant" but often meaningless. – IRTFM Oct 10 '13 at 23:37
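
The histogram comparison with shared breaks mentioned in the comment above could look roughly like this. It is only a sketch on simulated data: the 30 equal-width bins and the names `breaks`, `p_big`, and `p_small` are assumptions made for illustration.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
big   <- rnorm(100000)^2             # stand-in for the large vector
small <- rnorm(1681)^2               # stand-in for the small vector

# One set of breaks spanning both samples, so that bin i covers
# the same interval for both vectors.
breaks <- seq(0, max(big, small), length.out = 31)   # 30 equal-width bins

# Proportion of each sample that falls in each bin.
p_big   <- as.vector(table(cut(big,   breaks, include.lowest = TRUE))) / length(big)
p_small <- as.vector(table(cut(small, breaks, include.lowest = TRUE))) / length(small)

# Side-by-side bars, one pair per bin.
barplot(rbind(p_big, p_small), beside = TRUE,
        legend.text = c("large", "small"))
```

Both `p_big` and `p_small` have one entry per bin, which gives the two same-length vectors the question asks for. Working with proportions instead of raw counts keeps the two samples on a comparable scale.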

mrip · Accepted Answer · 2013-10-10T23:41:08.183

0

chisq.test is Pearson's chi-square test. It is designed for discrete data, and with two input vectors, it will coerce the inputs you pass in to factors, and it tests for independence, not equality in distribution. This means, for example, that the order of the data will make a difference.

> set.seed(123)
> x<-sample(5,10,T)
> y<-sample(5,10,T)
> chisq.test(x,y)

    Pearson's Chi-squared test

data:  x and y
X-squared = 18.3333, df = 16, p-value = 0.3047

Warning message:
In chisq.test(x, y) : Chi-squared approximation may be incorrect
> chisq.test(x,y[10:1])

    Pearson's Chi-squared test

data:  x and y[10:1]
X-squared = 16.5278, df = 16, p-value = 0.4168

Warning message:
In chisq.test(x, y[10:1]) : Chi-squared approximation may be incorrect
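
The order dependence follows from how the two-vector form works: the inputs are paired element by element, cross-tabulated, and the test is run on that contingency table. A small sketch of the equivalence on the same kind of simulated data (the names `tab`, `s_vec`, and `s_tab` are illustrative; the exact numbers depend on the R version's random number generator):

```r
set.seed(123)
x <- sample(5, 10, TRUE)
y <- sample(5, 10, TRUE)

# chisq.test(x, y) pairs x[i] with y[i] and tests the resulting
# contingency table for independence.
tab   <- table(x, y)
s_vec <- suppressWarnings(chisq.test(x, y))$statistic
s_tab <- suppressWarnings(chisq.test(tab))$statistic
all.equal(unname(s_vec), unname(s_tab))

# Reversing y changes which values are paired, and so changes the table.
print(tab)
print(table(x, rev_y = y[10:1]))
```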

So I don't think that chisq.test is what you want, because it does not compare distributions. Maybe try something like ks.test, which will work with different length vectors and continuous data.

> set.seed(123)
> x<-rnorm(2000)^2
> y<-rnorm(100000)^2
> ks.test(x,y)

    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.0139, p-value = 0.8425
alternative hypothesis: two-sided

> ks.test(sqrt(x),y)

    Two-sample Kolmogorov-Smirnov test

data:  sqrt(x) and y
D = 0.1847, p-value < 2.2e-16
alternative hypothesis: two-sided

edited Oct 10 '13 at 23:41

answered Oct 10 '13 at 23:09

mrip


I tried Kolmogorov's test. However, because my first vector is of size 11,303,044, R printed: Error: cannot allocate vector of size 86.2 Mb. – Yeshyyy Oct 10 '13 at 23:24
How much memory do you have on your machine? Try calling `gc()` before the KS test. Or you could simply do a `t.test`. – mrip Oct 10 '13 at 23:26
Another option is to do the KS test with a random subset of the data. For example, `ks.test(x,sample(y,10000))`. – mrip Oct 10 '13 at 23:28
Sorry, I might've expressed myself wrong. I have a dataset of related/unrelated individuals. The 2 vectors I obtained (from 2 different loops) are for related individuals and unrelated individuals. So what I'm really looking for is to show that the 2 vectors resulting from their respective loops are different, because when I generated their respective histograms, they looked completely different. Thanks for your help :) – Yeshyyy Oct 10 '13 at 23:30
Yup ks.test(x,sample(y,10000)) worked. I even took a sample of 1,000,000 from the large vector. thanks heaps:) – Yeshyyy Oct 10 '13 at 23:33
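
Because each random subset gives a somewhat different result, one way to gauge that variability is to repeat the subsampled test a few times and look at the spread of the statistic. A hedged sketch with simulated stand-ins for the two vectors; the subset size of 10000 follows the comment above, while the 20 repeats and the name `res` are arbitrary choices for illustration:

```r
set.seed(123)                   # arbitrary seed, only for reproducibility
x <- rnorm(1681)^2              # stand-in for the small vector
y <- rnorm(1000000)^2           # stand-in for the large vector

# Run the two-sample KS test on 20 independent random subsets of y
# and keep the D statistic and p-value from each run.
res <- replicate(20, {
  kt <- ks.test(x, sample(y, 10000))
  c(D = unname(kt$statistic), p = kt$p.value)
})

summary(res["D", ])             # spread of D across subsets
summary(res["p", ])             # spread of the p-values
```

If the conclusion flips between subsets, that suggests the difference is small relative to sampling noise at that subset size.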
