R chi-squared statistic for two different distributions

Question

I have two files (random1.dat and random2.dat), each generated from a random uniform distribution with a different seed:

random1.dat: http://www.filedropper.com/random1_1
random2.dat: http://www.filedropper.com/random2

I would like to use R to run a chi-squared test to check whether the two distributions are statistically the same. To do that I tried:

x1 -> read.table("random1.dat")
x2 -> read.table("random2.dat")
chisq.test(x1,x2)

but I receive an error message:

'x' and 'y' need to have the same length

The problem is that both files have 1000 rows, so I don't understand the error. Another question: if I want to automate this process (iterate it), for instance 100 times with 100 different files, can I do something like:

DO i=1,100
x1 -> read.table("random'(i)'.dat")
x2 -> read.table("fixedfile.dat")
chisq.test(x1,x2)
save results from the chisq analysis
END DO

Thanks so much for your help.

ADDED:

@eipi10,

I tried the first method you gave here and it works well for the data you generate. But when I try it on my own data (I put two uniform samples, generated with different seeds, into a single file as a 2-column matrix of 1000 rows), something does not work correctly:

I load the file with: dat = read.table("random2col.dat");
I use the command: csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x))) and a warning message appears;
finally I use: unlist(lapply(csq, function(x) x$p.value)), but the output is something like:

[...] 1 1 1 1 1 1 1 1 1 1 1 1 1
[963] 1 1 1 1 1.....1 1 1 1
[1000] 1

Your assignments are backwards. Try using `<-` or `=` instead of `->` in your `read.table` lines. For doing this 100 times with different files, [this question](http://stackoverflow.com/q/5799096/903061) should get you started, or [this one](http://stackoverflow.com/a/14958740/903061). — Gregor Thomas, Jun 05 '14 at 23:32

eipi10 · Answer 1 · 2014-06-11T19:53:48.463

I don't think you need a loop; you can use lapply instead. Also, you're entering x1 and x2 as separate arguments. When you do this, chisq.test computes a contingency table from the two sets of values, which wouldn't be meaningful for columns of real numbers. Instead, you need to feed chisq.test a single matrix or data frame whose columns are x1 and x2. But even then, chisq.test expects count data, which isn't what you have here (although the "expected" frequency doesn't necessarily have to be an integer). In any case, here's some code that will make the test run the way you seem to be hoping:

# Simulate data: 5 columns of data, each from the uniform distribution
dat = data.frame(replicate(5, runif(20)))

# Chi-Square test of each column against column 1.
# Note use of cbind to combine the two columns into a single matrix,
# rather than entering each column as separate arguments.
csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x)))

# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
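
One caveat with the code above, sketched under the assumption that your data frame has only two columns (as with the random2col.dat file mentioned in the question): dat[,-1] then drops to a plain numeric vector, so lapply would run one test per value instead of one per column. That would be consistent with getting 1000 p-values back, though I can't confirm it without seeing the file. Adding drop = FALSE is one way to keep the column structure. Here dat2, cols and csq2 are made-up names, and dat2 is a simulated stand-in for the real data:

```r
# Simulated stand-in for a 2-column file of 1000 rows (assumption, not the real data)
set.seed(1)
dat2 <- data.frame(a = runif(1000), b = runif(1000))

# With only two columns, dat2[, -1] is no longer a data frame but a
# numeric vector, so lapply() would iterate over its individual values.
is.data.frame(dat2[, -1])   # FALSE
length(dat2[, -1])          # 1000

# drop = FALSE keeps a one-column data frame, so lapply() sees one column.
cols <- dat2[, -1, drop = FALSE]

# Same test as above. R may warn that the chi-squared approximation is
# incorrect, because these are small real numbers rather than counts.
csq2 <- lapply(cols, function(x) chisq.test(cbind(dat2[, 1], x)))

length(csq2)                # 1 test, for the single remaining column
sapply(csq2, function(x) x$p.value)
```

With three or more columns dat[,-1] stays a data frame, which is why the simulated 5-column example does not show the problem.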

On the other hand, if you were intending your data to be two streams of values that would then be converted into a contingency table, here's an example of that:

# Simulate data of 5 factor variables, each with 10 different levels
dat = data.frame(replicate(5, sample(c(1:10), 1000, replace=TRUE)))

# Chi-Square test of each column against column 1. Here the two columns of data are 
# entered as separate arguments, so that chisq.test will convert them to a two-way 
# contingency table before doing the test.
csq = lapply(dat[,-1], function(x) chisq.test(dat[,1],x))

# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
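
For the part of the question about repeating this over many files, here is a rough sketch of the lapply approach. It assumes each file holds a single column of numbers and follows the file names from the question (random1.dat, random2.dat, ..., and fixedfile.dat). The names data_dir, n_files and summary_tab are made up for illustration, and the files are simulated in a temporary directory so the example runs on its own:

```r
# Write small stand-in files to a temporary directory (simulated data,
# only so that the example is self-contained).
set.seed(42)
data_dir <- tempdir()
n_files  <- 5   # the question uses 100
write.table(runif(1000), file.path(data_dir, "fixedfile.dat"),
            row.names = FALSE, col.names = FALSE)
for (i in seq_len(n_files)) {
  write.table(runif(1000), file.path(data_dir, sprintf("random%d.dat", i)),
              row.names = FALSE, col.names = FALSE)
}

# Read the fixed file once, then build the list of files to compare against it.
fixed <- read.table(file.path(data_dir, "fixedfile.dat"))[[1]]
files <- file.path(data_dir, sprintf("random%d.dat", seq_len(n_files)))

# One test per file, using the matrix form from the first example.
# Swap in chisq.test(fixed, x) if the contingency-table form is what you want.
# R may warn about the chi-squared approximation for non-count data.
results <- lapply(files, function(f) {
  x <- read.table(f)[[1]]
  chisq.test(cbind(fixed, x))
})

# Collect the statistic and p-value of every test in one data frame.
summary_tab <- data.frame(
  file      = basename(files),
  statistic = sapply(results, function(r) unname(r$statistic)),
  p.value   = sapply(results, function(r) r$p.value)
)
summary_tab
```

If you need the results on disk, something like write.csv(summary_tab, "chisq_results.csv", row.names = FALSE) would save them, where the output file name is again just a placeholder.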
