Testing for Normality in data set with many sites

Question

I'm a comparative newbie to R and am trying to use it to assess the normality (or otherwise) of water quality data from around 1900 individual sites. Each site has a unique Sitecode, with the results (Meas_res) of samples taken over a 3-year period. The data are held in a .csv file sorted in Sitecode / Sample Date order. I would like to run the Anderson-Darling test (and other similar assessments from the nortest package) to get an output in the general form of:

Sitecode, ad.test output (written back to a .csv file)

Could someone give me either a set of code to run the test or guidance on how to prepare this?

Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Thomas, Jul 09 '13 at 11:53
The code will be a combination of `read.csv`, `ad.test`, and `write.csv`. More help if you provide a reproducible example. — QuantIbex, Jul 09 '13 at 12:02
possible duplicate of [Seeing if data is normally distributed in R](http://stackoverflow.com/questions/7781798/seeing-if-data-is-normally-distributed-in-r) — Brian Diggs, Jul 09 '13 at 17:42

John · Answer 1 · 2013-07-09T19:31:49.670

Unless you have some justification for running the test on every site, and perhaps an explanation of why you think it will differentiate certain sites, bear in mind that with around 1900 sites roughly 5% of them (about 100) will come out as non-normal at alpha = 0.05 simply by chance, even if every site is truly normal. If you want to check whether water quality data are normal in general, it's best to check all of the data at once. The means will vary from site to site, so what you can check is the residuals of a linear model with the factor Sitecode as a predictor.

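library(nortest)  # provides ad.test()
dat <- read.csv('myDataFileName.csv')
# factor() makes sure each site gets its own mean, even if Sitecode is numeric
m <- lm(Meas_res ~ factor(Sitecode), data = dat)
res <- resid(m)  # deviations of each result from its site mean
ad.test(res)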

The last line runs the Anderson-Darling test on the pooled residuals, res, rather than on each site separately.

But just for fun, try running a few AD tests on samples of the same size drawn from a known normal distribution, and look at the qqnorm plots to see what they look like.

y <- rnorm( nrow(dat) )
ad.test(y)
qqnorm(y); qqline(y)

What you'll find with so many points is that you'll still fail the AD test once in a while, even though the data still look surprisingly normal. So the answer is probably not an AD test. It is probably best to just look at a plot of the residuals and assess normality there.
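The "once in a while" can be put into numbers with a small simulation. The sketch below draws 1900 samples from a true normal distribution and counts how many fail at the 0.05 level. The 36 observations per site (monthly sampling over three years), the seed, and the names norm_test, p_vals and false_alarms are assumptions made for illustration. It uses ad.test when nortest is installed and otherwise falls back to base R's shapiro.test, so that it runs either way.

```r
# Use ad.test from nortest if it is installed; otherwise fall back to
# shapiro.test from base R (a different normality test, same idea).
norm_test <- if (requireNamespace("nortest", quietly = TRUE)) {
  nortest::ad.test
} else {
  shapiro.test
}

set.seed(1)          # arbitrary seed, only for repeatability
n_sites    <- 1900   # number of sites in the question
n_per_site <- 36     # ASSUMPTION: about one sample per month for 3 years

# Every simulated "site" is truly normal, so each rejection is a false alarm
p_vals <- replicate(n_sites, norm_test(rnorm(n_per_site))$p.value)
false_alarms <- sum(p_vals < 0.05)

cat(false_alarms, "of", n_sites, "normal samples were flagged as non-normal\n")
```

With a well-calibrated test the count should land near 5% of the sites, which is where a figure of about 100 false alarms among 1900 sites comes from.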

Going back to my first comment, a normality test only tells you whether you can detect a deviation from normality. Just as with t-tests, it is extremely sensitive at very high N, so even trivial deviations get flagged, and when the data really are normal it still gives false alarms at the alpha rate. It does not tell you that data are normal, so "passing" the tests will not demonstrate that the data are normal. Given that they are tests against normality, what they will do is show you which sites are not normal (with many false alarms). Without some reason for believing that some of the sites aren't normal, your planned tests are probably not what you want to be doing.
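If a per-site table in the form the question describes is still wanted despite these caveats, a minimal sketch could look like the following. The simulated data frame stands in for read.csv('myDataFileName.csv'); the site labels, the output file name and the helper names (norm_test, groups, rows, per_site, out_file) are hypothetical. As a precaution, sites with fewer than 8 results or with no variation are reported as NA rather than tested, since ad.test needs more than 7 observations.

```r
# Use ad.test from nortest if it is installed; otherwise fall back to
# shapiro.test from base R so the sketch still runs.
norm_test <- if (requireNamespace("nortest", quietly = TRUE)) {
  nortest::ad.test
} else {
  shapiro.test
}

# Hypothetical stand-in for: dat <- read.csv('myDataFileName.csv')
set.seed(42)
dat <- data.frame(
  Sitecode = rep(sprintf("S%03d", 1:20), each = 36),
  Meas_res = rnorm(20 * 36, mean = 7, sd = 0.5),
  stringsAsFactors = FALSE
)

# One vector of results per site
groups <- split(dat$Meas_res, dat$Sitecode)

# Run the test site by site and collect one row per site
rows <- lapply(names(groups), function(site) {
  x <- groups[[site]]
  x <- x[!is.na(x)]
  if (length(x) < 8 || sd(x) == 0) {
    # Too few or constant values: report NA instead of testing
    return(data.frame(Sitecode = site, n = length(x),
                      statistic = NA_real_, p_value = NA_real_,
                      stringsAsFactors = FALSE))
  }
  tst <- norm_test(x)
  data.frame(Sitecode = site, n = length(x),
             statistic = unname(tst$statistic), p_value = tst$p.value,
             stringsAsFactors = FALSE)
})
per_site <- do.call(rbind, rows)

# Written to a temporary directory here; replace with your own path
out_file <- file.path(tempdir(), "ad_results.csv")
write.csv(per_site, out_file, row.names = FALSE)
```

The p_value column should be read with the false-alarm rate in mind: a handful of small p-values among many sites is expected even when nothing is wrong.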
