First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.
It sounds like, fundamentally, your data is a collection of year, site, habitat, quadrant (or maybe line, I'm not sure from the description), and species, with each observation being that the species was observed in that site, habitat, quadrant, and year. For simplicity, a row is present only if the species is present.
In addition, there's the concept of type, which is associated with each species.
Analyzing with a contingency table
Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.
> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
  species quadrant habitat site year
1       1        1       1    1    1
2       2        1       1    1    1
3       3        1       1    1    1
4       4        1       1    1    1
5       5        1       1    1    1
6       6        1       1    1    1
And let's take a small sample and a large sample:
> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]
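We can use the ftable function to get this into the shape we want, a contingency table of year and site against habitat (the 12x4 table you describe; with the simulated data here it comes out as 9x4):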
> ftable(habitat ~ year + site, data=d.small)
          habitat 1 2 3 4
year site                
1    1            0 0 1 0
     2            0 0 1 1
     3            0 1 1 1
2    1            2 1 1 0
     2            1 1 0 2
     3            0 0 1 0
3    1            2 0 0 1
     2            0 1 0 1
     3            0 0 0 0
This will count the same species twice if it occurs in two different quadrants of the same site/habitat combination. We can discard the quadrant and unique-ify the remaining columns so that each species is counted once per cell:
> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))
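With only 20 sampled rows the two tables will rarely differ, so the effect of dropping the quadrant is easier to see on the larger sample. The following is a minimal sketch that rebuilds d.large as above and compares the totals; the names per.quadrant and per.species are made up for this illustration.

```r
# Rebuild the simulated data and the large sample from above
raw <- expand.grid(species = 1:93, quadrant = 1:20, habitat = 1:4,
                   site = 1:3, year = 1:3)
set.seed(100)
d.large <- raw[sample(nrow(raw), 1000), ]

# One count per (species, quadrant) sighting
per.quadrant <- ftable(habitat ~ year + site, data = d.large)

# One count per distinct species in each year/site/habitat cell
distinct <- unique(d.large[c("species", "habitat", "year", "site")])
per.species <- ftable(habitat ~ year + site, data = distinct)

print(per.quadrant)
print(per.species)

# Number of repeat sightings that were collapsed by unique()
sum(per.quadrant) - sum(per.species)
```

Every cell of per.species is at most the matching cell of per.quadrant, since unique() can only remove rows.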
Transforming (tidying) the source data
Transforming the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well).
> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
  year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1    1    1       2        3   0   1   0   0   1
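Here I'm making what seem like reasonable assumptions about what your data look like: one row per quadrant, with a 0/1 presence column for each species. We can then gather the species columns into rows and keep only the ones where the species is present: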
> library(tidyr)
> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
  year site habitat quadrant species present
2    1    1       2        3     sp2       1
5    1    1       2        3     sp5       1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
  year site habitat quadrant species
2    1    1       2        3     sp2
5    1    1       2        3     sp5
And now you can proceed with the analysis above.
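If tidyr is not available, the same wide-to-long step can be done in base R, though it is more verbose. This is only a sketch under the same assumptions as above: the data frame wide and its column names are hypothetical, and it assumes every column other than year, site, habitat, and quadrant is a 0/1 species indicator.

```r
# Hypothetical wide data: one row per quadrant, one 0/1 column per species
wide <- data.frame(year = c(1, 1), site = c(1, 1), habitat = c(2, 2),
                   quadrant = c(3, 4),
                   sp1 = c(0, 1), sp2 = c(1, 0), sp3 = c(0, 0))

id.cols <- c("year", "site", "habitat", "quadrant")
sp.cols <- setdiff(names(wide), id.cols)   # assumed: all remaining columns are species

# Repeat the identifying columns once per species column, then attach the
# species name and its 0/1 value. unlist() walks the species columns in
# order, which matches the order produced by rep(..., each = nrow(wide)).
long <- data.frame(
  wide[rep(seq_len(nrow(wide)), times = length(sp.cols)), id.cols],
  species = rep(sp.cols, each = nrow(wide)),
  present = unlist(wide[sp.cols], use.names = FALSE)
)
rownames(long) <- NULL

# Keep only the presences and drop the indicator column
tidy <- subset(long, present == 1, select = -present)
print(tidy)
```

The result has one row per observed species, which is the shape the ftable calls above expect.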
Merging in the species type data
Looking at your description a little more closely, I think you also want to merge in a parallel vector of species type information:
> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=TRUE)))
> merge(d.small, sp.type)
   species quadrant habitat site year type
1        6       16       4    2    3    2
2       27        9       2    2    2    4
3       27        8       4    2    1    4
4       32       18       1    2    2    4
5       33       18       1    1    2    2
6       45       14       4    2    2    3
7       49        6       2    3    1    1
8       54        3       3    2    1    2
9       55        2       1    1    3    3
10      56        2       4    3    1    2
11      56        1       3    1    1    2
12      57        7       2    1    2    1
13      62       18       4    2    2    3
14      70       19       1    1    2    3
15      77        2       3    3    1    4
16      80        7       3    1    2    1
17      81       17       1    1    3    2
18      82        5       2    2    3    3
19      86        9       4    1    3    3
20      87       10       3    3    2    3
And now you can use the subset, unique, and ftable approach above to get the data you need.
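Putting the pieces together, one possible end-to-end version looks like the sketch below. It assumes that what you want is the number of distinct species of each type per year and site; if you need a different breakdown, change the columns passed to unique and the ftable formula. The simulated data and the type assignment are random placeholders, as above, and the names merged, distinct, and by.type are made up for this example.

```r
# Simulated observations and a random species-to-type lookup (placeholders)
raw <- expand.grid(species = 1:93, quadrant = 1:20, habitat = 1:4,
                   site = 1:3, year = 1:3)
set.seed(100)
d <- raw[sample(nrow(raw), 1000), ]
sp.type <- data.frame(species = 1:93,
                      type = factor(sample(1:4, 93, replace = TRUE),
                                    levels = 1:4))

# Attach the type to every observation (joins on the shared 'species' column)
merged <- merge(d, sp.type)

# Count each species once per year/site, regardless of habitat or quadrant
distinct <- unique(merged[c("species", "type", "year", "site")])

# Rows: year x site; columns: species type
by.type <- ftable(type ~ year + site, data = distinct)
print(by.type)
```

Because merge joins on the column the two data frames have in common, the species column must have the same name and coding in both.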