Get frequencies (absolute and relative) of levels of a categorical variable from incidence binary data by combination of columns factors

Question

I would like to have the frequencies of each levels of a categorical variable (row vector) denoting ecological type (3 levels: H,F,T) of a set of 93 herbaceous plants for the observed species present (=1) conditioning by sites (3 levels: A,B,C), habitats (3 levels: 1,2,3,4) and years (3 levels: 1,2,3).

I know the procedure is passed by tapply(), but the messy thing come from the logic operator for linking levels of the categorical variable (H,F,T) for the present species (=1) accross all of the species conditioning by combination of columns factors.

This could be summarized by a 12 x 3 contingency table indicating the numbers of each ecological types (3) of species per sites (3) and habitats (4).

Ex of my data (each habitat contain 20 lines): for each species (Sp1 to Sp93) 0 for absent and 1 for present. Vector "type" contain ecological type for each species.

Site,Habitat,Year,Sp1,Sp2,Sp3,Sp4,Sp5,Sp6,...,Sp93

type= c(H,H,F,T,F,T,H,....T) # vector of length 93

Thank you in advance.

I hope this would help describe my data objects better.

data = read.csv(file = "Veg_06.csv", header = TRUE)

data = data[1:240, -c(1,4:7)]

Ilot # Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ... each level has 4 sublevels (from "Site") with 20 lines each, adding up to 80 lines by levels.

Site # Factor w/ 4 levels "Am","Av","CP","CS": 2 2 2 2 2 2 2 2 2 2 ...

Sp # int [1:240] 0 0 0 0 0 0 0 0 0 0 ... either "0" or "1" for absence or presence of species.

veg # Factor w/ 3 levels "H","F","T": 3 3 2 2 3 1 2 1 2 1 ... categorical factor indicating type of species.

It would be really helpful if you could post a reproducible example: for example, give the results of `dput` on a subset of your data with a reduced number of species ... — Ben Bolker, Sep 06 '11 at 19:15
PS: as usual, see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Ben Bolker, Sep 07 '11 at 15:42
I can't make a subset of my data because each level of site has 4 levels of habitats with each 20 lines representing the species for each quadrat along a transect on this habitat of a site for a given year. So for a single combination of site and habitat there is 80 lines with the columns of species. So even a subset is too big to reproduce here. I though the example of my dataset was comprehensible. — Me., Sep 08 '11 at 18:17

score 1 · Answer 1 · answered May 11 '15 at 18:07

First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.

It sounds like fundamentally, your data is a collection of year, site, habitat, quadrant(? maybe line, not sure from the description), species with the observation point being that species was observed in that site, habitat, quadrant, and year. For simplicity, a row is present if the species is present.

In addition, there's the concept of type, which is associated with each species.

Analyzing and contingency table

Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.

> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
  species quadrant habitat site year
1       1        1       1    1    1
2       2        1       1    1    1
3       3        1       1    1    1
4       4        1       1    1    1
5       5        1       1    1    1
6       6        1       1    1    1

And let's take a small sample and a large sample

> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]

We can use the ftable function to get this into a state that we want, the 12x4 contingency table, as

> ftable(habitat ~ year + site, data=d.small)
          habitat 1 2 3 4
year site
1    1            0 0 1 0
     2            0 0 1 1
     3            0 1 1 1
2    1            2 1 1 0
     2            1 1 0 2
     3            0 0 1 0
3    1            2 0 0 1
     2            0 1 0 1
     3            0 0 0 0

This will count the same species twice if it occurs in two different quadrants of the site/habitat mixture. We can discard the habitat and unique-ify to get the count across all of them

> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))

Transforming (tidying the source data)

To transform the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well)

> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
  year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1    1    1       2        3   0   1   0   0   1

Here I'm making assumptions about what your data look like that seem reasonable

> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
  year site habitat quadrant species present
2    1    1       2        3     sp2       1
5    1    1       2        3     sp5       1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
  year site habitat quadrant species
2    1    1       2        3     sp2
5    1    1       2        3     sp5

And now you can proceed with the analysis above.

Merging in the species type data

Looking at your description a little closer, I think you also want to merge in a parallel vector of species type information.

> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=T)))
> merge(d.small, sp.type)
   species quadrant habitat site year type
1        6       16       4    2    3    2
2       27        9       2    2    2    4
3       27        8       4    2    1    4
4       32       18       1    2    2    4
5       33       18       1    1    2    2
6       45       14       4    2    2    3
7       49        6       2    3    1    1
8       54        3       3    2    1    2
9       55        2       1    1    3    3
10      56        2       4    3    1    2
11      56        1       3    1    1    2
12      57        7       2    1    2    1
13      62       18       4    2    2    3
14      70       19       1    1    2    3
15      77        2       3    3    1    4
16      80        7       3    1    2    1
17      81       17       1    1    3    2
18      82        5       2    2    3    3
19      86        9       4    1    3    3
20      87       10       3    3    2    3

And now you can use the subset, unique, and ftable approach above to get the data you need.

IRTFM · Answer 2 · 2011-09-12T18:53:27.683

0

Assuming you had a dataframe with (among other things) the columns named: "sites", "habitats", "years":

dfrm <- data.frame( sites = sample( LETTERS[1:3], 20, replace=TRUE),
            habitats= sample( factor(1:4), 20, replace=TRUE),
            years = sample( factor(paste("Y",1:4, sep="_")), 20, replace=TRUE) )

Then this will give you an additional factor-mode column that encodes the various levels of each row.

dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years))

If you want non-populated levels then do nothing else. If you want possible levels that have no instances, then use drop=TRUE. Then you can analyze these within individual levels of the three classification variables.

edited Sep 12 '11 at 18:53

answered Sep 06 '11 at 19:47

IRTFM

258,963
21
364
487

So is dfrm the three classification variables ? and also what means non-populated levels ? thank you very much. – Me. Sep 12 '11 at 18:10
"dfrm" is simply the name I use for the dataframe that holds the input data when no example is posted. I will post the assumed data layout that was implied by your verbal description. – IRTFM Sep 12 '11 at 18:41
In fact, sites has length 4x20 = 80 lines. My question was rather HOW to relate an incidence matrix with a separate vector to find frequencies of each level of that vector for each site levels. Can we do that using the "with" function ? – Me. Sep 19 '11 at 19:17
`with` is just a convenience function so you don't need to keep writing "dfrm$" in front of esach column name. It is like a localized version of `attach`. The calculation of the interaction variable will not be upset by additional lines. Please describe your data objects better. Perhaps just attaching the results of `str` on all of them – IRTFM Sep 19 '11 at 19:30

Get frequencies (absolute and relative) of levels of a categorical variable from incidence binary data by combination of columns factors

2 Answers2

Analyzing and contingency table

Transforming (tidying the source data)

Merging in the species type data