
Below is a small sample similar to my dataset:

require(dplyr)
alldata
site    date    percent_rank    Label
01A  2013-01-01    0.32         Normal
01B  2013-01-01    0.12         Low
01C  2013-01-01    0.76         High
02A  2013-01-01     0           N/A
02B  2013-01-01    0.16         Low
02C  2013-01-01    0.5          Normal
01A  2013-01-02    0.67         Normal
01B  2013-01-02    0.01         Low
01C  2013-01-02    0.92         High

I assigned each percent_rank a label based on its value, using cutpoints at 0, 0.25, 0.75, and 1 for the three categories (Low for 0 to 0.25, Normal for 0.25 to 0.75, High for 0.75 to 1; a sketch of this labelling step follows the example table below). I would now like to produce a summary table in this format:

site  Low  Normal  High  Missing
01A   32   47      92    194
01B   232  23      17    93
01C   82   265     12    6

where each site gets a count of its Low, Normal, and High occurrences across all dates (there is one record per site for every day of the year), and N/A values are counted in the "Missing" column.
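For reference, the labelling step looked roughly like this (a sketch using base cut(); my actual code may have differed, and in the sample above a percent_rank of exactly 0 carries the "N/A" label):

alldata <- alldata %>%
  mutate(Label = cut(percent_rank, breaks = c(0, 0.25, 0.75, 1),
                     labels = c("Low", "Normal", "High")),            # intervals (0,0.25], (0.25,0.75], (0.75,1]
         Label = ifelse(is.na(Label), "N/A", as.character(Label)))    # a rank of exactly 0 falls outside the breaks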

I have tried the following:

alldata <- alldata %>% group_by(site) %>% mutate(length(Label == "Low"))

which returns the total number of records rather than a count of "Low" per site, and

alldata <- alldata %>% group_by(site) %>% mutate(length(which(Label == "Low")))

which returns a value several thousand higher than the total number of records. The idea was to repeat this call in four separate mutate lines (one per category) to create four new columns, which would give me my summary table. I have also tried some variations of aggregate(), although its function argument was less clear to me for what I am aiming to do. This seems like it should be a straightforward thing to do (group_by served me well for calculating the percent ranks and their labels), but I have not yet found a solution. Any tips are much appreciated!
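Spelled out, the four-column plan I had in mind looks roughly like this (a sketch of the intent only; with mutate() the counts come back attached to every row rather than collapsed to one row per site, which is where I am stuck):

alldata %>%
  group_by(site) %>%
  mutate(Low     = length(which(Label == "Low")),
         Normal  = length(which(Label == "Normal")),
         High    = length(which(Label == "High")),
         Missing = length(which(Label == "N/A")))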

acersaccharum

3 Answers


There are three ways to do the counting in dplyr, with reshape2's dcast handling the reshape to wide format. The first is the most verbose; the other two use convenience functions to shorten the code:

library(reshape2)
library(dplyr)

alldata %>% group_by(site, Label) %>% summarise(n=n()) %>% dcast(site ~ Label)

alldata %>% group_by(site, Label) %>% tally %>% dcast(site ~ Label)

alldata %>% count(site, Label) %>% dcast(site ~ Label)
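
If you would rather see explicit zeros than NA for site/Label combinations that never occur, dcast's fill argument handles that (a small variation on the last line above; value.var = "n" just makes the guessed value column explicit):

alldata %>% count(site, Label) %>% dcast(site ~ Label, value.var = "n", fill = 0)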
eipi10

To just produce a summary table, you could use table:

with(df, table(site, Label, useNA="ifany"))[, c(2,4,1,3)]   # reorder columns to Low, Normal, High, N/A

     Label
site  Low Normal High N/A
  01A   0      2    0   0
  01B   2      0    0   0
  01C   0      0    2   0
  02A   0      0    0   1
  02B   1      0    0   0
  02C   0      1    0   0
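
If you want this as a data frame with the column names from your target layout, the table can be converted afterwards (a sketch; tab and out are just illustrative names, and the rename of "N/A" to "Missing" only matches the layout asked for in the question):

tab <- with(df, table(site, Label, useNA="ifany"))[, c(2,4,1,3)]
out <- as.data.frame.matrix(tab)                # keep the wide, one-row-per-site shape
names(out)[names(out) == "N/A"] <- "Missing"    # match the column name in the question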

data

df <- read.table(header=T, text="site    date    percent_rank    Label
01A  2013-01-01    0.32         Normal
01B  2013-01-01    0.12         Low
01C  2013-01-01    0.76         High
02A  2013-01-01     0           N/A
02B  2013-01-01    0.16         Low
02C  2013-01-01    0.5          Normal
01A  2013-01-02    0.67         Normal
01B  2013-01-02    0.01         Low
01C  2013-01-02    0.92         High")
lmo

We can use dcast from data.table, which also has a fun.aggregate argument and is very fast.

library(data.table)
dcast(setDT(alldata), site~Label, length)

Or using dplyr/tidyr

library(dplyr)
library(tidyr)
alldata %>%
    group_by(site, Label) %>%
    tally() %>%
    spread(Label, n)
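
If you prefer explicit zeros instead of NA for combinations that never occur, spread() takes a fill argument (a small variation on the above; count() is equivalent to the group_by() + tally() pair):

alldata %>%
    count(site, Label) %>%
    spread(Label, n, fill = 0)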

A base R option would be

 reshape(aggregate(date~site + Label, alldata, length), 
           idvar = "site", timevar="Label", direction="wide")
akrun
    This is perfect! I am familiar with the `dplyr` and `tidyr` packages so this aligns well with the format of the rest of my code. Thank you @akrun, and to all for the speedy responses. – acersaccharum Jun 21 '16 at 18:51