Count number of occurrences of categorical variables in data frame (R)

Question

I have a data frame:

station     date        classification
 1    June - 01/16          A
 2    June - 03/16          B
 1    June - 01/16          A
 7    June - 01/16          C
 1    June - 03/16          A
 2    June - 03/16          B
 2    June - 03/16          B

I want to get the total number of occurrences of A, B and C, aggregated by station number and date.

For example, station 1 on June 01 has 2 As, while station 2 on June 03 has 3 Bs.

I tried:

aggregate(x = list(data_frame$classification), by = list(station=data_frame$station, Date=data_frame$date), function(x) length(unique(x))

Okay, thank you. I still get an error. "not all arguments have the same length" — Cybernetic, Apr 18 '16 at 17:03
I guess you want a `table` i.e `aggregate(classification~., data_frame, FUN= table)` because if you are using `length(unique(x))` then it will be all 1s. — akrun, Apr 18 '16 at 17:05
Ah nice...so that worked when only doing it by station. Trying to aggregate it by both station and date is causing the problem. Can't you aggregate by more than one column? — Cybernetic, Apr 18 '16 at 17:12
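
On the question of grouping by more than one column: base R's aggregate does accept several grouping variables through its formula interface. A minimal sketch, assuming the sample values shown in the question; the helper column n (a column of ones that gets summed) and the name counts are illustrations, not part of the original data:

```r
# Sample data, retyped from the question
data_frame <- data.frame(
  station = c(1L, 2L, 1L, 7L, 1L, 2L, 2L),
  date = c("June - 01/16", "June - 03/16", "June - 01/16", "June - 01/16",
           "June - 03/16", "June - 03/16", "June - 03/16"),
  classification = c("A", "B", "A", "C", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Add a column of ones (hypothetical helper) and sum it within each
# station/date/classification combination that occurs in the data
counts <- aggregate(n ~ station + date + classification,
                    data = transform(data_frame, n = 1L),
                    FUN = sum)
print(counts)
```

Summing a column of ones counts rows per group, whereas length(unique(x)) counts distinct values, which is why it returns 1 for every group here.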

akrun · Accepted Answer · 2016-04-18T17:22:00.637

7

If we need the counts of 'A', 'B' and 'C' as separate columns, it may be better to reshape. We convert the 'data.frame' to a 'data.table' with setDT(data_frame) and use dcast from data.table to reshape from 'long' to 'wide' format, specifying length as the fun.aggregate so that each cell holds the number of matching rows.

library(data.table)
dcast(setDT(data_frame), station+date~classification, length)
#   station         date A B C
#1:       1 June - 01/16 2 0 0
#2:       1 June - 03/16 1 0 0
#3:       2 June - 03/16 0 3 0
#4:       7 June - 01/16 0 0 1
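
If the long format is enough (one row per station, date and classification that actually occurs), data.table's special symbol .N gives the group size directly. A sketch, assuming the same sample values; dt and long_counts are names chosen here for illustration:

```r
library(data.table)

# Same sample values, built directly as a data.table
dt <- data.table(
  station = c(1L, 2L, 1L, 7L, 1L, 2L, 2L),
  date = c("June - 01/16", "June - 03/16", "June - 01/16", "June - 01/16",
           "June - 03/16", "June - 03/16", "June - 03/16"),
  classification = c("A", "B", "A", "C", "A", "B", "B")
)

# .N is the number of rows in each group defined by `by`
long_counts <- dt[, .N, by = .(station, date, classification)]
print(long_counts)
```

Unlike the dcast result, this keeps only combinations that occur, so there are no zero cells.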

A dplyr option is

library(dplyr)
data_frame %>%
        group_by(station, date, classification) %>%
        tally()
# station         date classification     n
#    (int)        (chr)          (chr) (int)
#1       1 June - 01/16              A     2
#2       1 June - 03/16              A     1
#3       2 June - 03/16              B     3
#4       7 June - 01/16              C     1
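
The group_by plus tally pair can also be written with dplyr's count(), which groups and tallies in one step. A sketch on the same sample values; the dplyr:: prefix is there in case plyr is also attached, since plyr exports its own count():

```r
library(dplyr)

# Sample data, retyped from the question
data_frame <- data.frame(
  station = c(1L, 2L, 1L, 7L, 1L, 2L, 2L),
  date = c("June - 01/16", "June - 03/16", "June - 01/16", "June - 01/16",
           "June - 03/16", "June - 03/16", "June - 03/16"),
  classification = c("A", "B", "A", "C", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# One row per observed combination, with the row count in column n
counts <- data_frame %>%
  dplyr::count(station, date, classification)
print(counts)
```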

data

data_frame <- structure(list(station = c(1L, 2L, 1L, 7L, 1L, 2L, 2L), 
date = c("June - 01/16", 
"June - 03/16", "June - 01/16", "June - 01/16", "June - 03/16", 
"June - 03/16", "June - 03/16"), classification = c("A", "B", 
"A", "C", "A", "B", "B")), .Names = c("station", "date", "classification"
), class = "data.frame", row.names = c(NA, -7L))

edited Apr 18 '16 at 17:22

answered Apr 18 '16 at 17:10


I guess reshape2's `dcast` also works here? – Frank Apr 18 '16 at 17:12
@Frank yes, but I thought that `data.table` would be fast – akrun Apr 18 '16 at 17:13
What is the setDT? I am getting the error "Columns specified in formula can not be of type list" – Cybernetic Apr 18 '16 at 17:19
@Cybernetic I updated with the dataset I used. `setDT` converts the data.frame to a data.table. Have you loaded `library(data.table)`? – akrun Apr 18 '16 at 17:22
Yes. Is setDT a method you wrote? I usually convert using data.table() – Cybernetic Apr 18 '16 at 19:13
Ah got it. Working now...thank you. – Cybernetic Apr 18 '16 at 19:19
@Cybernetic Glad to know that it works. The `setDT` function was already there for at least 2 recent versions of `data.table`. – akrun Apr 19 '16 at 02:23
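
On setDT versus converting with as.data.table(): setDT converts the existing object in place (by reference), while as.data.table() returns a new object and leaves the original untouched. A small sketch with a made-up two-row data frame; df and dt_copy are hypothetical names:

```r
library(data.table)

# Made-up example data, not the data from the question
df <- data.frame(station = c(1L, 2L), classification = c("A", "B"))

# as.data.table() returns a new data.table; df itself stays a plain data.frame
dt_copy <- as.data.table(df)
was_dt_before <- is.data.table(df)

# setDT() changes df itself, so no assignment is needed
setDT(df)
is_dt_after <- is.data.table(df)

print(c(before = was_dt_before, after = is_dt_after))
```

Because setDT modifies its argument, any later code that uses data_frame sees a data.table rather than a plain data.frame.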

shirewoman2 · Answer 2 · 2016-04-18T17:42:34.070

1

The package plyr is great for this.

library(plyr) 
count(data_frame, c("classification", "station", "date"))

edited Apr 18 '16 at 17:42

answered Apr 18 '16 at 17:12


But how do you specify that "classification" is what is being counted? – Cybernetic Apr 18 '16 at 17:24
Sorry, I missed that "classification" was in there. Edited to reflect that. – shirewoman2 Apr 18 '16 at 17:42

score 0 · Answer 3 · answered Apr 18 '16 at 18:07

0

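The SQL way, using the sqldf package. The table name in the query is the name of the data frame:

library(sqldf)
sqldf("select station, date, classification, count(classification) from data_frame group by station, date, classification")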

answered Apr 18 '16 at 18:07

Chirayu Chamoli

