5

I'm new to R and I have data that looks something like this:

categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)

Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.

I like my new dataframe output to look something like this:

A   B   C   count
1   1   1   1
1   1   0   2
1   0   1   0
0   1   1   2
1   0   0   2
0   1   0   0
0   0   1   2
0   0   0   0

The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.

Is there a function in R that will help me achieve this? Thank you in advance.

Rmeow
  • 113
  • 5

4 Answers4

1

We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:

library(dplyr)
library(tidyr)

as.data.frame.matrix(t(table(dat))) %>%
  group_by_all() %>%
  summarize(Count = n())

Result:

# A tibble: 5 x 4
# Groups:   A, B [?]
      A     B     C Count
  <int> <int> <int> <int>
1     0     0     1     2
2     0     1     1     2
3     1     0     0     2
4     1     1     0     2
5     1     1     1     1

Edit (thanks to @C. Braun). Here is how to also include the zero A, B, C combinations:

as.data.frame.matrix(t(table(dat))) %>%
  bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
  group_by_all() %>%
  summarize(Count = n()-1) 

or with complete, as suggested by @Ryan:

as.data.frame.matrix(t(table(dat))) %>%
  mutate(non_missing = 1) %>%
  complete(A, B, C) %>%
  group_by(A, B, C) %>%
  summarize(Count = sum(ifelse(is.na(non_missing), 0, 1))) 

Result:

# A tibble: 8 x 4
# Groups:   A, B [?]
      A     B     C Count
  <dbl> <dbl> <dbl> <dbl>
1     0     0     0     0
2     0     0     1     2
3     0     1     0     0
4     0     1     1     2
5     1     0     0     2
6     1     0     1     0
7     1     1     0     2
8     1     1     1     1
acylam
  • 18,231
  • 5
  • 36
  • 45
  • FYI, If you want to also include groups with `count = 0`, that isn't trivial because there is nothing like `drop = FALSE` in `dplyr::group_by()`. There are some workarounds here: https://stackoverflow.com/questions/22523131/dplyr-summarise-equivalent-of-drop-false-to-keep-groups-with-zero-length-in – qdread Jun 04 '18 at 17:35
  • 1
    One way to include all possible rows is to include all options before the `group_by`: `bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1)))` then change `n()` to `n() - 1`. – C. Braun Jun 04 '18 at 17:36
  • @qdread Yea, I was trying to figure that out. The problem is that those combinations does not actually appear in the `table`, so there is nothing to `drop` – acylam Jun 04 '18 at 17:38
  • @C.Braun Good idea! Let me add that to the answer. – acylam Jun 04 '18 at 17:39
  • @Ryan Yea, I can use `complete`, but I will have to add an indicator for added missing rows that were originally implicit. Using C. Braun's method I am adding all possible combinations of A, B, and C, so all I have to do at the end is `- 1` when counting. See my updates. – acylam Jun 04 '18 at 17:59
1

We have

xxtabs <- function(df, formula) {
    xt <- xtabs(formula, df)
    xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
    as.data.frame(xxt)
}

and

> xxtabs(dat, ~ animals + categories)
  A B C Freq
1 0 0 0    0
2 1 0 0    2
3 0 1 0    0
4 1 1 0    2
5 0 0 1    2
6 1 0 1    0
7 0 1 1    2
8 1 1 1    1

(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation

xt <- xtabs(~ animals + categories, dat)

then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame

xxt <- xtabs(~ ., as.data.frame.matrix(xt))

coerced to the desired form

as.data.frame(xxt)

I originally said this approach was 'arcane', because it relies on knowledge of the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see though that the other solutions also require this arcane knowledge, as well as knowledge of more obscure (e.g., complete(), group_by_all(), funs()) parts of the tidyverse. Also, the other answers are not (or at least not written in a way that allows) easily generalizable; xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data are present throughout the other answers.

One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping

dat %>% xxtabs(~ animals + categories)
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • _"knowledge of more obscure (e.g., complete(), group_by_all(), funs()) parts of the tidyverse"_. Might be to you, but they are still more human readable than a lot of the Base R functions, IMHO (which is one of main goals of tidyverse) – acylam Jun 05 '18 at 14:33
  • @useR I'm all for the consistency provided by tidyverse, but isn't there a simpler solution? those presented use ~10 functions, whereas base R is using just 3. Looking forward to your implementation! – Martin Morgan Jun 05 '18 at 15:01
  • Not to say that the more the better, but the reason that `tidyverse` is more expressive and easy to write/follow is precisely because it breaks down an operation into multiple smaller steps. It all depends on what you're looking for. Of course Base R with 3 function is nice if you are looking to minimize code, but I would argue that learning each individual step of `tidyverse` is more intuitive than learning the nuances of many Base R functions, because in a lot of cases, Base R functions do things implicitly and less consistent. I agree though that your answer is more generalization. – acylam Jun 05 '18 at 15:19
0

If I understood you correctly, this should do the trick.

require(tidyverse)

 dat %>% 
  mutate(value = 1) %>%
  spread(categories, value) %>%
  mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>% 
  mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
  group_by(A, B, C) %>%
  summarize(Count = n()) 

# A tibble: 5 x 4
# Groups:   A, B [?]
      A     B     C Count
  <dbl> <dbl> <dbl> <int>
1    0.    0.    1.     2
2    0.    1.    1.     2
3    1.    0.    0.     2
4    1.    1.    0.     2
5    1.    1.    1.     1
DJV
  • 4,743
  • 3
  • 19
  • 34
0

Adding a data.table solution. First, pivot animals against categories using dat. Then, create the combinations of A, B, C using CJ. Join that combinations with dat and count the number of occurrences for each combi.

dcast(as.data.table(dat), animals ~ categories, length)[
    CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35