Count occurrence in one variable based on another

Question

I have the following structured table (as an example):

   Class 1    Class 2
1   1           1
2   1           1
3   1           1
4   1           2
5   3           3
6   3           3
7   3           4
8   4           4

For each value of Class 1, I want to count how often the same value appears in Class 2 and display this as a percentage, grouped by Class 1. The result should look something like this:

 Class 1     n_class1    Percentage of occurrence in class 2 
1   1           4                  0.75
2   3           3                  0.666
3   4           1                  1.0

I have read a lot about the dplyr package and think the solution may be in there. I have also looked at many examples but have not yet found a solution. I'm new to programming, so I don't have the natural programmer's way of thinking yet; I hope someone can give me tips on how to do this.

I have managed to get n_class1 by using group by, but I am struggling to get the percentage of occurrence in class 2.

use your group by on class 2 then `percentage[i] = n_class2[i] / n_class1[i]` — xDreamCoding, Apr 08 '17 at 11:25

mt1022 · Accepted Answer · 2017-04-08T11:40:25.997

3

You can do this by creating a new column, `in.class1`, with `mutate`:

library(dplyr)
df <- data.frame(
    class1 = rep(c(1, 3, 4), c(4, 3, 1)),
    class2 = rep(c(1, 2, 3, 4), c(3, 1, 2, 2))
)

df %>%
    mutate(in.class1 = class2 == class1) %>%
    group_by(class1) %>%
    summarise(n_class1 = n(),
              class2_percentile = sum(in.class1) / n()
    )

# # A tibble: 3 × 3
#   class1 n_class1 class2_percentile
#    <dbl>    <int>             <dbl>
# 1      1        4         0.7500000
# 2      3        3         0.6666667
# 3      4        1         1.0000000

As suggested by Jaap in a comment, this can be simplified to:

df %>%
    group_by(class1) %>%
    summarise(
        n_class1 = n(),
        class2_percentile = sum(class1 == class2) / n())
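
A further shorthand, offered as a sketch rather than as part of the original answer: the mean of a logical vector in R is the proportion of `TRUE` values, so `mean(class1 == class2)` should give the same share as `sum(class1 == class2) / n()`. The column name `share_match` and the variable `res` are my own, chosen for illustration.

```r
library(dplyr)

# Same example data as above
df <- data.frame(
    class1 = rep(c(1, 3, 4), c(4, 3, 1)),
    class2 = rep(c(1, 2, 3, 4), c(3, 1, 2, 2))
)

res <- df %>%
    group_by(class1) %>%
    summarise(
        n_class1 = n(),
        # mean() of a logical vector is the proportion of TRUE values
        share_match = mean(class1 == class2)
    )

res
```

Multiply `share_match` by 100 if an actual percentage is wanted instead of a fraction.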

edited Apr 08 '17 at 11:40

answered Apr 08 '17 at 11:26

mt1022


no need for the mutate step, just `df %>% group_by(Class1) %>% summarise(n = n(), perc = sum(Class1 == Class2)/n)` gives the same result – Jaap Apr 08 '17 at 11:34
@Jaap, that's better. I would like to add this to the answer. – mt1022 Apr 08 '17 at 11:38
Could I also display in an additional column how many different class 2 values are connected to each class 1? – staanR Apr 08 '17 at 12:50
@staanR, you can try adding `n_class2_in_class1 = sum(class1 == class2)` to `summarise` – mt1022 Apr 08 '17 at 12:56
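
On the follow-up in the comments: `sum(class1 == class2)` counts the rows where the two columns match. If "how many different class 2" instead means the number of distinct `class2` values within each `class1` group (my reading, not confirmed by the asker), a sketch with dplyr's `n_distinct()` could look like this. The column names `n_match` and `n_distinct_class2` are chosen for illustration.

```r
library(dplyr)

# Same example data as in the answer
df <- data.frame(
    class1 = rep(c(1, 3, 4), c(4, 3, 1)),
    class2 = rep(c(1, 2, 3, 4), c(3, 1, 2, 2))
)

res <- df %>%
    group_by(class1) %>%
    summarise(
        n_class1 = n(),
        # rows where class2 equals class1
        n_match = sum(class1 == class2),
        # number of different class2 values seen within this class1 group
        n_distinct_class2 = n_distinct(class2)
    )

res
```

With the example data, class1 = 1 has three matching rows but two distinct `class2` values (1 and 2), so the two columns answer different questions.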

score 1 · Answer 2 · edited May 23 '17 at 11:54

This question was already asked as part of a larger question the OP posted earlier, where it was answered using data.table.

Read data

library(data.table)
cl <- fread(
  "id   Class1    Class2
  1   1           1
  2   1           1
  3   1           1
  4   1           2
  5   3           3
  6   3           3
  7   3           4
  8   4           4"
)

Aggregate

cl[, .(.N, share_of_occurence_in_Class2 = sum(Class1 == Class2)/.N), by = Class1]
#   Class1 N share_of_occurence_in_Class2
#1:      1 4                    0.7500000
#2:      3 3                    0.6666667
#3:      4 1                    1.0000000
