In R, how do I easily return the proportion of common unique elements between combinations of categories?

Question

I have large categorical datasets, and want to quickly generate reciprocal tables that compare the proportion of common elements between categories.

Example of input:

df <- structure(list(YEAR = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 
1L), .Label = c("2013", "2014", "2015", "2016", "2017"), class = "factor"), 
    Entry_Number_F = c(3170L, 3182L, 3169L, 3178L, 3180L, 3181L, 
    3097L, 3168L, 3164L, 3179L, 3171L, 3169L, 3170L, 3178L, 3097L, 
    3177L, 3168L, 3164L, 3179L, 3097L, 3164L, 3168L, 3169L, 3170L, 
    3171L, 3172L, 3173L, 3174L, 3175L, 3176L, 3097L, 3164L, 3168L, 
    3169L, 3170L, 3097L, 3156L, 3168L, 3169L, 3170L)), .Names = c("YEAR", 
"Entry_Number_F"), row.names = c(181L, 182L, 183L, 184L, 186L, 
196L, 199L, 202L, 204L, 768L, 3213L, 3948L, 3950L, 3954L, 3957L, 
3958L, 3963L, 3964L, 3969L, 7836L, 7837L, 7838L, 7839L, 7840L, 
7841L, 7842L, 7843L, 7844L, 7845L, 7846L, 10785L, 10786L, 10787L, 
10788L, 10789L, 13679L, 13680L, 13681L, 13682L, 13683L), class = "data.frame")

Desired output:

structure(list(X2017 = c(1, 0.6666667, 0.4888889, 0.4888889, 
0.4888889), X2016 = c(0.7317073, 1, 0.6585366, 0.6341463, 0.6097561
), X2015 = c(0.44, 0.54, 1, 0.68, 0.58), X2014 = c(0.468, 0.553, 
0.723, 1, 0.702), X2013 = c(0.423, 0.481, 0.557, 0.6346, 1)), .Names = c("X2017", 
"X2016", "X2015", "X2014", "X2013"), class = "data.frame", row.names = c(2017L, 
2016L, 2015L, 2014L, 2013L))

I can get the answer I want in a kludgey way, as follows:

unique(df$YEAR)

Subset these years.

Year1 <- "2017"
Year2 <- "2016"
Year3 <- "2015"
Year4 <- "2014"
Year5 <- "2013"

Subset each of the years.

df.2017 <- droplevels(subset(df, YEAR=="2017"))
df.2016 <- droplevels(subset(df, YEAR=="2016"))
df.2015 <- droplevels(subset(df, YEAR=="2015"))
df.2014 <- droplevels(subset(df, YEAR=="2014"))
df.2013 <- droplevels(subset(df, YEAR=="2013"))

Find the proportion of common entries between years.

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2017$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2016$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2015$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2015$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2014$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

length(Reduce(intersect, 
              list(df.2017$Entry_Number_F, 
                   df.2013$Entry_Number_F
              )))/length(unique(df.2017$Entry_Number_F))

This obviously isn't ideal. I've experimented with some things, such as the code below, and something similar to the link here example, but it doesn't give me the reciprocal table in a simple number of steps.

x <- df %>% group_by(YEAR) %>% mutate(count = n_distinct(Entry_Number_F))

x <- aggregate(Entry_Number_F ~ YEAR, Data.input, function(x) unique(x))

Can any one suggest a straight forward way of doing this?

Many thanks!

Edit:

I appreciate the quick feedback. Unfortunately, neither of the following suggestions returns the correct answer, I don’t think. This could be because I didn’t make my question clear enough and/or that my example was not sufficient.

This dataset is more comparable to my actual data, although for clarity I have dropped a number of columns.

df2 <- structure(list(YEAR = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L), .Label = c("2013", "2014", "2015", "2016", 
"2017"), class = "factor"), Entry_Number_F = c(3170L, 3182L, 
3169L, 3178L, 3169L, 3180L, 3180L, 3170L, 3182L, 3178L, 3170L, 
3180L, 3178L, 3169L, 3182L, 3181L, 3181L, 3181L, 3097L, 3097L, 
3097L, 3168L, 3168L, 3164L, 3168L, 3164L, 3164L, 3097L, 3164L, 
3168L, 3169L, 3170L, 3178L, 3180L, 3181L, 3182L, 3180L, 3182L, 
3180L, 3182L, 3169L, 3178L, 3170L, 3182L, 3169L, 3097L, 3178L, 
3170L, 3181L, 3169L, 3170L, 3178L, 3181L, 3180L, 3181L, 3168L, 
3164L, 3097L, 3097L, 3164L, 3168L, 3168L, 3164L, 3097L, 3164L, 
3168L, 3169L, 3170L, 3178L, 3180L, 3181L, 3182L, 3182L, 3182L, 
3178L, 3168L, 3170L, 3170L, 3178L, 3169L, 3181L, 3169L, 3097L, 
3168L, 3182L, 3164L, 3097L, 3178L, 3169L, 3181L, 3180L, 3164L, 
3181L, 3164L, 3097L, 3168L, 3180L, 3170L, 3180L, 3097L, 3164L, 
3168L, 3169L, 3170L, 3178L, 3180L, 3181L, 3182L, 3170L, 3181L, 
3182L, 3169L, 3180L, 3169L, 3169L, 3182L, 3170L, 3180L, 3182L, 
3180L, 3170L, 3168L, 3181L, 3178L, 3097L, 3178L, 3168L, 3178L, 
3164L, 3097L, 3097L, 3181L, 3164L, 3168L, 3164L, 3097L, 3164L, 
3168L, 3169L, 3170L, 3178L, 3180L, 3181L, 3182L, 3170L, 3170L, 
3169L, 3169L, 3164L, 3178L, 3178L, 3170L, 3097L, 3168L, 3177L, 
3097L, 3178L, 3164L, 3168L, 3097L, 3164L, 3168L, 3169L, 3177L, 
3179L, 3177L, 3179L, 3179L, 3097L, 3164L, 3168L, 3169L, 3170L, 
3177L, 3178L, 3179L, 3169L, 3169L, 3178L, 3178L, 3170L, 3170L, 
3170L, 3170L, 3178L, 3177L, 3169L, 3168L, 3168L, 3178L, 3097L, 
3168L, 3177L, 3177L, 3164L, 3097L, 3097L, 3164L, 3177L, 3097L, 
3164L, 3164L, 3179L, 3179L, 3179L, 3168L, 3169L, 3179L, 3170L, 
3169L, 3170L, 3178L, 3169L, 3178L, 3169L, 3178L, 3097L, 3170L, 
3097L, 3097L, 3177L, 3177L, 3177L, 3164L, 3164L, 3168L, 3168L, 
3164L, 3168L, 3179L, 3179L, 3179L, 3097L, 3164L, 3168L, 3169L, 
3170L, 3177L, 3178L, 3179L, 3169L, 3178L, 3170L, 3169L, 3097L, 
3178L, 3170L, 3170L, 3169L, 3178L, 3168L, 3164L, 3177L, 3177L, 
3097L, 3168L, 3168L, 3097L, 3177L, 3164L, 3164L, 3097L, 3164L, 
3168L, 3169L, 3170L, 3177L, 3178L, 3179L)), .Names = c("YEAR", 
"Entry_Number_F"), class = "data.frame", row.names = c(181L, 
182L, 183L, 184L, 185L, 186L, 187L, 188L, 189L, 190L, 191L, 192L, 
193L, 194L, 195L, 196L, 197L, 198L, 199L, 200L, 201L, 202L, 203L, 
204L, 205L, 206L, 207L, 208L, 209L, 210L, 211L, 212L, 213L, 214L, 
215L, 216L, 552L, 553L, 554L, 555L, 556L, 557L, 558L, 559L, 560L, 
561L, 562L, 563L, 564L, 565L, 566L, 567L, 568L, 569L, 570L, 571L, 
572L, 573L, 574L, 575L, 576L, 577L, 578L, 579L, 580L, 581L, 582L, 
583L, 584L, 585L, 586L, 587L, 984L, 985L, 986L, 987L, 988L, 989L, 
990L, 991L, 992L, 993L, 994L, 995L, 996L, 997L, 998L, 999L, 1000L, 
1001L, 1002L, 1003L, 1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 
1010L, 1011L, 1012L, 1013L, 1014L, 1015L, 1016L, 1017L, 1018L, 
1019L, 1357L, 1358L, 1359L, 1360L, 1361L, 1362L, 1363L, 1364L, 
1365L, 1366L, 1367L, 1368L, 1369L, 1370L, 1371L, 1372L, 1373L, 
1374L, 1375L, 1376L, 1377L, 1378L, 1379L, 1380L, 1381L, 1382L, 
1383L, 1384L, 1385L, 1386L, 1387L, 1388L, 1389L, 1390L, 1391L, 
1392L, 4139L, 4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L, 
4147L, 4148L, 4149L, 4150L, 4151L, 4152L, 4153L, 4154L, 4155L, 
4156L, 4157L, 4158L, 4159L, 4160L, 4161L, 4162L, 4163L, 4164L, 
4165L, 4166L, 4167L, 4168L, 4169L, 4170L, 4444L, 4445L, 4446L, 
4447L, 4448L, 4449L, 4450L, 4451L, 4452L, 4453L, 4454L, 4455L, 
4456L, 4457L, 4458L, 4459L, 4460L, 4461L, 4462L, 4463L, 4464L, 
4465L, 4466L, 4467L, 4468L, 4469L, 4470L, 4471L, 4472L, 4473L, 
4474L, 4475L, 4968L, 4969L, 4970L, 4971L, 4972L, 4973L, 4974L, 
4975L, 4976L, 4977L, 4978L, 4979L, 4980L, 4981L, 4982L, 4983L, 
4984L, 4985L, 4986L, 4987L, 4988L, 4989L, 4990L, 4991L, 4992L, 
4993L, 4994L, 4995L, 4996L, 4997L, 4998L, 4999L, 5409L, 5410L, 
5411L, 5412L, 5413L, 5414L, 5415L, 5416L, 5417L, 5418L, 5419L, 
5420L, 5421L, 5422L, 5423L, 5424L, 5425L, 5426L, 5427L, 5428L, 
5429L, 5430L, 5431L, 5432L, 5433L, 5434L, 5435L, 5436L, 5437L
))

We now see the unique entries in each of the subsets.

unique(df2.2017$Entry_Number_F)
unique(df2.2016$Entry_Number_F)

The above shows that:

2017 has 9 unique entries 2016 has 8 unique entries They share 6 unique entries. Therefore: 6/11 = 0.6666667

The following returns this number as well.

length(Reduce(intersect, list(df2.2017$Entry_Number_F, unique(df2.2016$Entry_Number_F))))/length(unique(df2.2017$Entry_Number_F))

Does this make my question clearer? I can't provide more data due to space limitation, but if I had more years how would I compare all years with each other?

The example here, using crossprod(table(stack(l))) seems like a possibility but I’m not sure how to get the lists of unique entries in each year into lists.

A colleague who teaches R said "expand.grid, outer, and lapply are probably the way to go." but he didn't have time to elaborate.

G. Grothendieck · Accepted Answer · 2018-04-27T17:05:47.143

3

This does not give the same answer as your output but maybe it is really what you want anyways. Use length(y) in place of length(x) if you want the transpose and use s <- rev(s) if you want the years to appear in reverse order.

prop <- function(x, y) length(intersect(x, y)) / length(x)
s <- with(unique(df), split(Entry_Number_F, YEAR))
outer(s, s, Vectorize(prop))

giving:

          2013      2014      2015      2016      2017
2013 1.0000000 0.8000000 0.8000000 0.8000000 0.8000000
2014 0.8000000 1.0000000 1.0000000 1.0000000 1.0000000
2015 0.3636364 0.4545455 1.0000000 0.4545455 0.5454545
2016 0.5000000 0.6250000 0.6250000 1.0000000 0.8750000
2017 0.3636364 0.4545455 0.5454545 0.6363636 1.0000000

edited Apr 27 '18 at 17:05

answered Apr 27 '18 at 16:53

G. Grothendieck

254,981
17
203
341

Actually, this did it. – Nic George Apr 28 '18 at 00:00
prop <- function(x, y) length(Reduce(intersect, list(x, y)))/length(unique(x)) s <- with(unique(Data.input), split(Entry_Number_F, YEAR)) outer(s, s, Vectorize(prop)) – Nic George Apr 28 '18 at 00:00

score 0 · Answer 2 · answered Apr 27 '18 at 16:57

I'm also not sure if your desired output is what you really want, but here's a solution that uses dplyr and tidyr:

left_join(df, df, by = "Entry_Number_F") %>% 
    group_by(YEAR.x, YEAR.y) %>% 
    summarise(newcol = n() / sum(YEAR.x[1] == df$YEAR)) %>% 
    spread(YEAR.x, newcol)

# A tibble: 5 x 6
#   YEAR.y `2013` `2014` `2015` `2016` `2017`
#   <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1 2013    1.00   0.800  0.364  0.500  0.364
# 2 2014    0.800  1.00   0.455  0.625  0.455
# 3 2015    0.800  1.00   1.00   0.625  0.545
# 4 2016    0.800  1.00   0.455  1.00   0.636
# 5 2017    0.800  1.00   0.545  0.875  1.00

In R, how do I easily return the proportion of common unique elements between combinations of categories?

2 Answers2