What is the fastest way to reduce elements of a list by frequency?

Question

Suppose that I have a list similar to this one:

set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})

[[1]]
[1] "b"

[[2]]
[1] "d" "c"

[[3]]
[1] "b" "a" "a"

[[4]]
[1] "d" "d" "b" "c"

[[5]]
[1] "d" "d" "c" "c" "b"

[[6]]
[1] "b" "d" "b" "d" "c" "c"

[[7]]
[1] "a" "b" "d" "d" "b" "a" "d"

I would like to have vectors of length one given by the element of higher frequency in the list. Notice that is possible to have vectors of length > 1 if there are no duplicates. The frequency table is like this:

table(unlist(out))[order(table(unlist(out)), decreasing = T)]

 b  c  d  a 
16 14 13 12

The outcome of the example is something like this:

list("b", "c", "b", "b", "b", "b", "b")

REMARK It is possible to have vectors of length > 1 if there are no duplicates.

out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
length(out)
[1] 10
out[[length(out)+1]] <- c("L", "K")
out
[[1]]
[1] "c"

[[2]]
[1] "d" "a"

[[3]]
[1] "c" "b" "a"

[[4]]
[1] "b" "c" "b" "c"

[[5]]
[1] "a" "a" "d" "c" "d"

[[6]]
[1] "d" "b" "d" "d" "d" "a"

[[7]]
[1] "d" "b" "c" "c" "d" "c" "a"

[[8]]
[1] "d" "a" "d" "b" "d" "a" "b" "d"

[[9]]
[1] "a" "b" "b" "b" "c" "c" "a" "c" "d"

[[10]]
 [1] "d" "d" "d" "a" "d" "d" "c" "c" "a" "c"

[[11]]
[1] "L" "K"

Expected outcome:

list("c", "d", "c", "c", "d", "d", "d", "d", "d", "d", c("L", "K"))

And to make it reproducible so that everyone trying to solve can work on the same set of data and compare solutions. — d.b, Feb 21 '17 at 17:55
Shouldn't you have 7 elements in the output vector since you have 7 elements in the list? — Sandipan Dey, Feb 21 '17 at 18:00
You are absolutely right, the expected outcome should have the same length as the input list. I corrected that, it was a typo, sorry. — Mario GS, Feb 21 '17 at 18:06
Arguable dupe of http://stackoverflow.com/questions/2547402/is-there-a-built-in-function-for-finding-the-mode — Frank, Feb 21 '17 at 19:24
@Frank. I think it's more complicated. The OP wants to 1 Calculate the frequency of distinct elements across all list items (which includes the mode) 2 Find the element within each list item that has the highest frequency count calculated by 1. 3 also take account of ties. — lmo, Feb 21 '17 at 19:34
@lmo It looks like `lapply(List, Mode, ties="all")` for some mode function with that option. To me, `lapply(List, f)` is not really a different q from how to make `f`. — Frank, Feb 21 '17 at 19:35
@Frank But it's not the mode of each individual list item (vector) that is relevant, but rather the frequency ranking (mode) of the entire unlisted vector, `unlist(List)` in your example. Then, values are pulled out of each list item based on this ranking. — lmo, Feb 21 '17 at 19:40
@lmo Ok. I guess it's more complicated/hard to understand than I thought. — Frank, Feb 21 '17 at 19:42

lmo · Accepted Answer · 2017-02-21T19:19:49.183

I believe that this should work for what you are looking for.

# get counts for entire list and order them
myRanks <- sort(table(unlist(out)), decreasing=TRUE)

This produces

myRanks

 b  c  d  a 
10  9  5  4 


# calculate if most popular, then second most popular, ... item shows up for each list item
sapply(out, function(i) names(myRanks)[min(match(i, names(myRanks)))])
[1] "b" "b" "b" "c" "b" "b" "b"

Here, sapply runs through each list item and returns a vector. It applies a function that selects the name of the first element (via min) of the myRanks table that appears in the list element, using match.

In the case of multiple elements having the same count (duplicates) in the myRanks table, the following code should to return a list of the top observations per list item:

sapply(out,
     function(i) {
       intersect(names(myRanks)[myRanks == max(unique(myRanks[match(i, names(myRanks))]))],
                 i)})

Here, the names of myRanks that have the same value as the value in the list item with the highest value in myRanks are intersected with the names present in the list item in order to only return values in both sets.

I really like your answer is clean, but could you please read my remark. — Mario GS, Feb 21 '17 at 18:39
I managed to do it like this too: myRanks <- sort(table(unlist(out)), decreasing=TRUE) sapply(out, function(i){inter <- myRanks [intersect(i, names(myRanks))] names(which(inter==max(inter))) } ) — Mario GS, Feb 21 '17 at 19:25

Sandipan Dey · Answer 2 · 2017-02-22T05:41:35.610

This should work:

set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
out
#[[1]]
#[1] "b"

#[[2]]
#[1] "c" "b"

#[[3]]
#[1] "b" "b" "b"

#[[4]]
#[1] "d" "c" "c" "d"

#[[5]]
#[1] "d" "b" "a" "a" "c"

#[[6]]
#[1] "a" "b" "c" "b" "c" "c"

#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"

tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = T)]
sapply(out, function(x) intersect(names(tbl), x)[1])
# [1] "b" "b" "b" "c" "b" "b" "b"

[EDIT]

set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
out[[length(out)+1]] <- c("L", "K")
out
#[[1]]
#[1] "b"

#[[2]]
#[1] "c" "b"

#[[3]]
#[1] "b" "b" "b"

#[[4]]
#[1] "d" "c" "c" "d"

#[[5]]
#[1] "d" "b" "a" "a" "c"

#[[6]]
#[1] "a" "b" "c" "b" "c" "c"

#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"

#[[8]]
#[1] "L" "K"

tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = T)]

#tbl
#b  c  d  a  K  L 
#10  9  5  4  1  1 

lapply(out, function(x) names(tbl[tbl==max(tbl[names(tbl) %in%  intersect(names(tbl), x)])]))

#[[1]]
#[1] "b"

#[[2]]
#[1] "b"

#[[3]]
#[1] "b"

#[[4]]
#[1] "c"

#[[5]]
#[1] "b"

#[[6]]
#[1] "b"

#[[7]]
#[1] "b"

#[[8]]
#[1] "K" "L"

TX, but this is not what I meant. The frequency has to be calculated with all the elements of the list, not single elements. Just the way I computed it in the example. — Mario GS, Feb 21 '17 at 18:10
This is almost what I expected, could you please read my remark? — Mario GS, Feb 21 '17 at 18:38

What is the fastest way to reduce elements of a list by frequency?

2 Answers2