Continuing from docendo discimus's answer:
library(dplyr)
# library(tidyr)
df %>%
  count(a, b) %>%
  group_by(a) %>%
  filter(n == max(n)) %>%
  mutate(r = row_number()) %>%
  tidyr::spread(r, b) %>%
  select(-n)
# # A tibble: 3 x 3
# # Groups: a [3]
# a `1` `2`
# <fct> <fct> <fct>
# 1 1 A <NA>
# 2 2 B <NA>
# 3 3 A B
And then you just need to rename the columns.
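For the renaming, one base R option is to prefix the generated column names. This is only a sketch: it assumes the spread result is stored in a hypothetical object res (rebuilt by hand here so the snippet runs on its own) and that a "b" prefix is the naming you want.

```r
# Hand-built stand-in for the spread() result above (hypothetical object `res`).
res <- data.frame(
  a = factor(1:3),
  `1` = c("A", "B", "A"),
  `2` = c(NA, NA, "B"),
  check.names = FALSE,       # keep the non-syntactic names `1` and `2`
  stringsAsFactors = FALSE
)

# Prefix every column except `a`; the "b" prefix is an arbitrary choice.
names(res)[-1] <- paste0("b", names(res)[-1])
names(res)
```

The same names<- assignment should also work on the tibble returned by the pipeline.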
Base R variant:
reshape(do.call(rbind.data.frame, by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})), timevar = "r", idvar = "a", direction = "wide")
# a b.1 b.2
# 1 1 A <NA>
# 2 2 B <NA>
# 3.1 3 A B
I'll break it down, since not all of it may be intuitive:
The by function returns a list (specially formatted, but still just a list). Let's explore what happens for a single value of a. I'll skip to a == "3", since that's the one with repeats:
by(df, df$a, function(x) { browser(); 1; })
# Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
debug at #1: [1] 1
# Browse[2]>
Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
debug at #1: [1] 1
# Browse[2]>
Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
debug at #1: [1] 1
# Browse[2]>
x
# a b
# 3 3 A
# 6 3 B
# 9 3 A
# 12 3 B
# Browse[2]>
( tb <- table(x$b) )
# A B
# 2 2
Alright, so we now have the count for each value of b. There could easily have been more values here, say:
# A B C
# 2 2 1
so I'm going to reduce this named vector to just those with the highest value:
# Browse[2]>
( tb <- tb[ tb == max(tb) ] ) # no change here, but had there been a third value in 'b' ...
# A B
# 2 2
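To see the filter actually drop something, here is a small sketch with a made-up vector that has a third, less frequent value (b_vals and tb_max are hypothetical names, not from the question's data):

```r
b_vals <- c("A", "B", "A", "B", "C")   # hypothetical values: A and B tie, C is rarer
tb <- table(b_vals)
tb_max <- tb[ tb == max(tb) ]          # keep only the most frequent value(s)
names(tb_max)                          # the values of b that survive the filter
```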
Lastly, we want by to capture a data.frame (that we can later combine). Within each group, a is a single value (possibly repeated), so x$a[1] suffices; names(tb) holds all the "interesting" values of b; and r is a helper column for reshape, later:
# Browse[2]>
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
# a b r
# 1 3 A 1
# 2 3 B 2
Now that we've explored the internals, let's wrap that up.
by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})
# df$a: 1
# a b r
# 1 1 A 1
# ------------------------------------------------------------
# df$a: 2
# a b r
# 1 2 B 1
# ------------------------------------------------------------
# df$a: 3
# a b r
# 1 3 A 1
# 2 3 B 2
This looks awkward, but if you look under the hood (with dput), you'll see it's just a re-classed list. We can now combine its elements into a single frame with:
do.call(rbind.data.frame, by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
}))
# a b r
# 1 1 A 1
# 2 2 B 1
# 3.1 3 A 1
# 3.2 3 B 2
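A quick way to check the "just a list" claim, and to tidy the row names that rbind derives from the group names, is sketched below. The data frame here is a made-up stand-in that reproduces the counts shown above, not necessarily the question's actual df, and pieces and combined are names I introduced.

```r
# Hypothetical data, constructed so that a == 3 has a tie between A and B.
df <- data.frame(
  a = factor(rep(1:3, 4)),
  b = factor(c("A", "B", "A",  "A", "B", "B",  "A", "A", "A",  "B", "B", "B"))
)

pieces <- by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})

class(pieces)     # "by"
is.list(pieces)   # TRUE: underneath, one data.frame per value of a

combined <- do.call(rbind.data.frame, pieces)
rownames(combined) <- NULL   # optional: replace "3.1", "3.2" with plain 1..n
combined
```

Resetting the row names is purely cosmetic.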
BTW: in R versions before 4.0.0, both data.frame and rbind.data.frame give you factors by default (from 4.0.0 on, stringsAsFactors defaults to FALSE). If you don't want them, then:
do.call(rbind.data.frame, c(by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb),
             stringsAsFactors = FALSE)
}), stringsAsFactors=FALSE))
# a b r
# 1 1 A 1
# 2 2 B 1
# 3.1 3 A 1
# 3.2 3 B 2
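To confirm which behaviour you are getting, you can inspect the column classes directly. A small sketch with a throwaway one-column frame (not the question's data), setting the argument explicitly so the result does not depend on the R version:

```r
# Explicitly request each behaviour instead of relying on the session default.
with_factors    <- data.frame(b = c("A", "B"), stringsAsFactors = TRUE)
without_factors <- data.frame(b = c("A", "B"), stringsAsFactors = FALSE)

class(with_factors$b)      # "factor"
class(without_factors$b)   # "character"
```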
And then the reshaping. I admit that this is the most fragile part of it (at least for me). I'm not a reshape user; I tend towards tidyr::spread or data.table::dcast, but this is base R and works for now. The use of reshape is a tutorial in and of itself, so I won't go into it here. There are numerous attempts to provide more user-friendly reshaping tools out there (reshape2, tidyr, and data.table all come to mind, but are unlikely to be the only ones).
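Without attempting a full tutorial, here is a minimal sketch of what the reshape call above does, using a hand-typed copy of the combined frame (the object names long and wide are mine):

```r
# Long form: one row per (a, b) pair; r numbers the tied values within each a.
long <- data.frame(
  a = c(1, 2, 3, 3),
  b = c("A", "B", "A", "B"),
  r = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# idvar identifies the output rows; timevar supplies the column suffixes,
# so b becomes b.1, b.2, ... with NA where a group has no r-th value.
wide <- reshape(long, timevar = "r", idvar = "a", direction = "wide")
names(wide)
```

Because r restarts at 1 within each value of a, the number of b.* columns equals the largest number of tied values in any group.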