1

I have a data set where I would like to return the most frequent entry in column b, for each unique entry of a, where both columns are character vectors. In the event where two entries in b are equally frequent for a unique entry in a, I would like to return both entries of b, in separate columns (desired output below).

This is similar to a question asked here, however the answers to that question all use tidyverse. I am tidyverse averse, as tidyverse objects break other things in my project. Looking for a base R solution (and looking to avoid a conversation about the merits of tidyverse).

My data look like this:

a <- as.character(c(rep(1:3,4)))
b <- c("A","A","A",
       "B","B","B",
       "A","B","A",
       "A","B","B")
df <- data.frame(a,b)

 a b
 1 A
 2 A
 3 A
 1 B
 2 B
 3 B
 1 A
 2 B
 3 A
 1 A
 2 B
 3 B

desired output:

 group match_1 match_2
     1       A    <NA>
     2       B    <NA>
     3       A       B
colin
  • 2,606
  • 4
  • 27
  • 57
  • "the answers to that question all use tidyverse": `by` is `base`. See also [Find most frequent combination of values in a data.frame](https://stackoverflow.com/questions/18570149/find-most-frequent-combination-of-values-in-a-data-frame?rq=1) – Henrik Nov 16 '18 at 16:36
  • @Henrik I am familiar with the `by` solution in the other question. However, even within that answer the responder admits its challenging to get the output of the `by` solution to look anything like the desired output. If you could modify that output to look like the desired output above, that would be helpful. – colin Nov 16 '18 at 16:37
  • Do you always have only two distinct values in column `b`? Otherwise "where two entries in `b` are equally frequent" would be ambiguous... – R Yoda Nov 16 '18 at 16:46
  • @RYoda apologies, can you suggest a re-wording? I really just want to consider the case where entry `3` in column `a` has an equal number of both `A` and `B` entries in column `b`. – colin Nov 16 '18 at 16:47

3 Answers3

3

Continuing from docendo discimus's answer:

library(dplyr)
# library(tidyr)
df %>%
  count(a, b) %>%
  group_by(a) %>%
  filter(n == max(n)) %>%
  mutate(r = row_number()) %>%
  tidyr::spread(r, b) %>%
  select(-n)
# # A tibble: 3 x 3
# # Groups:   a [3]
#   a     `1`   `2`  
#   <fct> <fct> <fct>
# 1 1     A     <NA> 
# 2 2     B     <NA> 
# 3 3     A     B    

And then you just need to rename the columns.

Base R variant:

reshape(do.call(rbind.data.frame, by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})), timevar = "r", idvar = "a", direction = "wide")
#     a b.1  b.2
# 1   1   A <NA>
# 2   2   B <NA>
# 3.1 3   A    B

I'll break it down, since not all of it may be intuitive:

The by function returns a list (specially formatted, but still just a list). If we look at a single instance of a, let's explore what happens. I'll skip to a == "3", since that's the one with repeats:

by(df, df$a, function(x) { browser(); 1; })
# Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]> 
debug at #1: [1] 1
# Browse[2]> 
Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]> 
debug at #1: [1] 1
# Browse[2]> 
Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]> 
debug at #1: [1] 1
# Browse[2]> 
x
#    a b
# 3  3 A
# 6  3 B
# 9  3 A
# 12 3 B
# Browse[2]> 
( tb <- table(x$b) )
# A B 
# 2 2 

Alright, so we now have the count per-b. Realize that there might easily have been more here, say:

# A B C
# 2 2 1

so I'm going to reduce this named vector to just those with the highest value:

# Browse[2]> 
( tb <- tb[ tb == max(tb) ] ) # no change here, but had there been a third value in 'b' ...
# A B 
# 2 2 

Lastly, we want by to capture a data.frame (that we can later combine). We're guaranteed that a is one value potentially repeated, so a[1]; we have ensured that names(tb) has all "interesting" values, and the r is a helper for reshape, later:

# Browse[2]> 
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
#   a b r
# 1 3 A 1
# 2 3 B 2

Now that we explored internally, let's wrap that up.

by(df, df$a, function(x) {
   tb <- table(x$b)
   tb <- tb[ tb == max(tb) ]
   data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})
# df$a: 1
#   a b r
# 1 1 A 1
# ------------------------------------------------------------ 
# df$a: 2
#   a b r
# 1 2 B 1
# ------------------------------------------------------------ 
# df$a: 3
#   a b r
# 1 3 A 1
# 2 3 B 2

This looks awkward, but if you look under the hood (with dput), you'll see it's just a re-classed list. We can now combine them into a single frame with:

do.call(rbind.data.frame, by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
}))
#     a b r
# 1   1 A 1
# 2   2 B 1
# 3.1 3 A 1
# 3.2 3 B 2

BTW: for both data.frame and rbind.data.frame, these are by default giving you factors. If you don't want them, then:

do.call(rbind.data.frame, c(by(df, df$a, function(x) {
  tb <- table(x$b)
  tb <- tb[ tb == max(tb) ]
  data.frame(a = x$a[1], b = names(tb), r = seq_along(tb),
             stringsAsFactors = FALSE)
}), stringsAsFactors=FALSE))
#     a b r
# 1   1 A 1
# 2   2 B 1
# 3.1 3 A 1
# 3.2 3 B 2

And then the reshaping. I admit that this is the most fragile (at least for me) part of it. I'm not a reshape-user, I tend towards tidyr::spread or data.table::dcast, but this is base-R and works for now. The use of reshape is a tutorial in and of itself, so I won't go into it here. There are numerous attempts to provide more-user-friendly reshaping tools out there (reshape2, tidyr, data.table all come to mind up front but are unlikely to be the only ones).

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks for this, however I am specifically looking to avoid using tidyverse solutions. – colin Nov 16 '18 at 16:38
  • Also, when I run the above code it throws this error: `Error in UseMethod("groups") : no applicable method for 'groups' applied to an object of class "character"` – colin Nov 16 '18 at 16:41
  • Oops, first `a` should have been `df`, my copy/paste mistake. Tidyverse: noted, I was just continuing on an answer you accepted earlier. And I'm not arguing that you *should* use tidyverse, I understand your aversion. – r2evans Nov 16 '18 at 16:43
  • Thanks so much. is there a way to convert the output of the above code to a regular data.frame object? – colin Nov 16 '18 at 16:45
  • By simply wrapping it with `as.data.frame`. – Gopala Nov 16 '18 at 16:53
  • 1
    `as.data.frame`. There are generally two differences between a `data.frame` and a `tibble`. (1) `tibble`s don't preserve row names, as can be seen with `mtcars` versus `tbl_df(mtcars)`. (2) `class(mtcars)` is `"data.frame"`, whereas `class(tbl_df(mtcars))` is `c("tbl_df","tbl","data.frame")`, *all other differences* are handled in printing it to the console. So if your data has no row names, the only difference is the addition of `tbl_df` and `tbl` to the class. – r2evans Nov 16 '18 at 16:53
  • Got it, thank you. Going to accept wrapping this answer in `as.data.frame()`, if no other answer written entirely in base returns the desired format. akrun is close below, but the output isn't as desired above. – colin Nov 16 '18 at 16:58
  • 1
    If you move `table` and `seq_along` outside `by`, you don't have to run them for every group. `d <- as.data.frame(table(df))`; `d2 <- do.call(rbind, by(d, d$a, function(x){ data.frame(a = x$a[1], nm = x$b[x$Freq == max(x$Freq)]) }))`; `d2$time <- ave(d2$a, d2$a, FUN = seq_along)`; `reshape(d2, idvar = "a", direction = "wide")` – Henrik Nov 16 '18 at 17:14
  • @Henrik, I like cleaning up the inside of `by`, but does that approach actually save much? – r2evans Nov 16 '18 at 17:18
  • 1
    If speed was an issue, we would probably do this is `data.table` (which doesn't "break" things, which was OP's arguments against `tidyverse`) ;) +1 for a nice, thorough answer. – Henrik Nov 16 '18 at 17:22
  • `data.table` was my first lean (as I'm learning it more these days), but the related previous answer was in `tidyr`, and OP specifically requested base. It took me a while to start to learn and appreciate the power of `data.table`, so I can understand the OP's misgivings. Plus I see package-bloat in many answers and projects that really is not necessary. Thanks! – r2evans Nov 16 '18 at 17:27
2

Another base alternative.

Create contingency table and convert to data frame: as.data.frame(table(df)).

Use ave to select rows with max values by group.

Use ave to create a 'time' variable to "differentiate multiple records from the same group" (see ?reshape).

reshape relevant variables to wide.

d <- as.data.frame(table(df))
d2 <- d[d$Freq == ave(d$Freq, d$a, FUN = max), ]
d2$time <- ave(d2$a, d2$a, FUN = seq_along)
reshape(d2[ , c("a", "b", "time")], idvar = "a", direction = "wide")

#   a b.1  b.2
# 1 1   A <NA>
# 3 3   A    B
# 5 2   B <NA>

If desired, order by 'a'.

Henrik
  • 65,555
  • 14
  • 143
  • 159
0

We can do this in base R

tbl <- table(df)
ifelse(tbl[,1] == tbl[,2], toString(colnames(tbl)), colnames(tbl)[max.col(tbl)])
akrun
  • 874,273
  • 37
  • 540
  • 662