Retrieve the most repeated (x, y) values in two columns in a data frame

Question

I am storing (x, y) values in a dataframe. I want to return the most frequently appearing (x, y) combination.

Here is an example:

> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12

The most common (x, y) value would be (1, 1).

I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.

> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL

How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?

EDIT: c(1, 2) should be considered distinct from c(2, 1).

This may be a duplicate of http://stackoverflow.com/questions/18570149/find-most-frequent-combination-of-values-in-a-data-frame - though there is a wider set of answers to this post already. — Sam Firke, Apr 28 '15 at 13:36
The question is not clear. What happens if a line contains `c(1,2)` and another `c(2,1)` ? Are these two lines pointing to the same couple? — Colonel Beauvel, Apr 28 '15 at 13:37
@ColonelBeauvel `c(1, 2)` should be considered distinct from `c(2, 1)`. — user4605941, Apr 28 '15 at 13:38
@Khashaa I think `table(do.call(paste, xy))` is better, but I already posted it... — David Arenburg, Apr 28 '15 at 13:41

David Arenburg · Answer 1 · 2015-04-28T14:08:34.987

11

Not sure how will the desired output should look like, but here's a possible solution

res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1 
#   2

In order to get the actual values, one could do

res <- do.call(paste, xy) 
xy[which.max(ave(seq(res), res, FUN = length)), ]
#   x y
# 1 1 1

edited Apr 28 '15 at 14:08

answered Apr 28 '15 at 13:33

David Arenburg

91,361
17
137
196

Would you also add to your answer how to retrieve each component of `res[which.max(res)]`? Such as how to get the `x` value and the `y` value. – user4605941 Apr 28 '15 at 13:44
Actually, disregard that edit. I did a mistake. Let me think it thru. – David Arenburg Apr 28 '15 at 13:52
Thanks! `df = xy[which.max(res), ]` stores a data frame in `df`. The `x` and `y` values can be accessed by `df$x` and `df$y`. – user4605941 Apr 28 '15 at 13:52
It's not quiite that simple, because `res` is <= `xy` – David Arenburg Apr 28 '15 at 13:53
Oh, I see now. I'll withdraw the check mark for now. – user4605941 Apr 28 '15 at 14:02
Ok, I've made an edit, but my guess that Matrins answer will be the best one (as always). – David Arenburg Apr 28 '15 at 14:09
1

Okay, I've accepted Martin's answer since it is very, very good (though your answer is very good too). Thank you for your help! – user4605941 Apr 28 '15 at 14:25

Martin Morgan · Accepted Answer · 2015-04-28T17:32:06.080

(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches

res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))

might be simple and effective.)

Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column

x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))

Then encode the combination of columns to unique values

v = x + (max(x) - 1L) * y

Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table() in other answers) to the minimum, without creating character vectors.

If one wanted to most common occurrence in a single dimension, then one could index and tabulate v

tbl = tabulate(match(v, v))

and find the index of the first occurrence of the maximum value(s), e.g.,

df[which.max(tbl),]

Here's a function to do the magic

whichpairmax <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + (max(x) - 1L) * y
    which.max(tabulate(match(v, v)))
}

and a couple of tests

> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1

For an arbitrary data.frame

whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        col = match(col, unique(col))
        v = col + (max(col) - 1L) * match(v, unique(v))
    }
    which.max(tabulate(match(v, v)))
}

I wonder how that would work if one would like to do this for lets say 10 columns at once. — David Arenburg, Apr 28 '15 at 14:44
@DavidArenburg actually I'd be tempted to take your route (I think also by `duplicated.data.frame()`) of pasting the columns together then `tabulate(match(res, res))`, otherwise perhaps creating v by iterating over columns. — Martin Morgan, Apr 28 '15 at 14:54

score 7 · Answer 3 · answered Apr 28 '15 at 13:30

7

Try

library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
#   x y N
#1: 1 1 2

answered Apr 28 '15 at 13:30

akrun

874,273
37
540
662

Looks like you won the speed test. – Carl Witthoft Apr 30 '15 at 13:02

mucio · Answer 4 · 2015-04-28T15:14:55.220

4

t<-table(xy)
which(t == max(t), arr.ind = TRUE)

Update:

As pointed out by David Arenburg, the initial code returned just the index of the values from the table(xy) function. If you need the values and maybe the number of occurrences of the max couple you can try this:

t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

Now I suspect there is better way to write the last three lines of code, but I'm still new to the R world

edited Apr 28 '15 at 15:14

answered Apr 28 '15 at 13:39

mucio

7,014
1
21
33

I think this will give wrong results for lets say `x = c(1, 2, 2, 5, 4, 4, 5) ;y = c(1, 1, 5, 10, 9, 2, 10) ;xy = data.frame(x, y)` – David Arenburg Apr 28 '15 at 13:55
@DavidArenburg It is giving the index `4, 5` of `t`. You are right that the results are not the expected as per the OP's post, but the index is correct, right? – akrun Apr 28 '15 at 15:02
@akrun I don't have time to validate right now, but I trust your judgment :) – David Arenburg Apr 28 '15 at 15:03
indexes are related to the table(xy), they should be correct – mucio Apr 28 '15 at 15:15
Perhaps `data.frame(x=rownames(t)[row(t)], y=colnames(t)[col(t)])[t==max(t),]` – akrun Apr 28 '15 at 15:30

score 3 · Answer 5 · answered Apr 28 '15 at 13:29

3

What about this?

x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)

table(xy)
y
x   1 5 6 9 10 12
1 2 0 0 0  0  0
2 0 1 0 0  0  0
3 0 0 1 0  0  0
4 0 0 0 1  0  0
5 0 0 0 0  1  0
6 0 0 0 0  0  1

answered Apr 28 '15 at 13:29

Jaehyeon Kim

1,328
11
16

score 3 · Answer 6 · answered Apr 28 '15 at 13:31

3

library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
    x y Count
 1: 1 1     2

answered Apr 28 '15 at 13:31

mlegge

6,763
3
40
67

score 1 · Answer 7 · answered Apr 28 '15 at 13:33

1

library(dplyr)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

answered Apr 28 '15 at 13:33

Sam Firke

21,571
9
87
105

Also needed: `library(magrittr)` – Carl Witthoft Apr 29 '15 at 15:00
Loading dplyr should be sufficient for this example; does this not run for you as is? "dplyr only exports %>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly with library(magrittr)." http://blog.rstudio.org/2014/05/21/dplyr-0-2/ – Sam Firke Apr 29 '15 at 15:08
To be honest, I didn't try it. I was unaware that `dplyr` exports the pipe function. Thanks for the info. – Carl Witthoft Apr 29 '15 at 15:12

score 1 · Answer 8 · answered Apr 28 '15 at 13:33

1

With dplyr

library(dplyr)

xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)

answered Apr 28 '15 at 13:33

dimitris_ps

5,849
3
29
55

score 1 · Answer 9 · answered Apr 30 '15 at 13:02

Late to the party, but here's a time test:

x<-sample(1:10,1e5,rep=TRUE)
y<-sample(1:10,1e5,rep=TRUE)


martin  <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + (max(x) - 1L) * y
    which.max(tabulate(match(v, v)))
}
akrun <-function(x,y) {
    library(data.table)
    xy<-data.frame(x,y)
setDT(xy)[, .N,list(x,y)][which.max(N)]
}
mucio <-function(x,y){
    xy<-data.frame(x,y)
    t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

}

sam<-function(x,y){
    library(dplyr)
    xy<-data.frame(x,y)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

}
dimitris<-function(x,y){
    library(dplyr)
xy<-data.frame(x,y)
xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)

}

microbenchmark(martin(x,y),akrun(x,y),mucio(x,y),sam(x,y),dimitris(x,y),times=5)

Unit: milliseconds
           expr       min        lq       mean    median         uq
   martin(x, y) 11.727217 14.246913  41.359218 14.384385  82.639796
    akrun(x, y)  4.426462  4.613420   4.866548  4.892432   5.011406
    mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870
      sam(x, y)  8.356426  8.586212   8.919787  8.586521   8.775792
 dimitris(x, y)  8.618394  8.738228   9.252105  9.063965   9.075298
        max neval cld
  83.797780     5  a 
   5.389018     5  a 
 146.982062     5   b
  10.293983     5  a 
  10.764640     5  a

score 1 · Answer 10 · answered Jun 23 '15 at 21:15

Using sqldf:

library(sqldf)    
sqldf('SELECT x, y 
          FROM xy 
          GROUP BY (x||y) 
          ORDER BY COUNT(*) DESC 
          LIMIT 1')
  x y
1 1 1

If we'd like to show a frequency column, and not just one row (in case there are any ties):

x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)

sqldf('SELECT x, y, COUNT(*) AS freq
      FROM xy 
      GROUP BY (x||y) 
      ORDER BY COUNT(*) DESC')

   x  y freq
1  1  1    2
2 12 12    2
3  2  5    1
4  3  6    1
5  4  9    1

Retrieve the most repeated (x, y) values in two columns in a data frame

10 Answers10