Exchange data.table columns with most prevalent value of columns

Question

I have data

test = data.table(
  a = c(1,1,3,4,5,6), 
  b = c("a", "be", "a", "c", "d", "c"), 
  c = rep(1, 6)
)

I wish to take the unique values of column a, store it in another data.table, and afterwards fill in the remaining columns with the most prevalent values of those remaining columns, such that my resulting data.table would be:

test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)

Column be has equal amounts of "a" and "c", but it doesn't matter which is chosen in those cases.

Attempt so far:

test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]

EDIT: I would preferrably like a generic solution that is compatible with a function where I specify the column to be "uniqued", and the rest of the columns are with the single most prevalent value. Hence my use of lapply and .SD =)

EDIT2: as @MichaelChirico points out, how do we keep the class. With the following data.table some of the solutions does not work, although solution of @chinsoon12 does work:

test = data.table(a = c(1,1,3,4,5,6), 
                  b = c("a", "be", "a", "c", "d", "c"), 
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))

score 3 · Accepted Answer · answered Feb 14 '20 at 22:56

3

Another option:

dtmode <- function(x) x[which.max(rowid(x))]
test[, .(A=unique(A), B=dtmode(B), C=dtmode(C))]

data:

test = data.table(
    A = c(1,1,3,4,5,6), 
    B = c("a", "be", "a", "c", "d", "c"), 
    C = rep(1, 6)
)

answered Feb 14 '20 at 22:56

chinsoon12

25,005
4
25
35

ah, this is very elegant, thank you! I didn't know of rowid() – Berthrand Eros Feb 15 '20 at 11:35

aekiratli · Answer 2 · 2020-02-14T15:42:07.300

2

Not a clean way to do this but it works.

test = data.frame(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))

a = unique(test$a)
b = tail(names(sort(table(test$b))), 1)
c = tail(names(sort(table(test$c))), 1)

test2 = cbind(a,b,c)

Output is like this:

> test2
     a   b   c  
[1,] "1" "c" "1"
[2,] "3" "c" "1"
[3,] "4" "c" "1"
[4,] "5" "c" "1"
[5,] "6" "c" "1"
>

edited Feb 14 '20 at 15:42

answered Feb 14 '20 at 15:33

aekiratli

518
7
18

Thanks for helping! This is unfortunately not what I'm looking for. I do not wish to hardcode anything =) – Berthrand Eros Feb 14 '20 at 15:40
2

@MichaelChirico: Sorry, that was not very clear, I agree. I mean: 1) a solution where variable names are not explicitly written in code, like line 2, 3 and 4. 2) a solution that will work for more columns than I have stated in my example, i.e. I want a data.table solution with .SD + .SDcols – Berthrand Eros Feb 15 '20 at 15:08

MichaelChirico · Answer 3 · 2020-02-15T12:28:31.523

2

@EmreKiratli is very close to what I would do:

test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) as(tail(names(sort(table(x))), 1L), class(x)))
), .SDcols = !'a']

The as(., class(x)) part is because names in R are always character, so we have to convert back to the original class of x.

You might like this better in magrittr form since it's many nested functions:

library(magrittr)
test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) {
    table(x) %>% sort %>% names %>% tail(1L) %>% as(class(x))
  })
), .SDcols = !'a']

edited Feb 15 '20 at 12:28

answered Feb 15 '20 at 12:21

MichaelChirico

33,841
14
113
198

1

Thank you very much! Very good point about the class. None of our codes actually work when we have a date column in data.table *test*. test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6), d = as.Date("2019-01-01")) – Berthrand Eros Feb 15 '20 at 15:03

Berthrand Eros · Answer 4 · 2020-02-14T17:18:26.600

I was able to make an OK solution, but if somebody can do it more elegantly, for example not going through the step of storting a list in refLevel below, please let me know! I'm very interested in learning data.table properly!

#solution:
test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
test2 = unique(test, by="a")
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(test[, c("b", "c")], funPrev)
test2[, c("b", "c") := refLevel]

...and using a function (if anybody see any un-necessary step, please let me know):

genData = function(dt, var_unique, vars_prev){

  data = copy(dt)
  data = unique(data, by = var_unique)

  funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}

  refLevel = lapply(dt[, .SD, .SDcols = vars_prev], funPrev)
  data[, (vars_prev) := refLevel] 

  return(data)
}
test2 = genData(test, "a", c("b", "c"))

score 1 · Answer 5 · answered Feb 17 '20 at 13:18

Here's another variant which one might find less sophisticated, yet more readable. It's essentially chinsoon12's rowid approach generalized for any number of columns. Also the classes are kept.

test = data.table(a = c(1,1,3,4,5,6), 
                  b = c("a", "be", "a", "c", "d", "c"), 
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))
test2 = unique(test, by = "a")
for (col in setdiff(names(test2), "a")) test2[[col]] = test2[[col]][which.max(rowid(test2[[col]]))]

Exchange data.table columns with most prevalent value of columns

5 Answers5