-1

I'm summarizing data in a data.table, group by, where I need to take a single value of a variable in a group. I want this value to be the mode of the group. I think it needs to be mode because usually a group is 8 rows and it will have 2 rows at one value and the other 6 or so rows will be another value.

Here's a simplified example, from this:

key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8

I want this:

key1 2

I was having trouble using the standard mode function provided by base R, so I used this solution here: Most frequent value (mode) by group

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

It worked great on my small test data set, but when I run it on my actual data set (22 million rows) it just runs and runs and runs. All my other data.table operations that are similar work great and really fast, but I'm not using a UDF. This is the structure of my data.table query:

ModeCharacterColumns <- ExposureHistory[,lapply(.SD,Mode), .(Key1=Key1, Key2=Key2, ..., key7=key7, key8=key8), .SDcols=('col1','col2','col3', ..., 'col53')]

So I'm guessing my problem is that my UDF is really slowing things down, does anyone have any suggestions where I can accomplish the same goal but get it done much quicker?

Thank you everyone!

EDIT: Better representation of the data:

DT <- fread("key1A key2A key3A key4A 2 2 4 s
             key1A key2A key3A key4A 2 2 4 s  
             key1A key2A key3A key4A 8 8 8 t
             key1A key2A key3A key4A 2 2 4 s
             key1B key2B key3B key4B 6 6 6 v
             key1B key2B key3B key4B 2 2 5 t
             key1B key2B key3B key4B 2 2 5 v
             key1B key2B key3B key4B 2 2 5 v")

And the desired result:

result <- fread("key1A key2A key3A key4A 2 2 4 s
                 key1B key2B key3B key4B 2 2 5 v")
Community
  • 1
  • 1
Factuary
  • 43
  • 1
  • 8
  • 1
    Based on this code, `,lapply(.SD,Mode), .(Key1=Key1, Key2=Key2, [...]`, your actual `data.table` seems to be much different than the example you provided. Can you include something more representative (i.e. not a two-column object, but a scaled-down version of your real data)? – nrussell Apr 28 '16 at 13:59
  • Hi @nrussel, my main machine's motherboard crashed. Just back up today. A better data sample: `DT <- fread("key1A key2A key3A key4A 2 2 4 s key1A key2A key3A key4A 2 2 4 s key1A key2A key3A key4A 8 8 8 t key1A key2A key3A key4A 2 2 4 s key1B key2B key3B key4B 6 6 6 v key1B key2B key3B key4B 2 2 5 t key1B key2B key3B key4B 2 2 5 v key1B key2B key3B key4B 2 2 5 v")` And this is the expected result: `result <- fread("key1A key2A key3A key4A 2 2 4 s key1B key2B key3B key4B 2 2 5 v")` – Factuary May 03 '16 at 22:39

2 Answers2

5

Try using data.table to tabulate the data:

DT <- fread("key1 8
             key1 2
             key1 2
             key1 8
             key1 2
             key1 2
             key1 2
             key1 8")

setkeyv(
  DT[, .N, by = .(V1, V2)], #tabulate
  c("V1", "N") #sort by N
   )[, .(Mode = V2[.N]), by = V1] #most frequent value by V1
#     V1 Mode
#1: key1    2

You need to consider tie-breaking carefully. I might actually use a for loop to apply this to more value columns, but you'd need to provide a representative reproducible example if you want me to try that.

Edit:

Frank provides one option of doing this for several value columns in a comment:

DT[, lapply(.SD, function(x) setDT(list(x = x))[, .N, by=x][order(-N)][1L, x]), by=V1]

However, I believe this copies every value column, which might slow it down too much.

Roland
  • 127,288
  • 10
  • 191
  • 288
  • 1
    For more cols `DT[, lapply(.SD, function(x) setDT(list(x = x))[, .N, by=x][order(-N)][1L, x]), by=V1]`, but I don't know that this would be faster than `Mode` (as in nrussell's comment). – Frank Apr 28 '16 at 14:51
  • Thanks Roland & Frank, I will try this out, when you say mode here, are we still using the `Mode <- function(x) { ux <- unique(x) ux[which.max(tabulate(match(x, ux)))] }` that was above, or do I need to change that? How do I provide a reproducible example of a data set that is 17 GB? – Factuary Apr 28 '16 at 14:58
  • 1
    @Factuary Roland is just using Mode to label the output. Standard R syntax, y'know: `c(Name = Value)`. To make a reproducible example, think of data that is easily reproduce that illustrates the problem you're facing; we don't need your actual data nor necessarily all of its idiosyncracies. See http://stackoverflow.com/a/28481250/ for advice. – Frank Apr 28 '16 at 15:01
  • I added a more useful dataset and this solution works for multiple Key columns and multiple mode columns, thank you everyone! `ModeCharacterCols2 <- setkeyv(DT[, .N, by = .(V1, V2, V3, V4, V5, V6, V7, V8)], #tabulate c("V1", "N") #sort by N )[, .(Mode = V5[.N], Mode2=V6[.N], Mode3=V7[.N], Mode4=V8[.N]), by = .(V1, V2, V3, V4)]` Now I just need to try it on my read data set. – Factuary May 03 '16 at 23:11
  • When I adjusted my code to your recommendations, using my posted solution on my real data set, I went from 3.5 hours of run time to 11 minutes. Thank you very much, this is exactly what was needed! – Factuary May 04 '16 at 02:24
0

The fastest solution is the function fmode in the collapse package now available on CRAN. It computes a grouped (and optionally weighted) mode in C++, the speed is very satisfying. Syntax:

fmode(x, g = NULL, w = NULL, ...)

where x can be a vector, matrix, data.frame or dplyr grouped tibble, g is a grouping vector or list of grouping vectors, and w is a weight vector. For mixed-type aggregations, the function collap provides a tidy solution. The call

collap(data, ~ id1 + id2, FUN = fmean, catFUN = fmode, ...)

aggregates the data by id1 and id2, applying the mean to all numeric columns in data and the mode to all non-numeric (categorical) columns in data. By default the data is returned with rows sorted and columns in the original order. This solution is faster than data.table, and the collap call can be futher customized.

Sebastian
  • 1,067
  • 7
  • 12