I'm summarizing data in a data.table, group by, where I need to take a single value of a variable in a group. I want this value to be the mode of the group. I think it needs to be mode because usually a group is 8 rows and it will have 2 rows at one value and the other 6 or so rows will be another value.
Here's a simplified example, from this:
key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8
I want this:
key1 2
I was having trouble using the standard mode function provided by base R, so I used this solution here: Most frequent value (mode) by group
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
It worked great on my small test data set, but when I run it on my actual data set (22 million rows) it just runs and runs and runs. All my other data.table operations that are similar work great and really fast, but I'm not using a UDF. This is the structure of my data.table query:
ModeCharacterColumns <- ExposureHistory[,lapply(.SD,Mode), .(Key1=Key1, Key2=Key2, ..., key7=key7, key8=key8), .SDcols=('col1','col2','col3', ..., 'col53')]
So I'm guessing my problem is that my UDF is really slowing things down, does anyone have any suggestions where I can accomplish the same goal but get it done much quicker?
Thank you everyone!
EDIT: Better representation of the data:
DT <- fread("key1A key2A key3A key4A 2 2 4 s
key1A key2A key3A key4A 2 2 4 s
key1A key2A key3A key4A 8 8 8 t
key1A key2A key3A key4A 2 2 4 s
key1B key2B key3B key4B 6 6 6 v
key1B key2B key3B key4B 2 2 5 t
key1B key2B key3B key4B 2 2 5 v
key1B key2B key3B key4B 2 2 5 v")
And the desired result:
result <- fread("key1A key2A key3A key4A 2 2 4 s
key1B key2B key3B key4B 2 2 5 v")