This will be fairly fast:
f1 = function(df, column_name) {
    ## pre-process words
    words = strsplit(df[[column_name]], ",")
    uwords = unlist(words)
    colnames = unique(uwords)
    ## pre-allocate result matrix of 'FALSE' values
    m = matrix(FALSE, nrow(df), length(colnames), dimnames = list(NULL, colnames))
    ## update rows and columns of result matrix containing matches to TRUE
    row = rep(seq_len(nrow(df)), lengths(words))
    col = match(uwords, colnames)
    m[cbind(row, col)] = TRUE
    ## return the final result
    cbind(df, m)
}
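For reference, input of the shape used in the question (a reconstruction from the example rows shown further below, not the exact original object):

df = data.frame(
    items = c("fish, cat, dog", "horse, elephant, dog", "hamster, pig"),
    stringsAsFactors = FALSE
)
res = f1(df, "items")
## note: splitting on "," alone keeps leading spaces in column names such as " cat"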
The trickiest part is that subsetting a matrix by a two-column matrix treats the first column of the two-column matrix as row indexes, and the second column as column indexes. So the rows and columns that you want to set to TRUE are
row = rep(seq_len(nrow(df)), lengths(words))
col = match(uwords, colnames)
and the matrix is updated with
m[ cbind(row, col) ] = TRUE
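For instance, with a small 3 x 3 matrix:

> m = matrix(FALSE, 3, 3, dimnames = list(NULL, c("a", "b", "c")))
> m[cbind(c(1, 2, 3), c(1, 3, 2))] = TRUE
> m
         a     b     c
[1,]  TRUE FALSE FALSE
[2,] FALSE FALSE  TRUE
[3,] FALSE  TRUE FALSE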
There is no iteration (e.g., sapply()), so the match() function is called once rather than nrow(df) times.
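A small illustration of how row and col line up:

> words = strsplit(c("a,b", "b,c", "a"), ",")
> rep(seq_len(3), lengths(words))
[1] 1 1 2 2 3
> match(unlist(words), c("a", "b", "c"))
[1] 1 2 2 3 1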
For 3M rows, I have
> idx = rep(1:3, 1000000)
> df1 = df[idx,]
> system.time(f1(df1, "items"))
user system elapsed
13.304 0.112 13.421
For the other solution (at time of writing) by Christoph:
f0 = function(df, column_name) {
    ## split each row's items into a character vector
    categories_per_row <- strsplit(df[[column_name]], split=",")
    categories <- unique(unlist(categories_per_row))
    ## one %in% test per row: sapply() iterates nrow(df) times
    categoryM <- t(sapply(categories_per_row, function(y) categories %in% y))
    colnames(categoryM) <- categories
    cbind(df, categoryM)
}
and the data.table solution by Uwe (watch out: reference semantics change the value of dt! Also, I don't know how to pass the column name as a function argument; one possibility is sketched after the timings below):
library(data.table)
dt = df1
dt$no = seq_len(nrow(dt))
f2 = function(dt) {
    setDT(dt)
    ## long format: one row per (no, items, word); the words become column 'V1'
    dt1 = dt[, strsplit(items, ","), by = .(no, items)]
    ## cast back to wide format: TRUE where a word occurs in a row
    dt1[, dcast(.SD, no + items ~ V1, function(x) length(x) > 0)]
}
with times
> system.time(res0 <- f0(df1, "items"))
user system elapsed
23.776 0.000 23.786
> system.time(res2 <- f2(dt))
Using 'V1' as value column. Use 'value.var' to override
user system elapsed
45.668 0.072 45.593
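As an aside, the column name could probably be passed to the data.table version with get() and a constructed formula; a sketch of one possible approach (f2a is a hypothetical name, and I haven't profiled it):

f2a = function(dt, column_name) {
    setDT(dt)
    ## get() looks the column up by name; by= also accepts a character vector
    dt1 = dt[, strsplit(get(column_name), ","), by = c("no", column_name)]
    fmla = as.formula(paste("no +", column_name, "~ V1"))
    dt1[, dcast(.SD, fmla, function(x) length(x) > 0)]
}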
About 1/2 the time of f1() is used by strsplit(); stringr::str_split() is about two times faster, but since the pattern used to split is fixed (not a regular expression) it makes sense to use strsplit(fixed = TRUE), which is about 3x faster. Probably some data.table pro will come up with a very fast solution (but then you need to be a data.table pro...).
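To compare the split strategies on the same 3M-row column (a sketch; the relative speeds quoted above will vary by machine):

x = df1$items
system.time(strsplit(x, ","))                ## default: pattern treated as a regular expression
system.time(stringr::str_split(x, ","))      ## ~2x faster
system.time(strsplit(x, ",", fixed = TRUE))  ## ~3x faster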
It's tempting to do things like 'collapse them [words shared by an item] into a list [actually a vector!]', but it will often be more sensible to leave the words in a list
> df1$items = strsplit(df1$items, ",", fixed=TRUE)
> head(df1)
no items
1 1 fish, cat, dog
2 2 horse, elephant, dog
3 3 hamster, pig
4 4 fish, cat, dog
5 5 horse, elephant, dog
6 6 hamster, pig
and save yourself the time / trouble required to re-split. The tidyverse way would be to create an extended version of the table with tidyr::unnest(df1) (or the other approaches in the so-called 'duplicate' question). This would probably cause one to re-think the role of the columns of logicals in subsequent manipulations.
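Alternatively, membership tests work directly on the list column without re-splitting; a sketch (has_dog is a hypothetical column name, and trimws() handles the spaces left by splitting on ","):

df1$has_dog = vapply(df1$items, function(x) "dog" %in% trimws(x), logical(1))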