Let us assume the following is my data table, data:
library(data.table)
data <- setDT(structure(list(col1 = c(1, 2, 3, 4, 5), col2 = c(53, 45, 54,
97, 23), col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo",
"cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da", "ee ee ee ee et eh es er eg"
), col4 = c("aa bb ff ff","aa ff vv rr","dd dd rr gg",
"yy yy rr rr","uu uu uu ee")), .Names = c("col1", "col2", "col3", "col4"),
row.names = c(NA, -5L), class = "data.frame"))
col1 col2 col3                       col4
   1   53 aa aa aa aa ab ad af ae ar aa bb ff ff
   2   45 bb bb bb bb bt by bu bi bo aa ff vv rr
   3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg
   4   97 dd dd dd dd dt dy dj dk da yy yy rr rr
   5   23 ee ee ee ee et eh es er eg uu uu uu ee
col3 contains strings of words, and I need to check, row by row, whether the most frequent word in col3 appears in col4. The output should look as follows:
col1 col2 col3                       col4        most_freq_word_in_cool3 out_col
   1   53 aa aa aa aa ab ad af ae ar aa bb ff ff aa                            1
   2   45 bb bb bb bb bt by bu bi bo aa ff vv rr bb                            0
   3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg cc                            0
   4   97 dd dd dd dd dt dy dj dk da yy yy rr rr dd                            0
   5   23 ee ee ee ee et eh es er eg uu uu uu ee ee                            1
I tried the following solution:
m_fre_word1 <- function(x) {
  string <- as.character(unlist(strsplit(x, " ")))
  freq <- sort(table(string), decreasing = TRUE)
  wr <- names(freq)[1]
  return(wr)
}
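data[, most_freq_word_in_cool3 := apply(data[, .(col3)], 1, m_fre_word1)]
# Compare each row's most frequent word with that row's col4.
# Calling m_fre_word1(col3) on the whole column would return a single word for all rows.
data[, out_col := as.numeric(mapply(grepl, most_freq_word_in_cool3, col4, MoreArgs = list(fixed = TRUE)))]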
There is nothing wrong with this solution, but it is really slow, and my data table is huge, so I can't use it as is. Could somebody suggest a faster alternative?
Thanks,
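One possible direction, given only as a minimal sketch and not benchmarked on large data: split col3 and col4 once for the whole column, then find the most frequent word per row with match() and tabulate() instead of table() and sort(). The helper name most_freq and the intermediate objects words3 and words4 are hypothetical names introduced for illustration. Two behaviors are assumptions that may differ from the original attempt: ties go to the word that appears first in the string, and the check against col4 is an exact word match rather than a substring match.

```r
library(data.table)

# Same sample data as in the question, rebuilt directly as a data.table
data <- data.table(
  col1 = 1:5,
  col2 = c(53, 45, 54, 97, 23),
  col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo",
           "cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
           "ee ee ee ee et eh es er eg"),
  col4 = c("aa bb ff ff", "aa ff vv rr", "dd dd rr gg",
           "yy yy rr rr", "uu uu uu ee")
)

# Hypothetical helper: most frequent element of a character vector.
# Ties are resolved in favor of the word that occurs first.
most_freq <- function(words) {
  u <- unique(words)
  u[which.max(tabulate(match(words, u)))]
}

# Split every string once, outside of any per-row data.table call
words3 <- strsplit(data$col3, " ", fixed = TRUE)
words4 <- strsplit(data$col4, " ", fixed = TRUE)

data[, most_freq_word_in_cool3 := vapply(words3, most_freq, character(1))]

# Exact word membership test, row by row
data[, out_col := as.numeric(mapply(`%in%`, most_freq_word_in_cool3, words4,
                                    USE.NAMES = FALSE))]
```

If substring matching is what is actually wanted, the last step would need grepl() with fixed = TRUE per row instead of %in%.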