
Let us assume the following as my data.table data:

library(data.table)

data <- setDT(structure(list(col1 = c(1, 2, 3, 4, 5), col2 = c(53, 45, 54, 
97, 23), col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
"cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da", "ee ee ee ee et eh es er eg"
), col4 = c("aa bb ff ff", "aa ff vv rr", "dd dd rr gg",
"yy yy rr rr", "uu uu uu ee")), .Names = c("col1", "col2", "col3", "col4"), 
row.names = c(NA, -5L), class = "data.frame"))

col1     col2    col3                            col4
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr
5        23      ee ee ee ee et eh es er eg      uu uu uu ee

col3 has strings of words, and I need to find whether the most frequently occurring word in col3 appears in col4. So the output should look as follows:

col1     col2    col3                            col4           most_freq_word_in_col3    out_col
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff            aa                   1
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr            bb                   0
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg            cc                   0
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr            dd                   0
5        23      ee ee ee ee et eh es er eg      uu uu uu ee            ee                   1

I tried the following solution:

    # split a string into words and return the most frequent one
    m_fre_word1 <- function(x) {
      string <- as.character(unlist(strsplit(x, " ")))
      freq <- sort(table(string), decreasing = TRUE)
      names(freq)[1]
    }

    data <- data[ , most_freq_word_in_col3 := apply(data[ , .(col3)], 1, m_fre_word1)]
    data <- data[ , out_col := as.numeric(mapply(grepl, most_freq_word_in_col3, col4))]

There is nothing wrong with this solution, but it is really slow and my data table is huge, so I can't use it as is. Could somebody suggest a faster alternative?

Thanks,

– user3664020

3 Answers


Here's an attempt. Instead of running the whole thing on each row, I would suggest splitting the column and operating on a long format.

I'm stealing the Mode function from here; it is defined as follows:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
} 
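
As a quick sanity check of Mode (example input assumed, mirroring row 1 of col3):

Mode(strsplit("aa aa aa aa ab ad af ae ar", " ")[[1]])
# [1] "aa"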

Now, using the newest data.table version, we could do:

library(data.table) # v 1.9.6+
temp <- setDT(data)[, tstrsplit(col3, " ", fixed = TRUE)]
data[, res := melt(temp[, indx := .I], id = "indx")[, Mode(value), by = indx]$V1]
data

#    col1 col2                       col3        col4 res
# 1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff  aa
# 2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr  bb
# 3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg  cc
# 4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr  dd
# 5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee  ee
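
To see what the second line does, here it is unrolled step by step (a sketch; V1 is the default name data.table gives the computed column):

temp[, indx := .I]                       # row index to group the words by
long <- melt(temp, id = "indx")          # long format: one row per (row index, word)
modes <- long[, Mode(value), by = indx]  # row-wise mode, in row order
data[, res := modes$V1]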

The second step can be easily achieved by either

data[, out := +grepl(res, col4, fixed = TRUE), by = res]

OR (not sure which one is faster)

library(stringi)
data[stri_detect_fixed(col4, res), out := 1L]
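
Note that this stringi version only assigns out where a match is found, leaving non-matching rows as NA (see Frank's comment below); a small sketch that initializes the column first:

data[, out := 0L]                              # default: mode not found in col4
data[stri_detect_fixed(col4, res), out := 1L]  # flag the matching rows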

As a side note, when using reference semantics there is no need to copy the whole data set and reassign it with <-; in fact, that's the whole point of reference semantics. Please read this.
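
To illustrate (a minimal sketch; new_col is just a made-up column name):

data[, new_col := col2 * 2]            # := updates data in place
# data <- data[, new_col := col2 * 2]  # the <- copy-and-reassign adds nothing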

– David Arenburg
  • Presumably the stringi gives you NAs instead of zeros. Could add a `[, out:=0L]` before it. – Frank Oct 15 '15 at 15:00
  • @Frank I know, I figured that it's not that important, because it is both easy to fix and the OP can just filter by `!is.na(out)` or use `sum(out, na.rm = TRUE)` or something similar. – David Arenburg Oct 15 '15 at 17:13

I think the apply(data[ , .(col3)], 1, ...) call is what is slowing down your code. Generally I have found that putting a subset inside a data.table call causes huge slowdowns, as the subset operation is expensive.
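
For example, the row-wise mode can be computed without that inner subset (a sketch reusing the OP's m_fre_word1; DT is the question's table, as named in the code below):

DT[, test := sapply(col3, m_fre_word1)]  # pass the column vector directly; no apply() over a subset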

You can try:

DT[ , test := names(sort(table(strsplit(col3," ")), decreasing = T))[1], by = col1]
DT[, search := gsub(" ","|",col4)]
DT[, output := grepl(search,test), by = col1]
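
To unpack the gsub trick: it rewrites each col4 string as a regex alternation, and grepl then tests whether the row's most frequent word matches it. Using row 1 as an example:

gsub(" ", "|", "aa bb ff ff")    # "aa|bb|ff|ff"
grepl("aa|bb|ff|ff", "aa")       # TRUE, so output is TRUE for row 1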

Not sure whether my answer or David's will be faster.
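
For reference, the two variants could be compared along these lines (a sketch only; the microbenchmark package is assumed, and results will depend on your data):

library(microbenchmark)
microbenchmark(
  grouped = DT[, output := grepl(gsub(" ", "|", col4), test), by = col1],
  one_go  = DT[, output := mapply(grepl, gsub(" ", "|", col4), test)],
  times = 100L
)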

EDIT: Based on Frank's input, the last two lines can be replaced with:

DT[, output := mapply(grepl,gsub(" ","|",col4),test)]

– Chris
  • Speed might be improved by skipping the `by` (and creation of a `search` col): `DT[, output := mapply(grepl,gsub(" ","|",col4),test)]`. Also, I guess `tabulate` is faster than `table`, if you can figure how to switch. – Frank Oct 15 '15 at 14:27
  • Ran microbenchmark and you are right. Getting to one line is the biggest improvement, and then replacing `by` with `mapply` adds a tiny bit. Edited to reflect. – Chris Oct 15 '15 at 14:35
  • @Chris why do we need `by = col1` in first statement? – user3664020 Oct 15 '15 at 14:48
  • @user3664020 otherwise the table call will run on all elements in col3. You will find the most frequent instance of the entire table vs just the row. – Chris Oct 15 '15 at 14:50
  • @user3664020 If you want to skip the `by` (which should speed it up more), you could do `DT[, test := mapply(Mode, strsplit(col3, " "))]` or `DT[, test := sapply(strsplit(col3, " "), Mode)]` with Mode from David's answer. – Frank Oct 15 '15 at 14:55
  • @Chris I can do `by = col3` also then? Right? – user3664020 Oct 15 '15 at 15:11
  • @user3664020 Yes you can, although depending on how many repeated col3 values you have, it may slow down your code, due to the O(n log n) point made above. – Chris Oct 15 '15 at 16:41

(Edited as per comments below)

Load the library

require(data.table)

Define the data

x <-
  data.table(
    col1 = c(1, 2, 3, 4, 5),
    col2 = c(53, 45, 54, 97, 23),
    col3 = c(
      "aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
      "cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
      "ee ee ee ee et eh es er eg"),
    col4 = c(
      "aa bb ff ff","aa ff vv rr","dd dd rr gg",
      "yy yy rr rr","uu uu uu ee")
         )

Find the most frequent element in col3

x[, most_freq_word_in_col3 := sapply(col3, function(e) {
  names(sort(table(unlist(strsplit(e, " "))), decreasing = TRUE)[1])
})]

Check whether this element is in col4

x[, out_col := apply(cbind(most_freq_word_in_col3, col4), 1, function(e) {
  as.numeric(e[1] %in% unlist(strsplit(e[2], " ")))
})]

Output:

> x
   col1 col2                       col3        col4 most_freq_word_in_col3 out_col
1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff                     aa       1
2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr                     bb       0
3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg                     cc       0
4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr                     dd       0
5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee                     ee       1

– xb.
  • @DavidArenburg The second part is different, `%in%` instead of a regex. – Frank Oct 15 '15 at 14:40
  • @Frank I don't think that running an `apply` with margin of one is either necessary or any improvement – David Arenburg Oct 15 '15 at 14:48
  • I got the idea for finding the most frequent element [here](http://stackoverflow.com/questions/17374651/finding-the-most-common-elements-in-a-vector-in-r), which is indeed similar to Chris'. The discussions between @Frank and Chris to speed up the code are interesting and inspiring. I will give it a go. – xb. Oct 15 '15 at 14:49
  • @DavidArenburg Yeah, it should be `DT[, out_col := mapply("%in%", most_freq_word_in_col3, strsplit(col4, " "))]` or similar, which isn't so bad. Anyway, I'm just saying it's not an exact copy of another answer (which would be worse than being a suboptimal answer). – Frank Oct 15 '15 at 14:52