
Let us assume the following as my data.table data:

library(data.table)

data <- setDT(structure(list(col1 = c(1, 2, 3, 4, 5), col2 = c(53, 45, 54, 
97, 23), col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
"cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da", "ee ee ee ee et eh es er eg"
), col4 = c("aa bb ff ff", "aa ff vv rr", "dd dd rr gg",
"yy yy rr rr", "uu uu uu ee")), .Names = c("col1", "col2", "col3", "col4"), 
row.names = c(NA, -5L), class = "data.frame"))

col1     col2    col3                            col4
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr
5        23      ee ee ee ee et eh es er eg      uu uu uu ee

col3 has strings of words, and I need to find whether the most frequently occurring word in col3 appears in col4. So the output should look as follows:

col1     col2    col3                            col4           most_freq_word_in_col3    out_col
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff            aa                   1
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr            bb                   0
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg            cc                   0
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr            dd                   0
5        23      ee ee ee ee et eh es er eg      uu uu uu ee            ee                   1

I tried the following solution:

    # split a string into words and return the most frequent one
    m_fre_word1 <- function(x) {
      string <- as.character(unlist(strsplit(x, " ")))
      freq <- sort(table(string), decreasing = TRUE)
      names(freq)[1]
    }

    data <- data[ , most_freq_word_in_col3 := apply(data[ , .(col3)], 1, m_fre_word1)]
    data <- data[ , out_col := as.numeric(mapply(grepl, most_freq_word_in_col3, col4))]

There is nothing wrong with this solution, but it is really slow and my data table is huge, so I can't use it as is. Could somebody suggest a faster alternative?

Thanks,

– user3664020

3 Answers


Here's an attempt. Instead of running the whole thing on each row, I would suggest splitting the column and operating on a long format.

I'm stealing the Mode function from here; it is defined as follows:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
} 
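
As a quick sanity check of Mode (example input assumed, mirroring row 1 of col3):

Mode(strsplit("aa aa aa aa ab ad af ae ar", " ")[[1]])
# [1] "aa"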

Now, using the newest data.table version, we could do:

library(data.table) # v 1.9.6+
temp <- setDT(data)[, tstrsplit(col3, " ", fixed = TRUE)]
data[, res := melt(temp[, indx := .I], id = "indx")[, Mode(value), by = indx]$V1]
data

#    col1 col2                       col3        col4 res
# 1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff  aa
# 2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr  bb
# 3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg  cc
# 4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr  dd
# 5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee  ee
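
To see what the second line does, here it is unrolled step by step (a sketch; V1 is the default name data.table gives the computed column):

temp[, indx := .I]                       # row index to group the words by
long <- melt(temp, id = "indx")          # long format: one row per (row index, word)
modes <- long[, Mode(value), by = indx]  # row-wise mode, in row order
data[, res := modes$V1]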

The second step can be easily achieved by either

data[, out := +grepl(res, col4, fixed = TRUE), by = res]

OR (not sure which one is faster)

library(stringi)
data[stri_detect_fixed(col4, res), out := 1L]
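
Note that this stringi version only assigns out where a match is found, leaving non-matching rows as NA (see Frank's comment below); a small sketch that initializes the column first:

data[, out := 0L]                              # default: mode not found in col4
data[stri_detect_fixed(col4, res), out := 1L]  # flag the matching rows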

As a side note, when using reference semantics there is no need to copy the whole data set and reassign it with <-; in fact, that's the whole point of reference semantics. Please read this.
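
To illustrate (a minimal sketch; new_col is just a made-up column name):

data[, new_col := col2 * 2]            # := updates data in place
# data <- data[, new_col := col2 * 2]  # the <- copy-and-reassign adds nothing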

– David Arenburg
  • Presumably the stringi gives you NAs instead of zeros. Could add a `[, out:=0L]` before it. – Frank Oct 15 '15 at 15:00
  • @Frank I know, I figured that it's not that important, because it is both easy to fix and the OP can just filter by `!is.na(out)` or use `sum(out, na.rm = TRUE)` or something similar. – David Arenburg Oct 15 '15 at 17:13

I think the apply(data[ , .(col3)], 1, ...) call is what is slowing down your code. Generally I have found that putting a subset inside a data.table call causes huge slowdowns, as the subset operation is expensive.
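
For example, the row-wise mode can be computed without that inner subset (a sketch reusing the OP's m_fre_word1; DT is the question's table, as named in the code below):

DT[, test := sapply(col3, m_fre_word1)]  # pass the column vector directly; no apply() over a subset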

You can try:

DT[ , test := names(sort(table(strsplit(col3," ")), decreasing = T))[1], by = col1]
DT[, search := gsub(" ","|",col4)]
DT[, output := grepl(search,test), by = col1]
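
To unpack the gsub trick: it rewrites each col4 string as a regex alternation, and grepl then tests whether the row's most frequent word matches it. Using row 1 as an example:

gsub(" ", "|", "aa bb ff ff")    # "aa|bb|ff|ff"
grepl("aa|bb|ff|ff", "aa")       # TRUE, so output is TRUE for row 1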

Not sure whether my answer or David's will be faster.
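
For reference, the two variants could be compared along these lines (a sketch only; the microbenchmark package is assumed, and results will depend on your data):

library(microbenchmark)
microbenchmark(
  grouped = DT[, output := grepl(gsub(" ", "|", col4), test), by = col1],
  one_go  = DT[, output := mapply(grepl, gsub(" ", "|", col4), test)],
  times = 100L
)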

EDIT: Based on Frank's input, the last two lines can be replaced with:

DT[, output := mapply(grepl,gsub(" ","|",col4),test)]

– Chris
  • Speed might be improved by skipping the `by` (and creation of a `search` col): `DT[, output := mapply(grepl,gsub(" ","|",col4),test)]`. Also, I guess `tabulate` is faster than `table`, if you can figure how to switch. – Frank Oct 15 '15 at 14:27
  • Ran microbenchmark and you are right. Getting to one line is the biggest improvement, and then replacing `by` with `mapply` adds a tiny bit. Edited to reflect. – Chris Oct 15 '15 at 14:35
  • @Chris why do we need `by = col1` in first statement? – user3664020 Oct 15 '15 at 14:48
  • @user3664020 otherwise the table call will run on all elements in col3. You will find the most frequent instance of the entire table vs just the row. – Chris Oct 15 '15 at 14:50
  • @user3664020 If you want to skip the `by` (which should speed it up more), you could do `DT[, test := mapply(Mode, strsplit(col3, " "))]` or `DT[, test := sapply(strsplit(col3, " "), Mode)]` with Mode from David's answer. – Frank Oct 15 '15 at 14:55
  • @Chris I can do `by = col3` also then? Right? – user3664020 Oct 15 '15 at 15:11
  • @user3664020 Yes you can, although depending on how many repeated col3 values you have, it may slow down your code, due to the O(n log n) point made above. – Chris Oct 15 '15 at 16:41

(Edited as per comments below)

Load the library

require(data.table)

Define the data

x <-
  data.table(
    col1 = c(1, 2, 3, 4, 5),
    col2 = c(53, 45, 54, 97, 23),
    col3 = c(
      "aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
      "cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
      "ee ee ee ee et eh es er eg"),
    col4 = c(
      "aa bb ff ff","aa ff vv rr","dd dd rr gg",
      "yy yy rr rr","uu uu uu ee")
         )

Find the most frequent element in col3

x[, most_freq_word_in_col3 := sapply(col3, function(e) {
  names(sort(table(unlist(strsplit(e, " "))), decreasing = TRUE)[1])
})]

Check whether this element is in col4

x[, out_col := apply(cbind(most_freq_word_in_col3, col4), 1, function(e) {
  as.numeric(e[1] %in% unlist(strsplit(e[2], " ")))
})]

Output:

> x
   col1 col2                       col3        col4 most_freq_word_in_col3 out_col
1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff                     aa       1
2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr                     bb       0
3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg                     cc       0
4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr                     dd       0
5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee                     ee       1

– xb.
  • @DavidArenburg The second part is different, `%in%` instead of a regex. – Frank Oct 15 '15 at 14:40
  • @Frank I don't think that running an `apply` with margin of one is either necessary or any improvement – David Arenburg Oct 15 '15 at 14:48
  • I got the idea for finding the most frequent element [here](http://stackoverflow.com/questions/17374651/finding-the-most-common-elements-in-a-vector-in-r), which is indeed similar to Chris'. The discussions between @Frank and Chris to speed up the code are interesting and inspiring. I will give it a go. – xb. Oct 15 '15 at 14:49
  • @DavidArenburg Yeah, it should be `DT[, out_col := mapply("%in%", most_freq_word_in_col3, strsplit(col4, " "))]` or similar, which isn't so bad. Anyway, I'm just saying it's not an exact copy of another answer (which would be worse than being a suboptimal answer). – Frank Oct 15 '15 at 14:52